So welcome all to this virtual computational biology seminar series. Today we have the pleasure to have Sylvain Poux from the Swiss-Prot group at the SIB Swiss Institute of Bioinformatics. Sylvain obtained his master's and his PhD in biology at the University of Geneva in Switzerland. Having gained a strong wet-lab research background, he decided to reorient his career to become a biocurator. This is how he joined Swiss-Prot in 2002 and rapidly moved to protein biocuration in all biological fields. From 2006 to 2009, he and his team worked on defining the annotation rules that would ensure the high quality of Swiss-Prot entries. And since 2009, he has been heading the annotation and integration department at Swiss-Prot and is in charge of seven annotation programs. For those who don't yet know what Swiss-Prot is: UniProtKB/Swiss-Prot is the most widely used protein information resource, with over 800,000 requests per month. It provides a concise description of a non-redundant set of hundreds of thousands of proteins, and it is the fruit of high-quality annotation done manually by expert curators. So today, Sylvain will share with us what the art of biocuration is in UniProtKB/Swiss-Prot. Sylvain, thank you again for accepting our invitation, and the floor is yours.
Thank you. So, as was said, today I will describe the procedure of expert biocuration in UniProtKB/Swiss-Prot. But first, why talk about biocuration today? Last month, Geneva hosted the ninth International Biocuration Conference, and this conference, which was organized by the SIB, provided a forum for curators and developers from different biological databases around the world. More than 250 scientists attended the conference. This conference didn't take place in Switzerland by accident, because over the years Switzerland has played a leading role in the biocuration field.
The SIB provides a high number of important resources in terms of biocuration, such as PROSITE, EPD, Bgee, and neXtProt, and the most important of these resources is clearly UniProtKB/Swiss-Prot. This year we are celebrating the 30th anniversary of Swiss-Prot, which was born in Geneva, and Swiss-Prot represents the world's most comprehensive catalogue of information on proteins. It is recognized as a gold-standard reference by the scientific community. In 2002, the SIB Swiss-Prot group in Geneva, together with the European Bioinformatics Institute in the UK and the Protein Information Resource in the US, joined together to found the UniProt Consortium, the Universal Protein Resource consortium. UniProt is composed of different databases for different users. I will not describe these different databases in detail today. I will concentrate on the knowledgebase of UniProt, UniProtKB, which is composed of two sections. The first is TrEMBL, which is automatically annotated and not reviewed by curators, which means that the information contained in TrEMBL entries is either provided by the submitters, or predicted by prediction tools, or imported from other databases and resources. Importantly, more and more TrEMBL entries contain automatic annotation that is derived from rules based on UniProtKB/Swiss-Prot entries. The second is UniProtKB/Swiss-Prot, which is the reviewed section of UniProt, which means that in Swiss-Prot, every piece of information has been reviewed by an expert curator. I will just add several numbers for Swiss-Prot. As mentioned, Swiss-Prot contains more than 550,000 proteins that have been expertly curated. It contains all protein-coding genes for a number of key organisms, like human, Saccharomyces cerevisiae, S. pombe, E. coli, or Bacillus subtilis. It contains information extracted from more than 200,000 publications indexed in PubMed, and around 50 biocurators currently work for the Swiss-Prot database.
Most of them work at the SIB Swiss-Prot group in Geneva, and the Swiss-Prot group in Geneva probably constitutes the largest group of biocurators worldwide. What do biocurators do? They read, read, read. They read hundreds of publications, synthesize the information, and integrate this information into Swiss-Prot entries. Just to give a number, last year we manually and expertly curated more than 8,000 publications indexed in PubMed. Most biocurators have a very strong background in wet-lab research and are experienced biologists and biochemists. What can we find in a Swiss-Prot entry? We can find information about the sequence, which has been expertly checked and verified, and information about the sequence features of the protein, such as the domains, the transmembrane regions, et cetera. It contains a lot of information about its function, subcellular location, interactions, expression, and so on. We attach specific importance to all events that affect the sequence of the protein and that cannot be predicted by tools, such as post-translational modifications or RNA editing events. Okay. Here is the outline of the presentation. I will first describe the expert curation procedure by giving an example, a use case, and then I will discuss the sustainability of expert biocuration in the time of big data. So, to start, I will first describe the annotation of a protein, spastin. So, just to give some background about spastin: spastin is an ATP-dependent microtubule-severing enzyme that cleaves the alpha and beta tubulin, and it thereby plays a key role in all cytoskeleton events and in the regulation of the cytoskeleton. As a consequence, it plays a role in different steps, like the abscission step in cytokinesis or nuclear envelope reassembly during anaphase, in cooperation with the ESCRT-III complex. Two major isoforms exist for spastin, which differ by the presence of a hydrophobic N-terminal sequence.
Spastin also plays an important role in neuronal development, and especially in axon growth and the formation of axonal branches. And mutations in the gene encoding spastin are associated with a major neurodegenerative disease, termed hereditary spastic paraplegia. And this disease, which we call HSP, causes progressive spasticity and weakness of the legs, leading to paraplegia. While the role of spastin as a tubulin-severing enzyme has been known for many years, we are only starting to understand how this process is regulated. And two months ago, a very interesting paper was published about the process that activates the severing activity of spastin. This report shows that spastin specifically acts on microtubules that are polyglutamylated. Polyglutamylation is a post-translational modification in which glutamate side chains are formed on the target proteins. And the length of the glutamate chains can vary. What is very interesting in this case is that polyglutamylation acts as a rheostat: severing activity by spastin increases as the number of glutamates per tubulin rises from 1 to 8, but beyond this threshold the severing activity decreases. What is also very interesting is that polyglutamylation of tubulin has been known for more than 20 years. The enzymes that mediate polyglutamylation and remove it have also been known for more than 10 years. But we didn't know until now what the precise role of polyglutamylation was. And we are only now starting to understand its role in the regulation of the cytoskeleton. So just to summarize, spastin is a tubulin-severing enzyme. It is associated with a neurodegenerative disease, hereditary spastic paraplegia. It specifically acts on microtubules that are decorated with polyglutamate tails, and tubulin polyglutamylation acts as a rheostat.
Then the next step is how to summarize and represent this information in the UniProtKB/Swiss-Prot entry. So this is a snapshot of the spastin entry in human. As you can see, there is a lot of information. So the curators read, extract, and synthesize all the information to give a summary of the protein: function and catalytic activity, subcellular location, and so on. Every annotation field is evidenced to give the source of the information. And every source is associated with an evidence code to indicate which evidence exists for this statement. In this case, it is based on an experiment, but it can also be based on similarity or on a prediction program. In the future, we would like to add more granularity to the evidence and attach the evidence at the end of each sentence, to give more precision to the annotation and more usability and trustworthiness, so that users can really trace from which article a piece of information comes. For the moment, we keep track of this information by adding PubMed identifiers at the end of sentences, but this will be replaced by real evidence tags in the future. This is a case of free-text annotation. Ten years ago, most information in Swiss-Prot entries was presented as free text, but free-text information is not sufficient for a number of users: it cannot be read by a machine, and it is much more difficult to retrieve information from it. So we are moving more and more to structured information, and more and more annotation fields contain structured annotation. For example, the function annotation is also present in a structured format. UniProtKB/Swiss-Prot is part of the GO Consortium, and manual curation of GO terms based on experiments is part of the curation process, which means that when Swiss-Prot curators read and curate an article, they curate in Swiss-Prot entries and in the GO database in order to have both kinds of information.
And over the years, UniProt has become by far the largest contributor to the GO Consortium, as you can see from this snapshot taken from the statistics page on the GO website. We also display sequence features for the protein, and in this case, I mentioned that the N-terminus of the protein contains a hydrophobic region. This is an interesting case because this is predicted to form a transmembrane region, but it doesn't form a transmembrane region because, in fact, it forms an intramembrane hairpin loop that is buried in the membrane but doesn't cross it. And we have a specific annotation, intramembrane, to describe this case. The sequence features correspond pretty well to what is shown in articles. So if you look in the upper part of the slide at a figure taken from a review article, you can see that the regions described in this figure are represented in the entry: the intramembrane region at the N-terminus, the MIT domain, the microtubule-binding region, or the AAA ATPase region, which is required for microtubule severing. We also go for more precision in the information. More and more articles now describe functions that are either product-specific or isoform-specific. When we have such information, we try to store it in specific fields in order to allow its retrieval. In this case, I mentioned that there are two major isoforms of spastin, which differ by the presence of the N-terminal intramembrane region. The longest isoform, which contains the intramembrane region, is specifically involved in lipid metabolism and localizes to specific subcellular locations, which is not the case for the other isoforms; we therefore indicate that in specific fields. We have, of course, annotated the post-translational modifications on the tubulins.
So we describe the polyglutamylation and describe the role of polyglutamylation in the severing of the tubulins. We also describe the other modifications on tubulins, because the tubulins are highly modified. For example, in human, the alpha and beta tubulins are monoglycylated, which is interesting because in other organisms, in mouse or rat, they are polyglycylated. And in the human entry, we explain that it is monoglycylated and we explain why it is not polyglycylated: because of the absence in human of an enzyme that is present in other organisms. We also describe the acetylation of the tubulins. Of course, when we know the positions of the modifications, this is indicated. And every PTM is described with a controlled vocabulary, established in collaboration with the RESID database. And here you can see the polyglutamylation of one residue. And it's quite interesting because we know that the C-termini of the tubulins are highly polyglutamylated on many glutamate residues, but the precise positions are not well known. And in the case of these tubulins, only one position is known. This is indicated. And as this position was proven in mouse, we give as the source of information the mouse entry, which has the experiments. We attach specific importance to annotation completeness and consistency. And when curating the spastin enzyme and the tubulins, we checked that the enzymes that mediate the polyglutamylation and remove the glutamylation were up to date. And we updated these enzymes when required. We also report the association with genetic diseases. Genetic diseases are described in a structured format, which is based on OMIM when available, and it contains cross-references to OMIM and MeSH terms. We also report publications that describe variations that affect the function of the protein. And in this case, we report many variants that are associated with spastic paraplegia.
And in Swiss-Prot, more than 30,000 variants are associated with a genetic disease. In this case, it's quite interesting to look at the localization of the variants on the protein. For this, you can use the feature viewer, which brings together the protein sequence features in one compact view and is available for all UniProtKB/Swiss-Prot entries. If you look in this case at spastin, you can see the variants associated with the disease highlighted in red in this figure, while the polymorphisms are highlighted in green. You can see that most disease-causing variants are located in the region responsible for the severing activity. And you can also see that two variants are located in the hydrophobic region. When available, we also describe the effect of variants on protein function. As you can see in this example, the consequence of the variant is described in free text. As you can see here, free text can be quite difficult to parse. For this reason, we decided to restructure this information into a structured format using a combination of controlled vocabularies: VariO terms, GO terms, and UniProtKB terms. So in this case, for example, the sentence that describes that this variation abolishes binding to the tail of beta-3 tubulin can be summarized by using VariO terms and UniProt terms. Or the abolition of the microtubule-severing activity can also be summarized by VariO terms and GO terms. Again, every piece of information is evidenced with the source of information and the evidence that supports it. Currently, more than 7,700 variants are associated with functional characterization data in Swiss-Prot, of which 4,100 have been standardized into controlled vocabulary. We aim to finish the standardization and make it public for the users. Okay. So far I have described the purely expert and manual curation.
And now I'd like to briefly describe automatic annotation, because the manual and automatic annotation processes are really tightly linked. We try to use the information in Swiss-Prot entries to build rules and to build automatic annotation. And the same curators try to use this information to make the best use of our resource. So spastin is a well-conserved enzyme that is present in most metazoans, and it is well characterized in a number of organisms: in human and mouse, but also in zebrafish, Drosophila, or C. elegans. So we use this information to build a family profile and use the information in Swiss-Prot to produce a rule that generates high-quality automatic annotation in TrEMBL, the unreviewed section of UniProtKB. The advantage of this system is that the spastin entries from newly sequenced proteomes that enter the database will be automatically annotated. So this provides a way to automatically annotate new genomes and new proteomes. Okay. To summarize the part on the expert curation procedure: curation of several papers according to the standard UniProt workflow generated the complete re-annotation of spastin, the update of PTM information in more than one hundred tubulin entries, and the update of the annotation of the enzymes that mediate or remove the glutamylation modification. And the information on spastin has been used to generate a family rule, a HAMAP rule, and to generate automatic annotation. A frequent comment made about expert curation is that it is time-demanding and expensive. And in the last years, the question of the sustainability of expert curation has been frequently raised. And this question is extremely important in the context of the exponential growth of the biomedical literature. Last year, more than one million papers were indexed in PubMed.
When we compare this number to the 8,000 papers that we curate every year, we give the impression of being completely overwhelmed by the number of publications, and that expert manual curation cannot keep up. However, this is quite misleading, and I will explain why. The first thing is that we do not aim to curate all published papers. We select a representative subset to provide a complete overview of the available information. Only publications that provide added value are added to Swiss-Prot entries. All these publications are read in detail, the full text is read, and they are fully curated. If we take the example of spastin: of the 370 papers indexed in PubMed for spastin, only 94 have been used for annotation. 19 were new articles and 75 were already present in the Swiss-Prot entries. The other 276 publications were either not relevant for annotation or presented weaker or redundant evidence. For example, this paper describes a variant that is already present several times in the entry. We read this paper and then realized that it doesn't provide any additional information, so we decided not to curate it. There are also a lot of review articles. This article was read. We like reviews because they are extremely useful to prioritize publications. But we don't curate them, because we favor publications that directly report evidence. Again, this article was read but not selected. In fact, all these 276 publications were evaluated. In some cases, evaluation took less than one minute because it was clear from the abstract that the paper is not curatable. But in other cases, we had to read the article and it took one hour. So the evaluation step took from one minute to one hour. And you can find all these other publications in the additional bibliography section of the UniProt entry. This spastin example shows that we read many more papers than we integrate in Swiss-Prot.
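As a side note to the spastin figures just quoted, the tally can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and the numbers are the ones given in the talk.

```python
# Figures quoted in the talk for the spastin literature indexed in PubMed.
total_in_pubmed = 370       # papers found for spastin
used_for_annotation = 94    # papers actually used in the entries
newly_added = 19            # new articles added during this curation round
already_in_entry = 75       # articles that were already cited in the entries

# The two subsets should add up to the papers used for annotation.
assert newly_added + already_in_entry == used_for_annotation

# Papers that were evaluated but not integrated.
not_used = total_in_pubmed - used_for_annotation

# Share of the evaluated literature that ended up in the entries.
used_fraction = used_for_annotation / total_in_pubmed

print(f"evaluated but not used: {not_used}")
print(f"share used for annotation: {used_fraction:.0%}")
```

Under these numbers, roughly one paper in four that was evaluated for spastin ended up in the entries, which is the gap between reading and integrating that the talk goes on to quantify.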
However, the problem is that we have not tracked this information until now, and that we cannot quantify the total amount of literature that is evaluated every year. So the first question is: how many articles do we evaluate every year? There is a second part to the question, which is that a large part of the literature indexed in PubMed is not curatable and not relevant for annotation in Swiss-Prot. For example, around 20% of articles indexed in PubMed are not in English. It's quite a high number. A lot of articles do not report scientific results. Life science is huge, with many fields like ecology or ethology, and many articles do not deal with any protein or gene science. And most surprisingly, a number of papers are not related to life science at all. And even if you look at molecular biology journals, a lot of articles in these journals are not curatable for Swiss-Prot. If I take the issue of Cell that published the paper about polyglutamylation: in the same issue, more than half of the papers were not curatable. Which brings the second part of the question: which part of the scientific literature is curatable? To answer these two questions, how many papers do we assess and which part of the literature is curatable, we developed a collaboration with the group of Zhiyong Lu at NCBI. They developed the PubTator tool, which is a text-mining tool, and they adapted the tool for the UniProtKB/Swiss-Prot curators. And this tool was adapted to assess the number of papers that we evaluate every year. The tool is pretty similar to PubMed. Query results are exactly the same as in PubMed, and you can also access the abstract as in PubMed. One difference is that a number of entities, like protein or gene names, species names, and chemical entities, are highlighted, which facilitates the process of evaluation.
And as said before, the tool was adapted for the UniProtKB/Swiss-Prot curators in order to classify these papers. So the papers that are curatable for Swiss-Prot curators are tagged as curatable. Some papers are tagged as not a priority. This is the case when a paper brings some additional information that is not outstanding and does not constitute a priority for Swiss-Prot curators. And, lastly, there are the papers that are considered not curatable. In this case, we can also specify why the paper is considered not curatable. Here, this article is in Chinese and it is considered out of scope. We developed two different approaches to assess the literature triage. In the first approach, four curators from different annotation programs are running the test over a six-to-eight-month period. During this period, these curators continue to select and curate publications as they currently do. The only difference is that they use PubTator instead of PubMed to select publications, and that they keep track of all paper evaluations. They tag all the curatable papers and the non-curatable papers and describe why these papers are not curatable. So, this test started three months ago. During these three months, the curators evaluated a high number of proteins and a very high number of papers. In three months, more than 2,300 articles indexed in PubMed have been evaluated. Here are the preliminary conclusions about this curation workflow. We can see that of the 2,300 papers, only 27% are curatable, of which 43% are already in Swiss-Prot, which means that only 15% of these 2,300 papers are curatable and not yet in Swiss-Prot. A high proportion of articles are out of scope for different reasons: because they are not written in English, or because they describe the protein as a marker, or because they describe regulation at the DNA level, and so on. A pretty high number, 10% of articles, are reviews.
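The 15% figure follows from multiplying the two shares quoted above. A minimal Python sketch of that arithmetic, assuming the rounded percentages from the talk (the variable names are illustrative only):

```python
# Rounded figures quoted in the talk for the first triage test.
evaluated = 2300                    # articles evaluated in three months
curatable_share = 0.27              # share tagged as curatable
already_in_swissprot_share = 0.43   # share of curatable papers already in Swiss-Prot

# Curatable papers that are not yet in Swiss-Prot, as a share of all evaluated papers.
new_curatable_share = curatable_share * (1 - already_in_swissprot_share)

# Approximate count; the inputs are rounded percentages, so this is an estimate.
new_curatable_papers = round(evaluated * new_curatable_share)

print(f"new curatable share: {new_curatable_share:.1%}")
print(f"approximate number of papers: {new_curatable_papers}")
```

Because the inputs are rounded, the result should be read as "about 15%", that is, a few hundred papers out of the 2,300 evaluated.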
In fact, many articles are reviews or comments. And 10% of articles report redundant information. This number is especially high for human, for the papers dealing with genetic diseases, because in this case there are many studies in different populations. This analysis is very useful but does not give any information about the proportion of PubMed that is curatable. This is because, for the curation workflow I described, we use gene and protein names to select the publications. In fact, many articles do not cite gene or protein names in the abstract. To try to assess the number of papers that are curatable in PubMed, we selected 13 high-impact journals that we very frequently curate in Swiss-Prot. For these 13 journals, we look at the table of contents every week and we determine for all these papers whether they are curatable or not curatable. Again, when they are not curatable, we classify them into categories. This is also extremely useful to identify publications for prioritization, and the curatable publications are then distributed to the curators. In three months, more than 2,700 papers have been evaluated, and here only 13% of papers are considered curatable, which is quite surprising when you see the list of journals, because these are really molecular biology journals that contain a lot of information about proteins, and yet a very high number of papers are out of scope. There are different reasons for that. One of them is that many articles describe biological processes that do not deal with proteins or genes. For example, this year, many articles were published about the Zika virus outbreak, but only one paper was describing the proteins. All the others were describing the outbreak and the consequences in human. There are also many...
there are also all the errata and corrections of articles, and there are also many articles about protein engineering or technology and so on. And in Nature and Science, which are general journals, many articles are also indexed that describe politics, environmental science, ecology, and so on. Okay, so, what can we conclude from this literature triage activity? The first thing is that, among the publications we observed, the relevance for UniProt is very heterogeneous. And as was shown here, a lot of publications are out of scope for UniProt or do not present much added value. A growing problem also is the number of publications that report weaker evidence, because many papers using new technologies publish a list of candidates without giving detailed information about the protein or the gene. And this is not information that we want to capture in Swiss-Prot, because we want to capture only gold-standard knowledge on which we can rely. So it also means that the major challenge in expert curation is currently the literature triage step, and that the major task of curators is to identify publications that will provide real added value for the users. And also that it's important to count the number of publications, but it's more important to select the appropriate publications. Using the PubTator text-mining tool also shows that we can both identify and prioritize publications for curation and estimate the total number of publications that we evaluate. And it also shows that, quite paradoxically, with the increase in the number of publications, expert curation is now more needed than ever in order to sort the wheat from the chaff and provide only gold-standard information. And for this, we really need a mix of expert curation and good text-mining tools.
And last but not least, it shows that while we curate 8,000 publications per year, we definitely read and evaluate a much higher number of papers. And if I extrapolate the preliminary numbers to the whole group of Swiss-Prot curators, we estimate that we evaluate between 50,000 and 70,000 papers a year, which is a much higher number. Okay. To finish, I would like to thank the UniProt teams at SIB, EBI and PIR. UniProt is funded by a number of organizations, especially the Swiss federal government, the National Institutes of Health and EMBL. And I thank you for your attention.