Welcome to this webinar on expert biocuration. You're going to discover UniProt, and more precisely the reviewed section of the UniProt Knowledgebase, Swiss-Prot, and what we mean by expert biocuration. The second part of this webinar includes a live demo showing the different steps of the expert biocuration process, using UniProt's dedicated biocuration platform, on a newly characterized family of proteins.

First, what is UniProt? The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. UniProt offers several resources, but here we will be focusing on UniProtKB, a reference protein knowledgebase which contains protein sequences associated with functional information. UniProtKB is composed of two sections: UniProtKB/Swiss-Prot, which contains biological information reviewed by expert biocurators, and UniProtKB/TrEMBL, which contains automatically annotated protein sequences. For simplicity, we will use the terms Swiss-Prot and TrEMBL from now on when we speak of these database sections.

In the Swiss-Prot section, the biological information is extracted from the scientific literature by the expert biocurators, or it comes from curator-evaluated computational analysis. The primary goal of this biocuration process is to provide an accurate and comprehensive representation of biological knowledge on a given protein and to make this data readily accessible to the life science community. We encode biological knowledge in forms that both humans and machines can understand and exploit.

What type of information can be found in a Swiss-Prot record? In the next few slides, we'll have a quick look at the elements present in the entry for the gene Nudt12 from mouse. This is also the entry on which the demo in the second part of this webinar will be based.
A Swiss-Prot record can contain, for example, one or several protein sequences which are the products of a given gene. In this example, the mouse Nudt12 gene produces two different protein sequences as the result of alternative splicing. An entry can also contain information on the family the protein belongs to, on the biological function and, if the protein is an enzyme, on the catalytic activity, on the different cofactors required for the enzyme's activity, on the subcellular location of the protein, and on many other aspects that describe the protein, such as post-translational modifications or proteolytic cleavage, involvement in disease, etc. The source of the information is provided whenever possible via evidence tags. Note that the annotation in this entry, before the major update described in the second part of the webinar, was primarily based on similarity to the human ortholog. Once the entry has been fully updated at the end of the webinar, most annotations will be based on the experiments described in the scientific literature.

In the Swiss-Prot section of UniProtKB, all information presented in an entry has been curated by expert curators, who are usually biologists or biochemists. Let's have a closer look at what this process of expert biocuration actually involves. Expert biocuration in Swiss-Prot consists of a critical review of experimental and predicted data for each protein, as well as expert verification of each protein sequence. This process requires a combination of human intelligence, well-designed software tools and advanced computational methods for literature identification and triage. To give you an idea of the scope of the body of literature on which Swiss-Prot is based, biocurators have gone through almost 240,000 carefully selected publications, which means they have critically extracted and reviewed information from these publications to provide summaries of the knowledge available for every single protein.
A well-defined expert curation process is essential to ensure that all entries annotated by expert curators are handled in a consistent manner. Swiss-Prot curation is performed by expert biologists using a wide range of tools that have been iteratively developed in close collaboration with biocurators and integrated in a dedicated biocuration platform. Documentation on our standard operating procedures is available via the help section on the UniProt website.

Now let's go ahead and discover the different steps of this biocuration process with an example. We will show the process of updating the Swiss-Prot record for the mammalian enzyme encoded by the gene Nudt12, according to a new publication. This publication shows a new activity that had previously been known in prokaryotes but not yet in eukaryotes.

Some biological background. Eukaryotic mRNAs possess a well-known cap at their 5' end, called the N7-methylguanosine cap, that protects mRNA from degradation. Another cap was recently discovered, also at the 5' end of mRNA: the nicotinamide adenine dinucleotide (NAD) cap. NAD-capped RNA can be decapped in a process known as RNA deNADding. The cleavage can occur at two different sites, as indicated by the blue arrows, depending on the biological context and the enzymes involved. The cleavage at position one is performed by an enzyme called DXO, which has been described in eukaryotes. The cleavage at position two is performed by an enzyme called NudC, which until recently had been characterized only in prokaryotes, in E. coli in particular. Several recent publications report studies of a protein with an activity similar to NudC in mammals. They found that the enzyme NUDT12 is the mammalian ortholog of the prokaryotic enzyme NudC. It belongs to the Nudix hydrolase family, NudC subfamily. It has deNADding activity on NAD-capped RNA and can hydrolyze free NAD.

Let's start our live demo on the expert biocuration of NUDT12 together with Dr.
Sylvain Poux from the Swiss-Prot group of the SIB Swiss Institute of Bioinformatics.

The curation editor is loaded together with the latest version of all controlled vocabularies and other related datasets required for the biocuration process. This is the curation editor homepage. The main window supports curation, and it is here that we will see our Swiss-Prot entries once they are loaded. The left side panel provides access to protein or nucleic acid sequences and different supporting documents, such as alignment results or nucleotide-to-protein translation results. The icons at the top provide access to different commands and biocuration tools, some developed in-house, some external: for example, BLAST for similarity searches, our in-house module for integrated sequence analysis and prediction, various sequence alignment tools, and PubTator, the text mining web tool which helps us select publications. Altogether, about 340 different commands are available, which can be executed via their icons or, for the experienced curator, via keyboard shortcuts.

Now we are ready to update the Swiss-Prot record corresponding to the mouse NUDT12 protein according to a carefully selected publication. Our aim is to update the relevant biological information in the corresponding existing entries from mouse and other mammalian species, according to the information found in this publication. The curation editor gives access to the latest internal version of the UniProt Knowledgebase. Using the menu, we can search for all entries which contain the term NUDT12 in Swiss-Prot. We select and load the entries corresponding to the mammalian proteins which are derived from the gene NUDT12. Now that we have loaded the entries, we will start to update them, which brings us to the view of the entry. We will focus on the mouse protein encoded by the Nudt12 gene. As we can see, this representation is slightly different compared to the UniProt website entry view.
However, it does look very similar to the text format, which can be downloaded from the website or from the FTP server. Entries are composed of different sections, and the lines in this representation are preceded by a two-letter code defining a section. We can see the section containing protein names and gene names; the section containing the publications which were used to annotate the entry; and the so-called comments section, where lines are preceded by the letters CC, which contains paragraphs about the protein's function and catalytic activity, among many other subjects. Then there is the section with the cross-references, which give access to relevant information in almost 200 different external databases. Most cross-references are already present in the entry, imported by automated procedures, but others can be added or modified by the curator. We can access cross-reference data directly from the editor with a simple click. There's also the section with the so-called feature table, or FT lines, which contains annotation related to the protein sequence itself, such as the positions of domains, metal binding sites, substrate binding sites, or post-translational modifications. And last but not least, the section which contains the protein sequence itself, which remains in sight all the time, even as we browse through the rest of the entry.

The first important step of the biocuration process is to check and validate the protein sequence, because a lot of annotations are linked to amino acid positions in the sequence. For some model organisms, such as human and mouse, a consortium exists to ensure a consensus sequence corresponding to the genome. The CCDS database, for Consensus Coding Sequence, is the result of UniProt's close collaboration with institutes such as EBI, NCBI, UCSC, and the Wellcome Trust Sanger Institute to establish a common clean set of human and mouse protein sequences which must match the underlying genomes.
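The two-letter line codes described above lend themselves to simple parsing. The following is a minimal Python sketch, not part of the curation platform: it groups the lines of a text-format entry by their line code and looks for a CCDS cross-reference among the DR lines. The sample entry is a hand-written, simplified placeholder in the style of the text format, not a real record, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hand-written, simplified fragment in the style of the UniProt text format.
# All identifiers and values are placeholders, not a real entry.
SAMPLE = """\
ID   EXAMPLE_MOUSE           Reviewed;          10 AA.
GN   Name=Example1;
CC   -!- FUNCTION: Placeholder function statement.
DR   CCDS; CCDS00000.1; -.
DR   ExampleDB; EX0001; -.
SQ   SEQUENCE   10 AA;
     MACDEFGHIK
//
"""


def group_by_line_code(text):
    """Group entry lines by their two-letter line code.

    Sequence data lines start with blanks instead of a code, so they are
    collected under the made-up key "seq-data".
    """
    sections = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # entry terminator
            break
        code = line[:2].strip() or "seq-data"
        sections[code].append(line[5:].rstrip())  # content starts at column 6
    return dict(sections)


def has_ccds_xref(sections):
    """Return True if any DR (cross-reference) line points to CCDS."""
    return any(
        dr.split(";")[0].strip() == "CCDS" for dr in sections.get("DR", [])
    )


sections = group_by_line_code(SAMPLE)
print(sorted(sections))
print("CCDS cross-reference present:", has_ccds_xref(sections))
```

In this sketch the presence of a CCDS line in the DR section stands in for the check described in the talk; a real workflow would of course also compare the sequences themselves.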
In order to validate the mouse NUDT12 protein sequence, we therefore have to check that the protein sequence is part of the consensus CDS protein set. A CCDS cross-reference is already present in the cross-reference section of our entry. This means that the mouse sequence matches the CCDS sequence. No additional action is needed in this case.

The major added value of expert curators is in literature curation, a process which cannot yet be handled by machines. The first important step is publication triage, where we select the publications relevant for the curation. We do not aim to curate all published papers. Instead, we select a representative subset to provide a complete overview of available information. To do this literature triage, we use a tool developed at the NCBI called PubTator. When launching this tool from within the curation platform, the literature mining results corresponding to the protein and gene names are seamlessly displayed. By using PubTator, we can find 14 publications associated with NUDT12. Many of these are not relevant because they describe high-throughput analyses or experiments performed with artificial substrates, and others are out of scope for UniProt curation. In the end, three publications were relevant for the biocuration of NUDT12 as a deNADding enzyme: numbers 3, 6 and 9 in this list.

The publication which is going to be curated now is the paper from Nature Chemical Biology. It shows that mouse NUDT12 has mRNA deNADding activity. The authors also study the 3D structure of the mouse protein. We will add this paper to the mouse entry in the reference section. We import the paper by using the get PubMed command with the corresponding PubMed identifier. We can now copy-paste the citation in the required format into the mouse entry in the reference section. The new publication is added as reference number 5.
We will now update the biological information present in our entry according to this publication and create a short summary of the main findings of the paper in both human- and machine-readable forms. One of the human-readable parts is a short textual summary of the function of the protein, in this case the fact that the protein is an mRNA decapping enzyme. We use so-called evidence tags to indicate the source of the information. In this case, the information comes from experiments described in reference 5. We select EXP, for experimental, as the evidence tag and add a link to reference 5. A statement about the previously known activity of this protein is already present in the entry, inferred by similarity to the human entry.

After having described the function in a human-readable form, we now also summarize it in a structured format by using the Gene Ontology, or GO. UniProt is a major contributor to the Gene Ontology Consortium, and manual curation of GO terms based on experimental data from the literature is part of the UniProt curation process. In this case, NAD capping of mRNAs was not yet described in GO, and we therefore contacted the GO team to request a new term to cover this activity. The term was created and is now part of the official Gene Ontology for molecular function, and it was inserted in the entry.

Now we modify the recommended name of the protein to reflect its activity. We have decided to name it NAD-capped RNA hydrolase NUDT12. The protein was not explicitly named in the article and the name was attributed by the curator. Therefore, the evidence we associate is "curated". We also specify a commonly used short version of this name, again with the evidence "curated". We also add a partial enzyme classification, or EC, number, 3.6.1, to indicate that the protein belongs to the class of enzymes called hydrolases (class 3), acting on acid anhydrides (subclass 6), specifically phosphorus-containing anhydrides (sub-subclass 1). Note that the previously used name has been kept as an alternative name.
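As an aside, the partial EC number just mentioned can be read level by level. The following is a minimal Python sketch, not part of the curation platform; the function name `ec_levels` is an assumption made for illustration, and it relies on the common convention of writing undefined levels of a partial EC number as dashes, as in 3.6.1.-.

```python
def ec_levels(ec):
    """Return the defined hierarchy levels of a (possibly partial) EC number.

    An EC number has four dot-separated fields; a dash marks a level that
    has not been assigned.
    """
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    levels = []
    for i, part in enumerate(parts):
        if part == "-":  # everything from here on is undefined
            break
        levels.append(".".join(parts[: i + 1]))
    return levels


# Class, subclass and sub-subclass are defined; the serial number is not.
print(ec_levels("3.6.1.-"))
```

Each returned prefix corresponds to one level of the classification: class 3, subclass 3.6 and sub-subclass 3.6.1.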
The next section to be updated is the catalytic activity section, which is in a structured, machine-readable format. We use Rhea, a comprehensive expert-curated knowledgebase of biochemical reactions, to describe our reactions. To refer to reaction participants, Rhea uses ChEBI, a widely used ontology for chemical compounds of biological interest. According to the paper, NUDT12 is a deNADding enzyme that hydrolyzes NAD-capped transcripts. At the time of this update, the deNADding activity was not described in any resource, and we therefore contacted Rhea curators to request the missing reaction. The new reaction was created, and we are now able to include it in the mouse entry with the identifier 60876 that was attributed to this reaction. In particular, we can specify the direction of the reaction, which is left to right in this case, and we add the experimental evidence tag.

The authors have determined the 3D structure of the protein and provide information about the quaternary structure of the protein, which is a homodimer. This type of information is stored in the subunit section. As before, the new quaternary structure information, homodimer, is tagged with a link to the source and with the fact that there is experimental evidence.

The paper also provides information about the cofactors. The cofactor section is in a structured format. All chemical compounds are mapped to ChEBI terms, just as previously in the section on catalytic activity. The previous version of our entry already had cofactor annotation, inferred by similarity from another entry. It had, however, been unclear whether the divalent metal cation was manganese or magnesium. Our new paper now shows that magnesium is indeed the cofactor, allowing us to update this information, and we add the experimental evidence tag. We can also add the information that the protein binds three magnesium ions per subunit.
The protein is also in complex with zinc, and we can add zinc as a cofactor, selecting it from the list of cofactors imported from ChEBI. More precisely, one zinc ion is bound per subunit. And we add the experimental evidence tag.

We now proceed to the annotation of sites of interest within the sequence, which can be done with the help of the crystal structure of NUDT12 in complex with a substrate analog, three magnesium ions and one zinc ion, as described in our paper. According to this data, we can annotate the positions of the metal and substrate binding sites. We check that the positions of the binding sites described in the article correspond to the ones present in the structure, and we take care to annotate only those binding sites which are biologically relevant. This is done with the help of the PDBsum resource from the EBI. In this 3D structure, we see two zinc ions in blue. There are two, one per subunit of our protein, which we've previously seen to form a dimer. We click on one of the zinc ions and see that it binds to four cysteine residues at positions 284, 287, 302, and 307 of the protein sequence. We can annotate these four metal binding sites, again indicating the source as reference 5. We proceed in the same way for the magnesium and substrate binding sites.

Now that we have added a number of details to the mouse entry, let's have a look at the new entry compared to the original version. We quickly browse through a view that shows these differences, seeing the updated name, the inserted publication, the updated paragraph documenting the protein's function, the new catalytic activity, the cofactors, the subunit section describing the quaternary structure, and the metal binding and substrate binding sites.

The next step is family-based curation propagation. Annotation content can be propagated to other mammalian species using our propagation tools.
Experimental information can be spread among orthologs, because researchers experimentally study different aspects of a protein in different mammalian species. To create a complete and consistent portrait of a protein, we carefully assess which information can be propagated, considering that not all types of information can be propagated to orthologs. We can select our template entry, mouse in this case, and our target entries, and select the topics we would like to propagate.

We start with the free-text function statement. In orangutan, bovine and macaque, we replace the existing comment, which had already been inferred by similarity, with the more up-to-date one we just curated in mouse. The human entry already had an experiment-based statement about protein function, which we don't want to erase, and we therefore decide to add the function from mouse, inferred by similarity, to the existing experimental statement. Once the information has been propagated, we can see in the human entry that the new activity has been added and that the information is inferred by similarity from the mouse entry. The previously annotated function that was backed by experiments has been conserved. We move on and propagate the new and updated catalytic activity, subunit, and cofactor statements accordingly.

We can also propagate the updated positional annotations, which are sometimes called features, based on a multiple sequence alignment. On the right-hand side, we have the alignment results and the features that are eligible for propagation, which are those that are conserved in the sequence. We select the sites that we want to propagate one by one. They are highlighted in the alignment. Once the features have been propagated, we see that the annotation is based on the mouse ortholog, citing the accession number of the mouse entry. When we now look at the human entry as a whole, we can see all the added and modified sections highlighted in color.
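The idea of propagating positional annotations through an alignment can be illustrated with a small example. The following is a minimal Python sketch under stated assumptions, not the propagation tool used in the demo: it takes two rows of an alignment (toy sequences invented for illustration), maps 1-based positions in the template to positions in the target, and treats a site as eligible only if the aligned residue is identical. The function name and the identity criterion are assumptions.

```python
from typing import Dict, List, Optional


def map_positions(
    template_aln: str, target_aln: str, positions: List[int]
) -> Dict[int, Optional[int]]:
    """Map 1-based template positions to 1-based target positions.

    Both arguments are rows of the same alignment, with '-' for gaps.
    A position maps to None if the target has a gap or a different
    residue in that column (the site is not conserved).
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned rows must have the same length")
    wanted = set(positions)
    mapping: Dict[int, Optional[int]] = {p: None for p in positions}
    t_pos = g_pos = 0  # running ungapped coordinates in template and target
    for a, b in zip(template_aln, target_aln):
        if a != "-":
            t_pos += 1
        if b != "-":
            g_pos += 1
        if a != "-" and t_pos in wanted and a == b:
            mapping[t_pos] = g_pos
    return mapping


# Toy alignment: template MKCACDE, target MCAGCSE.
template = "MKCA-CDE"
target = "M-CAGCSE"
print(map_positions(template, target, [3, 5, 6]))
```

In this toy case, template positions 3 and 5 are conserved cysteines and map to target positions 2 and 5, while position 6 is not conserved and so is not propagated.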
Most modified sections have been annotated as inferred by similarity from the mouse entry but, as discussed previously, the function section has a mixture of experimental and similarity-based sources.

The different entries are now ready and can be submitted to quality assurance curators, who will check the annotation before the updated entries are integrated into the internal Swiss-Prot database. Here is what it looks like on the website. Most of the new annotations are in the function section: the free-text summary of the function, the new catalytic activity, the cofactors, the metal binding and substrate binding sites, and the newly created GO terms. We'll proceed similarly for the other two publications initially selected for curation.

This webinar was a simplification of the process, where we have chosen to illustrate the literature curation for a single publication. In a real-life case, however, we would simultaneously review all relevant publications and create a summary and synthesis. Then we would spread a combined view of the new findings across the different sections of the entry, before propagating them to entries describing orthologs.

We hope you have found this video useful and will join us again for future webinars, where we plan to address other aspects of the curation process. As always, please don't hesitate to contact our help desk if you have any questions, updates, comments or bug reports.