So, welcome everyone. I'm Monique Zahn and on behalf of the SIB, the Swiss Institute of Bioinformatics, I'd like to welcome you to this course on Mining Enzyme Data in UniProtKB using Rhea. The trainers today are Anne Morgat, Elisabeth Gasteiger, Marie-Claude Blatter and Parit Bansal.

So, Mining Enzyme Data in UniProtKB using Rhea. The course this afternoon will be divided into three parts. I will start with the first part, describing the resources that involve chemical information. The second part, on the enzyme sequences, will be given by Marie-Claude. These are the two big parts of the course, and we will finish with a short introduction to the Semantic Web, RDF and SPARQL. As Monique said, if you want to ask questions, we provide a Google Doc at this tiny URL, which is also in the chat. So you can ask a question if something is not clear. Okay, let's start.

Just as an introduction: how do enzymes work? So that we all have the same basics, we will show you a short video provided by the training team at PDB, the Protein Data Bank. They have a very, very nice video, so we will see how they describe enzymes.

Every single second inside every living cell, thousands of chemical reactions are taking place. These reactions are performed by enzymes. An enzyme is a protein that catalyzes a chemical reaction. It initiates the reaction, speeds up the reaction's progress, and makes sure the outcome is always the same. These enzymes often work together to form longer pathways, such as the citric acid cycle, which is a series of chemical reactions used by cells to generate energy from carbohydrates. The essential tasks of life, such as metabolism, protein synthesis, and cell renewal and growth, are all regulated by enzymes. The life-sustaining power of enzymes lies in the fact that they catalyze reactions in mild conditions of pH, temperature, and atmospheric pressure.
The rates of catalyzed reactions are millions to trillions of times faster than those of the same reactions uncatalyzed. To speed up a reaction in the absence of an enzyme, additional energy would need to be provided as heat, which jostles the substrates and occasionally provides enough energy to trigger a reaction. In the course of most reactions, an unstable and highly energetic transition state is formed as the substrates are transformed into products. The enzyme acts as a template for the reaction, binding to its substrate and holding it in the proper position to form the product. An enzyme also surrounds its substrate with reactive groups that stabilize the transition state, making it easier for the reaction to occur.

So I will stop the video here, because what comes after is an example of one specific enzyme, and we will show this example during the practicals. What we will see this afternoon is the enzyme ecosystem of the Swiss-Prot group. We are creating and developing functional annotation resources in different areas. This afternoon I will present Rhea reactions, the enzyme classification with the EC numbers, and the ChEBI ontology. After this section, Marie-Claude will present the UniProt resource.

I divided this chemical information section into different parts. We will start by seeing what the enzyme classification is, then I will present the Rhea resource, the ChEBI resource, how we can perform searches in Rhea, and the different parts of the Rhea resource. The enzyme classification is devised by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, NC-IUBMB; you will sometimes see this acronym. This committee is in charge of classifying enzymes according to the reactions they catalyze. In the group, Kristian Axelsen, one of our expert biocurators, is part of this enzyme nomenclature committee.
There are seven main enzyme classes in this classification, from one to seven: oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases and translocases. On this slide you have the main reactions that are catalyzed by these enzyme classes. The first Enzyme Commission was in 1961, so more than 60 years ago. They devised a system for the classification of enzymes, and this system also had to serve as a basis for assigning code numbers to the enzymes. This is the origin of the EC number: these code numbers, prefixed by EC, contain four numbers separated by dots.

What is important in the enzyme classification is that it is a classification with a fixed depth: we have only four levels. So we have the first level, the main class, for example class 2, the transferases. The subclass 2.1 contains the enzymes that transfer one-carbon groups. The sub-subclass 2.1.2 is the subset of enzymes acting on hydroxymethyl, formyl and related groups. And finally, what we call a complete EC number is a specific class of enzymes that catalyze a given reaction.

ExplorEnz is the resource that contains only the IUBMB enzyme nomenclature. You can see here two examples of EC number entries. For each EC number you have information on the different names and synonyms. What is of interest for us is the reaction part, but you can see that IUBMB provides the reaction only as a text definition, so it is semi-structured text, like this chemical equation. We also have information in the comment section. For instance, in this case, they say that D-alanine 2-hydroxymethyltransferase, EC 2.1.2.7, also acts on 2-hydroxymethylserine. I insist on this because there are several resources that describe the IUBMB enzyme classification. In the group we have ENZYME at ExPASy, but you also have BRENDA, KEGG or MetaCyc.
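Editor's sketch: the fixed four-level structure described above can be illustrated with a few lines of Python. This is a minimal sketch, not how ENZYME or ExplorEnz process EC numbers. The function name `parse_ec` and the returned dictionary layout are made up for illustration, and the use of `-` for unspecified levels (as in `2.1.2.-`) is the usual convention for partial EC numbers rather than something stated in the talk.

```python
# The seven main classes, as listed in the talk.
EC_CLASSES = {
    1: "oxidoreductases", 2: "transferases", 3: "hydrolases", 4: "lyases",
    5: "isomerases", 6: "ligases", 7: "translocases",
}

def parse_ec(ec: str) -> dict:
    """Split an EC number into its four levels (hypothetical helper)."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].lstrip(": ")
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"an EC number has exactly four levels: {ec!r}")

    levels = []
    for part in parts:
        if part == "-":            # unspecified level in a partial EC number
            levels.append(None)
        elif part.isdigit():
            levels.append(int(part))
        else:
            raise ValueError(f"unexpected level {part!r} in {ec!r}")

    # Once a level is unspecified, every deeper level must be unspecified too.
    seen_dash = False
    for level in levels:
        if level is None:
            seen_dash = True
        elif seen_dash:
            raise ValueError(f"specified level after '-' in {ec!r}")

    if levels[0] not in EC_CLASSES:
        raise ValueError(f"unknown main class in {ec!r}")

    return {
        "class_name": EC_CLASSES[levels[0]],
        "levels": tuple(levels),
        "complete": None not in levels,
    }

print(parse_ec("EC 2.1.2.7"))
print(parse_ec("2.1.2.-"))
```

The first call describes a complete EC number in the transferase class; the second describes the sub-subclass only, so `complete` is false.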
And this description, a mix of structured text and natural language, is subject to the different policies of the editorial boards of the different resources, which explains why we don't have the same content in these resources. Sometimes an EC number, an enzyme class, is described by a reaction written entirely in natural language, in particular for the hydrolases, the proteases. There the reaction is described as the release of an N-terminal amino acid from a given peptide, etc. In this case, it is not possible to translate the reaction into an explicit form. All the reactions of the enzyme classification that are in ENZYME at ExPASy have, whenever it was possible, been translated into Rhea reactions, which I will present just after.

What we can say is that 88% of the EC numbers are linked to one or several reactions. Rhea was originally designed to provide an explicit representation of these chemical reactions. But today, even if more than 52% of the reactions are linked to an EC number, we can see that we have a great number of reactions that are not yet classified by the enzyme classification.

Here are examples of different kinds of enzyme-catalyzed reactions, and we can see that these reactions can involve different types of participants. I classify the participants according to their type. In this first reaction, all the participants are fully defined small molecules. In the second example, in blue, we have an abstract compound, a D-alpha-amino acid. We can also have macromolecules, like this D-alanyl carrier protein, which is involved as a participant of the reaction and not as the enzyme. We can also have nucleic acids, like this enzyme class that catalyzes the modification of tRNA. Similarly, we can have reactions that involve polymers, so polymerization reactions, but on different types of polymers compared to proteins and nucleic acids.
So here, in more detail, is how we represent this in Rhea: you have the structural representation of these reactions. These two reactions are related because D-alanine is a specific form of a D-alpha-amino acid. For the generic reaction, we have a molecular structure with an R group; in the case of this specific enzyme, this group is a methyl group. And we have the same for the pyruvate form, which is a 2-oxo carboxylate.

In the case of reactions involving macromolecules like proteins, we reduce the macromolecule to the functional group that is involved in the reaction. So this D-alanyl carrier protein, in its active form, has a serine residue that has been modified with a functional group to accept a D-alanine residue. The serine has been modified with a phosphate group and with a pantetheine functional group. On the product side of this reaction, we can see that this part of the molecule has not been modified, but the D-alanine is now attached to the sulfur atom. This kind of representation allows us to be sure that the reactions involving macromolecules are chemically balanced.

As I told you, in some cases, 12% of the EC numbers, it is not possible to link them to a reaction. In that case, the information in UniProt, for instance, will still be represented by text. The enzyme classification is one of the few resources that has linear growth, whereas a lot of biomedical data resources today have exponential growth. We have linear growth, but we have to keep in mind that knowledge evolves over time. Here you can see the status of the number of EC numbers per class, but we can also see that some EC numbers have been transferred to another EC number and some EC numbers have been deleted. So it is very important to have this information up to date.
So in the group, the ENZYME resource is kept up to date with the NC-IUBMB recommendations, and it is synchronized with the UniProt and Rhea releases. In that way, we ensure that all the data are consistent. It avoids, for instance, having broken links because an EC number has been transferred to another EC number or has been deleted. So that's all for the enzyme classification.

Now let's go to Rhea and take a tour. So what is Rhea? Rhea is an expert-curated knowledgebase of chemical and transport reactions of biological interest. Rhea is built on ChEBI. All the reactions are chemically balanced for mass and charge. Rhea is non-redundant. It is evidenced by literature citations. It covers all organisms, so Rhea is organism-agnostic. It is context-independent, so we do not provide any information on cellular location, etc. As we will see this afternoon, Rhea is used as a controlled vocabulary for functional annotation in UniProtKB, Gene Ontology and some other resources like SwissLipids. I think I forgot to mention it: SwissLipids is another project of the group, which is related to lipidomics, or metabolomics for lipids. Rhea is referenced in several projects like ChEBI, obviously, but also Reactome, the Enzyme Portal, MetaboLights, and metabolic resources like MetaNetX or MetaCyc. The scope of Rhea is enzyme annotation, genome-scale metabolic networks and omics-related analysis. So here the data are available on a website. I forgot. I can't. Okay, I can go back anyway.

So, the different kinds of reactions we have in Rhea. We have biochemical reactions: most of our reactions are enzyme-catalyzed reactions, but we also provide spontaneous reactions, which occur without any enzyme. It is important to have these reactions for genome-scale metabolic networks or omics-related analysis, to fill the gaps in some pathways, for instance. And we also have transport reactions.
In the case of transport reactions, you can have simple transport reactions or you can have transport reactions coupled to ATP hydrolysis, as presented in this example. In Rhea, for the moment, we represent just the two sides for the compound that is transported. For these reactions we use the tokens "out" and "in", but in fact we could just as well use "side 1" and "side 2". In the future, we will try to replace these very rudimentary tokens with something a little more explicit, but they just say that a molecule is transported from one side to another. Transport reactions are the only case where the same compound is present on both sides of the reaction.

Rhea is available on a website, with the URL provided here: rhea-db.org. All the data provided in Rhea are freely available; we use the CC BY 4.0 license. Recently, we updated the website, and we now have a new and very nice website, designed by Parit Bansal, who is present in this training. It has been described in a recent publication in the latest NAR database issue. Just to show you, and I won't go into the details: behind this website we have a very complex architecture to maintain the data, and a lot of people are behind all these nice tools that help us biocurators manage the data.

So here we are on the home page of the Rhea website. What you can see at the top of this home page is the information on the current release, because Rhea is published in synchronization with the UniProt releases, so we have four to six releases per year. You also have some basic statistics. We can see that in the current release we have about 13,900 reactions that involve 12,000 compounds, and these reactions are evidenced by more than 15 [thousand] unique publications. If you want more detail on the statistics and on the news, you can click on this link.
And if you click on the icon, the Rhea logo, you will go back to this home page. So let's start to see what the content of Rhea is. You just click on this Browse button, and you will get a table with all the reactions that are available in Rhea. This is the table that helps you browse and navigate among the data. We provide result filters, so you can filter by the kind of reaction, for instance reactions involving proteins or nucleic acids, or access the transport reactions directly. We also provide filtering by enzyme classification. This is what I showed you before: for each class of the enzyme classification, you can access the corresponding reactions. And you can see that 38% of the reactions are not yet classified.

Here is an example of a reaction page. To help you navigate, we provide a navigation menu on the left that lets you go directly to one section. The sections present in the reaction page are the reaction information, the reaction participants, the cross-references, the related reactions, the publications, and the comments.

I have a little problem. Can you hear me? Yes. I am confused, but, okay, my computer froze.

So here we are in the reaction information section, where you have the structural representation of the reaction. At the top of this page, we have a small icon that allows you to copy the text of the chemical equation to the clipboard. We do not display this text, but if you need to copy and paste it into a document, you click on this icon, it goes to the clipboard, and you can use it in your documents. There is another icon that allows you to download this reaction in RXN or RD format, which are cheminformatics formats; I will come back to these formats later. It is just to show you that, for each reaction, you can download the data in different formats.
If you mouse over a participant of the reaction, you will get a tooltip, this black tooltip, that allows you to navigate to different resources. You can search Rhea for the chemical reactions involving this specific molecule, the selected molecule. You can search Rhea for the molecules that contain or resemble this structure, so you have a link to the structure search, which I will show you in detail later. We also provide links to the ChEBI website, or you can retrieve all the proteins in UniProtKB that are annotated with this molecule.

In the reaction section, we also provide a summary of the enzyme annotation. This work is not done by the Rhea biocurators, it is done by other curators, but we provide this summary. So you can have information from UniProtKB, from the enzyme classification or from the GO molecular function, and we provide links to the different resources, as for the chemicals.

The second section is the reaction participants section; you can show or hide these sections. Once again, if you mouse over a link, you get tooltips that help you navigate. We also provide some additional information for each compound, and we will see later how we can use it.

We provide cross-references to other resources: enzyme resources, but also other metabolic resources like KEGG, MetaCyc, EcoCyc, Reactome and M-CSA. Here in this table, you have four columns. That is because Rhea is organized as a quartet of directions, let's say. When you perform a search, it will always return the identifier of the unspecified direction. For this reaction, that means we ensure that the two reaction sides are balanced, but the direction of the reaction is not specified. Whereas this ID identifies the reaction going from left to right. We also have an ID that describes the reaction going from right to left.
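Editor's sketch: the "quartet of directions" can be pictured as four identifiers sharing one equation. The talk does not say how the four identifiers relate numerically. The consecutive numbering used below (undirected ID, then +1, +2, +3) is an assumption based on how Rhea identifiers commonly appear, and the ID 10000, the names `Direction`, `quartet` and `substrates_and_products`, and the compound labels are purely illustrative.

```python
from enum import Enum

class Direction(Enum):
    # Offsets from the undirected ID: an ASSUMED convention, not from the talk.
    UNDEFINED = 0
    LEFT_TO_RIGHT = 1
    RIGHT_TO_LEFT = 2
    BIDIRECTIONAL = 3

def quartet(undirected_id: int) -> dict:
    """Return the four identifiers of one reaction, keyed by direction."""
    return {d: f"RHEA:{undirected_id + d.value}" for d in Direction}

def substrates_and_products(direction, left, right):
    """Only the two directed forms give the sides a substrate/product meaning."""
    if direction is Direction.LEFT_TO_RIGHT:
        return left, right
    if direction is Direction.RIGHT_TO_LEFT:
        return right, left
    return None  # undefined or bidirectional: no fixed substrate side

ids = quartet(10000)  # illustrative number only
for direction, rhea_id in ids.items():
    print(f"{direction.name:15s} {rhea_id}")

print(substrates_and_products(Direction.RIGHT_TO_LEFT, ["A", "B"], ["C"]))
```

For the right-to-left form the right-hand side becomes the substrate side, which is the semantic the directed identifiers are meant to carry.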
So here we have a semantic associated with the reaction sides: for the right-to-left reaction, the right reaction side corresponds to the substrates and the left reaction side corresponds to the products. And we also have the bidirectional reaction, when the reaction goes in both directions. Why do we need that? It was to be able to link to all the resources and to have precise links between the resources.

On the undefined direction, you can find the links to UniProtKB and to EC numbers, because by definition EC numbers do not give any information about the direction of the reaction. So all the EC numbers are always linked to the undefined direction. It is the same for the Gene Ontology molecular function. The KEGG reactions are always bidirectional reactions, but in MetaCyc and EcoCyc you can find the four possible directions: unspecified, left to right, right to left, or bidirectional. Reactome is mainly left to right or right to left. Maybe in the future we will provide links to Reactome on the undefined direction, just for the transport reactions, but this is not yet certain. So for the moment, consider that Reactome has only two directions. For M-CSA, which is a resource developed at EBI that describes the mechanisms of enzymatic reactions, we provide links to left to right or right to left.

The cross-references to other metabolic resources have several origins. Some of them are created by the Rhea biocurators: ChEBI, KEGG and MetaCyc. The links to Reactome are computed based on ChEBI. Reactome is using ChEBI, like Rhea. Sometimes it is not exactly the same ChEBI entry, as we will see later. But based on ChEBI, our developers set up a procedure that retrieves the corresponding reactions, and then we can link Rhea to Reactome. And as I told you before, the links from UniProt, ENZYME and Gene Ontology are made by the curators of those resources.
Maybe just a word on Gene Ontology and Reactome. As you will see, the Swiss-Prot group is a member of the UniProt consortium, so we and the UniProt curators are working very closely together. But we are also trying, with other colleagues of our group, to unify the function descriptors. Gene Ontology and Reactome are now also using Rhea as a controlled vocabulary to describe their enzyme annotation. The Rhea-to-GO mapping is performed by the GO consortium. They started less than two years ago, so it is ongoing, but we are working closely together, and it really gives us a unified way to represent enzyme function.

So now we go back to our reaction page, and we see the next section, which is related reactions. You remember our example, so you have the reaction. This section presents the more general form of this reaction, which is shown here: a reaction that involves a monocarboxylic acid and a monocarboxylate. And if you go to this more general reaction, its related reactions section presents the specific forms of this reaction. That helps you navigate between the different reactions. If you have a reaction, you can see whether there is a parent and what the other forms of this reaction are, which allows you to find similar reactions.

In the publication section, we provide the title, the authors and the links to PubMed and Europe PMC. If the publication is also used by other reactions, we provide a link: here we can see that this publication is cited by two other entries. So you have links that allow you to perform the query and retrieve the corresponding entries.

So I talked a little bit about Rhea curation. Let me present the team. We are a group of six biocurators doing this job, knowing that some of us are also involved in curation for other resources. Kristian is also involved in the curation of the enzyme nomenclature and is also a UniProt curator, and Lucila is working on the SwissLipids resource.
So she's our lipid specialist. She's also working on UniProt annotation, and Elisabeth and Nevila are working as biocurators of the UniProt resource. And our boss is Alan Bridge.

So Rhea is built on ChEBI. What is the job of a biocurator? It is to read papers, extract the reactions from the papers, and for each reaction participant identify the compound in the ChEBI resource. ChEBI is developed at EBI; it stands for Chemical Entities of Biological Interest. It is an ontology. It contains about 60,000 fully annotated chemical entries and also more than 100 [thousand] chemical entries that are not fully reviewed by the ChEBI curators. There is a very strong collaboration with the ChEBI biocurators, and Rhea curators submit a lot of compounds to ChEBI. If a compound we need for Rhea is missing in ChEBI, we can submit it to ChEBI, and sometimes we also update ChEBI entries in order to add missing information. Rhea biocurators have submitted or updated about 6,000 ChEBI entries, which is quite a big number and corresponds to half of the compounds we are using in Rhea.

I showed you the different kinds of reaction participants. All our participants, small molecules, macromolecules or polymers, are linked to a ChEBI ID. But in the case of macromolecules and polymers, we have an additional identifier to represent them. Here is an example of an L-lysyl-[protein]: for this protein we use just the lysine residue to represent the molecule. Same thing for the tRNA, and you can see that here we have this star, which represents the dummy atoms that are the attachment points to the larger macromolecule. For the polymers, we also have an additional layer if we need a different polymerization index. A polymer is a molecule with a constant part, very reduced in this case, and a repeated part. For this repeated unit, you have a polymerization index that is n in ChEBI, but sometimes we need a polymerization index of n+1, n+2, etc.
So we have an additional identifier for these molecules. Let's see in a little more detail what we have in ChEBI and what we are using. This is a typical entry page for a ChEBI compound. What we use is the ChEBI identifier, but we also use the structure. Here you have the display of the 2D structure. This comes from the MOL file. A MOL file is a format used in cheminformatics. In this format you have several sections. You have a section that describes the atom block: you have the coordinates that correspond to this display. They are x, y, z, but z is 0 because we represent only 2D structures, and this last column represents the atom. You have a section that describes the bond block, and you can have several other sections; in this case, for instance, a property block describes the global charge of this molecule.

We also use information from ChEBI for the formula and the net charge of the molecule, as well as other ways to encode the structure of the molecule: we can use SMILES and InChI. Here is an example with another molecule, aldehydo-D-glucose 6-phosphate. You have the 2D structure of this molecule, and from this structure we can compute canonical SMILES. You see in green the same color here in this string. There are also isomeric SMILES, if you want to take into account the stereochemistry of these atoms. SMILES are not unique: the string depends on the way you walk through the molecule. So you can have a lot of valid SMILES for one molecule, and you can't use them as an identifier. Then there is IUPAC, the international nomenclature committee for chemistry. They designed what we call InChI, the International Chemical Identifier. It is another way to present the structural information encoded in a string, but in this case it is unique. From ChEBI, we also use the names and the synonyms.
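Editor's sketch: the MOL file sections mentioned above (atom block, bond block, property block with charges) can be read with a few lines of Python. This is a minimal sketch assuming the common V2000 layout. The molfile text is hand-written for illustration (glycine zwitterion, heavy atoms only) and is not taken from ChEBI, and `parse_molfile` is a hypothetical helper, not part of any library.

```python
# Hand-written illustrative molfile (V2000 layout assumed): 3 header lines,
# a counts line, the atom block, the bond block, then property lines.
MOLFILE = "\n".join([
    "glycine zwitterion (heavy atoms only)",
    "  hand-written sketch",
    "",
    "  5  4  0  0  0  0  0  0  0  0999 V2000",
    "   -1.2990    0.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.5981    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    2.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "  3  4  2  0",
    "  3  5  1  0",
    "M  CHG  2   1   1   5  -1",
    "M  END",
])

def parse_molfile(text):
    lines = text.splitlines()
    counts = lines[3]                      # counts line follows the 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []                             # (symbol, x, y, z)
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []                             # (first atom, second atom, bond order)
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    charges = {}                           # atom number -> formal charge
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith("M  CHG"):
            fields = line.split()[3:]      # skip "M", "CHG" and the entry count
            for index, charge in zip(fields[0::2], fields[1::2]):
                charges[int(index)] = int(charge)
    return atoms, bonds, charges

atoms, bonds, charges = parse_molfile(MOLFILE)
print([symbol for symbol, *_ in atoms])
print(bonds)
print("net charge:", sum(charges.values()))
```

All z coordinates are 0 because the structure is 2D, and the two formal charges cancel, so the net charge of the zwitterion is 0.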
Inside ChEBI, one of these synonyms is of specific interest for us: you can see there is a synonym in ChEBI with the source UniProt. These synonyms correspond to what we call UniProt names; very often we will talk about UniProt names. It is the ChEBI synonym that is used by Rhea or UniProt to label reaction participants. But you will see with Marie-Claude that it is also used to label cofactors, modified residues and ligands. We have this specific synonym because in this way ChEBI can change their ChEBI common name if they want, and there is no impact, no consequence for the chemical reactions or for the use of the cofactors in UniProt, for instance. So we can manage the name.

ChEBI is an ontology, which means it represents concepts but also relations between these concepts. Here you have an example of the relationships that you can find in ChEBI. They have some typical relationships that we can find in all ontologies, like the is_a or has_part relationships. Here, with the triangle, it is the is_a relationship. That means D-alanine is a D-alpha-amino acid, is a non-proteinogenic alpha-amino acid, is an alpha-amino acid, etc. But the particularity of ChEBI is also that they have some dedicated relationships for chemistry. Another particularity of ChEBI is that they identify the different forms of the same molecule according to the pH. The same molecule can exist in different forms depending on the pH, and ChEBI identifies all these forms and provides relationships between the compounds. For example, for D-alanine you can have relationships to D-alaninate or D-alaninium. They represent the same molecule, D-alanine, but in different forms. In Rhea and MetaCyc we use a specific form of the molecule that corresponds to the major microspecies at pH 7.3, whereas KEGG, for instance, uses a fully hydrogenated molecule, so not charged. We prefer to use charged molecules because it helps us balance the reactions for mass and charge.
Most molecules contain some specific functional groups that are likely to lose or gain a proton under specific circumstances, like pH or temperature. In our D-alanine example there are two such groups: you have an amino group here and a carboxylic acid group here. Each equilibrium between the protonated and deprotonated forms of the molecule can be described with a constant value that is called a pKa. Based on these pKa values, we can compute the major microspecies at a given pH. We use the ChemAxon software to perform this calculation, and if you take any of the ChEBI structures that identify D-alanine and you do this computation of the major microspecies at pH 7.3, you will always retrieve this ChEBI ID, which is the D-alanine zwitterion. This is very useful to be sure that Rhea is non-redundant, because if one Rhea curator chose one form and another curator another form, we could have the same reaction twice but for different pH values. So all our reactions are balanced for pH 7.3.

Our developers set up a procedure to do this calculation for the whole ChEBI data set. This is very useful for Rhea and UniProt, but also for Gene Ontology and Reactome, and we provide this mapping in different formats. What is important to remember is that you don't have to manage this major microspecies at pH 7.3 business. We do that for you. If you search Rhea or UniProt using any one of these ChEBI IDs, the application will automatically return the ChEBI ID that we use in Rhea and UniProt. So we do the job for you, and using the molecule at pH 7.3, as I told you, allows us to be sure that our reactions are non-redundant and chemically balanced for mass and charge. Chemically balanced for mass and charge means that the sums of the formulas of the participants on each side are identical, and the same for the charges. Okay, we have seen the content of Rhea; now let's go to the next point, which is how we can search Rhea.
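Editor's sketch: "balanced for mass and charge" can be checked mechanically by summing element counts and charges on each side. This is a minimal sketch, not the procedure used by Rhea. The helper names are made up, the formula parser only handles flat formulas without brackets, and the example reaction (ATP hydrolysis written with the charged forms expected around pH 7.3) and its formulas are supplied by the editor rather than taken from the talk.

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C10H12N5O13P3' (no brackets)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_totals(side):
    """Sum atoms and charges over (coefficient, formula, charge) triples."""
    atoms, charge = Counter(), 0
    for coefficient, formula, q in side:
        for element, n in parse_formula(formula).items():
            atoms[element] += coefficient * n
        charge += coefficient * q
    return atoms, charge

def is_balanced(left, right) -> bool:
    return side_totals(left) == side_totals(right)

# ATP(4-) + H2O = ADP(3-) + hydrogenphosphate(2-) + H(+)
left = [(1, "C10H12N5O13P3", -4), (1, "H2O", 0)]
right = [(1, "C10H12N5O10P2", -3), (1, "HO4P", -2), (1, "H", +1)]
print(is_balanced(left, right))

# Mixing in the fully protonated, uncharged ATP breaks both mass and charge.
left_neutral_atp = [(1, "C10H16N5O13P3", 0), (1, "H2O", 0)]
print(is_balanced(left_neutral_atp, right))
```

The second check fails because the participants are not all in the same protonation state, which is the practical reason for fixing one pH for every reaction.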
So if you are new, if you don't know Rhea, you can click on the examples provided on the website, and they will give you an idea of the different ways to query the resource. You can search by name, you can search by identifier: Rhea, ChEBI, EC number. We will see that we can search by InChIKey and by structure, and you can also search on the cross-references that are provided in the resource. We have made some tutorials, so I will show you one of them: how to search for reactions involving specific compounds.

In this tutorial, we will learn how to search for Rhea reactions involving specific chemical compounds. Rhea uses the ChEBI dictionary of small molecules to describe reaction participants. This allows searching Rhea taking into account the synonyms, the charge and the hierarchical classification of chemical compounds. For a start, let's do a simple text search for palmitate. Here are the reactions we obtain. However, as we can see, many of these reactions do not seem to contain the actual word palmitate. We will start to explore our results and understand the relationships of these reactions with palmitate by clicking on "choose molecule" for palmitate. We see that palmitate is found in the name or synonyms of several molecules or classes of molecules. For example, the first two results, hexadecanoate and hexadecanoic acid, have palmitate as one of their synonyms and participate in the same number of reactions. Let's see the reactions for hexadecanoate, and we clearly see that our molecule participates in these reactions. If we do the same for the molecule hexadecanoic acid, we see that it does not directly appear as a reaction participant in Rhea. To understand this, we need to explain a convention used in Rhea. ChEBI handles all forms of a given chemical compound, for example neutral versus charged, fully versus partially protonated. Each form has its own unique ChEBI identifier, and related forms are linked together.
Rhea, in order to provide non-redundant and chemically balanced reactions, uses those ChEBI entities that describe the form that is the major microspecies at pH 7.3. Hexadecanoate, but not hexadecanoic acid, is used as a reaction participant at pH 7.3. You can find out more about these concepts in our help pages. If we want to exclude cases like the one just explained, we can click this checkbox. Only a subset of the previously identified ChEBI entries are reaction participants. Note that hexadecanoic acid has now disappeared from this list. We go back to browsing the Rhea reactions that have hexadecanoate as a reaction participant. We can click on this link here. We see that the query box is now filled with a query that includes the ChEBI identifier of hexadecanoate.

We could have done the same query by using the advanced search. To do so, we click on advanced search, then on "all", we select reaction participants and ChEBI small molecule, and start writing palmitate. Once again, don't be surprised if many different names are proposed that at first glance do not contain the word palmitate. It's just that the query is performed among all names, including synonyms and differently charged versions reported in ChEBI. By selecting hexadecanoate and clicking on search, we retrieve the same reactions as before. In the reaction table, if we want additional information on our molecule, we can click on hexadecanoate. A black tooltip appears. We can select to see the description of this molecule in ChEBI. A lot of information is available about our molecule on this ChEBI page, including a list of hierarchical relationships with other molecules. For example, we can see that hexadecanoate is a long-chain fatty acid anion. We can now search for reactions with this ChEBI identifier. To do this, we go back to the Rhea home page and paste this ChEBI ID into the simple query box. We get a significant number of Rhea reactions in our result page.
We can see that some of the results still display reactions involving hexadecanoate, which is expected because it is a child of long-chain fatty acid anion. This result nicely illustrates the fact that the default search in Rhea is a hierarchical search. If we search by reaction participant with a ChEBI ID, we will retrieve all the Rhea reactions that involve this ChEBI molecule or one of its children. It is of course possible to retrieve only the Rhea reactions in which long-chain fatty acid anion itself is mentioned as a reaction participant. To do this, let's use the advanced search. Enter the ChEBI ID we used previously and click on "Exact ChEBI search". By selecting this option, the list of matching reactions becomes a lot shorter, and we can see that the query form was changed and now shows the prefix for an exact ChEBI search. These reactions involve as a reaction participant the class of compounds which is labeled in ChEBI as long-chain fatty acid anion and in Rhea with the synonymous term "a long-chain fatty acid". All the numbers shown in this tutorial will of course change regularly depending on the.
Okay, so thank you to Elisabeth and Marie-Claude, who made this video. I stopped it there because the rest is just a warning about the content; I will come back to it later. So we have seen in this video how to use the Rhea advanced search. Just to show you some additional information: here you have the complete list of the fields that are available for queries. Also remember that you can use Boolean operators in order to build complex queries. For example, let's say that I would like to retrieve all reactions annotated in UniProtKB that involve lipids, so children of CHEBI:18059, which is the ChEBI identifier for the class of lipids. I am interested only in oxidoreductase enzymes, and I would like to have reactions that are mapped to GO and to Reactome. So here is how you set up this query in the advanced search.
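To make the difference between the default hierarchical search and the exact search concrete, here is a minimal sketch in Python. It is only a toy model: the hierarchy, the reaction labels and the function names are hypothetical and are not taken from Rhea or ChEBI data. It simply shows that a hierarchical search matches a term or any of its children, while an exact search matches only the term itself.

```python
# Toy parent -> children hierarchy. The two labels come from the tutorial;
# the structure is a simplified illustration, not real ChEBI content.
CHILDREN = {
    "long-chain fatty acid anion": ["hexadecanoate"],
    "hexadecanoate": [],
}

# Toy reaction table (hypothetical labels): reaction -> participants.
REACTIONS = {
    "reaction A": ["hexadecanoate", "H2O"],
    "reaction B": ["long-chain fatty acid anion", "ATP"],
    "reaction C": ["ATP", "H2O"],
}


def descendants(term):
    """Return the term itself plus all of its children, recursively."""
    found = {term}
    for child in CHILDREN.get(term, []):
        found |= descendants(child)
    return found


def search(term, exact=False):
    """Return the reactions whose participants match the term.

    exact=False mimics a hierarchical search (term or any child);
    exact=True matches the term only.
    """
    wanted = {term} if exact else descendants(term)
    return sorted(
        name for name, parts in REACTIONS.items() if wanted & set(parts)
    )


hierarchical_hits = search("long-chain fatty acid anion")
exact_hits = search("long-chain fatty acid anion", exact=True)
```

In this toy data set the hierarchical search returns both the reaction with the class and the reaction with its child hexadecanoate, while the exact search returns only the reaction that names the class, which is the same shortening of the result list seen in the video.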
And if you are comfortable with the query syntax, you can also use the simple search to perform the same query. The two queries are absolutely identical. As was mentioned in the video, the search is hierarchical by default, but it can be limited using the exact search. We saw the example with ChEBI; you have exactly the same behavior with the Gene Ontology. In the Gene Ontology molecular function you have a hierarchy, and if you search by a parent term, you retrieve all the different child terms that are mapped to reactions. But if you want to search for an exact term, you can tick this exact search box.
Okay, so let's say that we have a bunch of molecules from one experiment. How do I find the reactions corresponding to these molecules? You just have to enter your list of ChEBI identifiers combined with the OR Boolean operator and click on the search button. You will get the result list, so the list of reactions that involve one of these compounds. If you click on this customize icon, you can change the display of the table. For example, if you are not interested in enzyme data but you would like to see the ChEBI name and the ChEBI identifier, you can customize the display of your table. You select your fields of interest, click on save, and you get this new display: we have removed two columns and added two new ones, where you have the name of the chemicals and their ChEBI identifiers. At the top of the table you have other icons. If you click on the "find the enzyme" button, you can retrieve in UniProt the set of enzymes that catalyze the selected reactions. You can download the results as a table, and you can also bookmark the URL if you want to use it later or share it with colleagues. This simple search is interesting, but it doesn't scale very well as the number of ChEBI identifiers increases. In that case, it's better to use the "retrieve/ID mapping" functionality.
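As a small illustration of combining identifiers with Boolean operators, here is a sketch in Python that only assembles a query string. The field name `chebi` and the `field:value` syntax are assumptions made for the example, and the second identifier is a placeholder; check the list of query fields on the Rhea website for the exact names before using such a string.

```python
def or_query(field, values):
    """Join several identifiers for one field with OR, in parentheses."""
    return "(" + " OR ".join(f"{field}:{value}" for value in values) + ")"


def and_query(*clauses):
    """Combine clauses so that all of them must match."""
    return " AND ".join(clauses)


# 18059 is the lipid class mentioned in the talk; 99999 is a placeholder.
chebi_ids = ["18059", "99999"]

# "chebi" is an assumed field name, used here for illustration only.
participants = or_query("chebi", chebi_ids)
combined = and_query(participants, "ec:1.*")
```

The resulting string can be pasted into a simple search box. As the list of identifiers grows, this kind of query becomes long and hard to read, which is why a dedicated ID mapping service is the better choice for large lists.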
So you can retrieve reactions with different identifier types. You can start from reaction identifiers, so a set of Rhea IDs, or MetaCyc, KEGG or Reactome reaction IDs. Or you can search Rhea by ChEBI, ChEBI exact or InChIKey (we will see InChIKey later), or you can search Rhea with a set of EC numbers or Gene Ontology (GO) identifiers. You just have to fill the text box with your list of IDs, or upload a file with the list of your identifiers, and click on the submit button. In this case, you will get a table, but compared to the other searches, with ID mapping you will have an additional column, which in this example corresponds to an ID mapping with ChEBI IDs. You have your query, and in parentheses we provide the ChEBI ID that is actually used in the reaction. This is particularly important in the case of is-a mapping, or of major microspecies (protonation state) mapping, between your query and the result.
These searches are also available through programmatic access. For bioinformaticians, we have a REST API that gives programmatic access to the data. Here is an example of a query; the results are provided in tab-separated format, but you can change the format. This example is in Python, but other languages are available, like JavaScript or Java, and you can also specify the columns you would like to see in the result table. If you have questions, do not hesitate to contact Parit, who developed these programmatic access functionalities.
Now let's say that we have a bunch of chemical structures but no ChEBI identifiers. How do I find the Rhea reactions? In that case, if you provide your chemical structures as a set of InChIKeys, you will be able to retrieve the corresponding Rhea reactions. So now let me show you the second video that Marie-Claude and Elisabeth have prepared, to show you how you can retrieve reactions using the chemical structure encoded in an InChIKey.
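For the programmatic access mentioned above, here is a minimal sketch in Python of what such a request could look like. The base URL, the parameter names (`query`, `columns`, `format`, `limit`) and the column identifiers are assumptions written from memory of the Rhea help pages, so please check them against the current documentation. The sketch only builds the URL and parses a tab-separated response; it does not contact the server by itself.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint; verify against the Rhea documentation.
BASE_URL = "https://www.rhea-db.org/rhea"


def build_url(query, columns=("rhea-id", "equation"), fmt="tsv", limit=10):
    """Build a search URL. Parameter and column names are assumptions."""
    params = {
        "query": query,
        "columns": ",".join(columns),
        "format": fmt,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


def parse_tsv(text):
    """Turn a tab-separated response with a header row into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


url = build_url("chebi:18059")

# Made-up response text, only to show the parsing step.
sample = "Reaction identifier\tEquation\nRHEA:1\tA = B\n"
rows = parse_tsv(sample)
```

To actually run the query, the URL can be passed to `urllib.request.urlopen(url)` from the standard library, and the decoded response body given to `parse_tsv`.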
Welcome to this tutorial, where we will learn how to search for Rhea reactions using the chemical structure information encoded by an InChIKey. An InChIKey is a computer-readable representation of the structure of a chemical compound. It is derived from the International Chemical Identifier, or InChI, standard, a textual representation that describes a molecule in terms of different layers of information: formula, atom connections, hydrogen positions and charge. The condensed InChIKey is a hashed version of the full InChI, which is designed for easy web and database searches of chemical compounds. An InChIKey is composed of 27 characters organized in three blocks. The first block encodes information about the molecular skeleton and atom connectivity, the second block information about the stereochemistry, and the third block information about the charge of the molecule. This representation of the chemical structure of molecules is mainly used in programmatic access. However, the Rhea website supports searches for Rhea reaction participants by complete or partial InChIKey using the simple and advanced search and the retrieve/ID mapping service, interactively as well as programmatically. Let's look for Rhea reactions involving aldehydo-D-glucose 6-phosphate, which is represented by this InChIKey. We copy and paste this InChIKey into the simple search box, using the InChIKey prefix followed by a colon. It returns three Rhea reactions. Note that the URL of the result page contains the InChIKey and can be bookmarked. We could have done the same type of query using the advanced search. To do this, we click on advanced search, click on "all", select InChIKey and copy and paste our InChIKey. We can also query Rhea with a partial InChIKey, omitting the second or third block. Information on the charge of our molecule is encoded in the third block of the InChIKey. If we query with the third part omitted, we get the same three reactions as before.
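The three-block layout of an InChIKey described in the video can be sketched in a few lines of Python. This is only an illustration of the format (14 characters, then 10, then 1, separated by hyphens, 27 characters in total); the function names are hypothetical, and the example key is the InChIKey of water, used here because it is short to verify, not because it appears in the tutorial.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter: 27 characters in total.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def split_inchikey(key):
    """Split a full InChIKey into its three blocks.

    Following the description in the tutorial: the first block covers the
    skeleton and connectivity, the second the stereochemistry, and the
    third the charge (protonation) of the molecule.
    """
    if not INCHIKEY_PATTERN.match(key):
        raise ValueError(f"not a full InChIKey: {key!r}")
    skeleton, stereo, charge = key.split("-")
    return skeleton, stereo, charge


def partial_keys(key):
    """Return the full key and the two partial forms usable for searching."""
    skeleton, stereo, _charge = split_inchikey(key)
    return {
        "full": key,
        "without_charge": f"{skeleton}-{stereo}",
        "skeleton_only": skeleton,
    }


water = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"
blocks = split_inchikey(water)
variants = partial_keys(water)
```

The `without_charge` and `skeleton_only` forms correspond to the partial queries shown in the video, where the third block, or the second and third blocks, are omitted.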
The reason is that the search engine bypasses the charge constraints and always returns the Rhea reactions with the ChEBI compounds that correspond to the major microspecies at pH 7.3. This concept is explained in more detail in the guided tour to Rhea's key features and web interface, which is linked below. Queries with or without the third part of an InChIKey will therefore always return the same results. If the stereochemistry of our molecule is not relevant for our search, we can search Rhea using only the first part of the InChIKey. Now we get six Rhea reactions. We can see that these Rhea reactions share a reaction participant which has a common molecular skeleton. If we want to have additional information on the InChIKey and the corresponding molecules, we can click on a Rhea reaction. The information is found in the reaction participant section. When working with InChIKeys, it is more convenient to use the retrieve/ID mapping service, which has additional useful functionalities for this use case. We click on the retrieve/ID mapping link, select the identifier type InChIKey, copy and paste our list into the box and click on submit. The result table contains a column with the queried InChIKeys and, what is interesting, the corresponding ChEBI identifiers. The search engine performs an exact search. It does not take into account the is-a relationships found in ChEBI. We can observe that our two InChIKeys correspond to two small molecules, which participate in two Rhea reactions. If the stereochemistry of our molecule is not important, we could have used a partial InChIKey. All the results should.
Okay, so as a summary, here you have the four types of reaction participants. Search by InChIKey is only available for fully defined small molecules. For reactions involving the other types of participants, it is not possible. I think you understand that.
Let's quickly go to the substructure and similarity search, because time is going fast. If you click on this link, you can access the structure similarity search of Rhea. It is powered by IDSM, from a Czech group that is part of the ELIXIR Czech node, and they have developed a cartridge in order to perform substructure and similarity searches on small molecules. I just give you some pointers if you are interested in the details, because we have no time to do it this afternoon. It is based on the SPARQL technology that we will see at the end of the afternoon. If you click on this structure search, you get a new form, as shown here. You can enter a structure as a SMILES, you can draw the molecule with this drawing tool, or you can upload a molecule as a mol file using this functionality. Let's say that we want to perform a substructure search. We select the search type, substructure search, and the result will be a list of ChEBI entries that are used in Rhea. If you click on this link, you are taken to the search results corresponding to the selected ChEBI entry. We can also access this structure search page from the Rhea page that I showed you at the beginning. If you click on "find molecules that contain or resemble this structure", this form is automatically filled with the data you selected. And, for instance, if we want to perform a similarity search, we select this option. Once again, depending on your query, you will get the ChEBI entries that fit it. So play with the different parameters in order to refine your search.
So quickly, the Rhea exports, in terms of availability. These exports are available on the download page, where we have three sections: reactions, reaction participants and cross-references. For the data concerning the reactions we have several formats, including Semantic Web formats.
I will come back to that at the end of the afternoon. We also have some cheminformatics formats and tab-separated formats. For the RXN and RD formats: the RXN format is, for one reaction, just the concatenation of the mol files of the participants involved in the reaction, so it can be used by cheminformatics software, and the RD format is the same as RXN but with additional data. From a reaction page, if you select for instance "download in RXN format", you get this information in a new tab. You can save the file, or you can copy and paste the content and display the Rhea reaction in some cheminformatics tools. Here is an example of the tab-separated format, for the directions and the relationships between Rhea reactions. Also important is the list of obsolete Rhea IDs because, as I showed you for the enzyme classification, we also have some reactions that have become obsolete over time. They are no longer publicly available, but the list of IDs is important. For the reaction participants we have cheminformatics formats, with mol and SDF, and we have tab-separated files. Here, the ChEBI pH 7.3 mapping TSV file is important if you want to map any ChEBI identifier to the normalized form used in Rhea. In Semantic Web format we have an export of ChEBI that is called chebi.owl, and just to warn you: this chebi.owl can be different from the chebi.owl provided on the ChEBI website, because it corresponds to the set of data that is synchronized with UniProt and Rhea. It also contains additional information that is used by UniProt and Rhea, like a dedicated predicate for the major microspecies at pH 7.3 and also the names and synonyms with source UniProt, which is very useful in our case. We also provide some tab-delimited files for the cross-references. Here you have two examples: the links between Rhea and EC numbers, and the links between Rhea and UniProt.
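To show how the pH 7.3 mapping file mentioned above could be used, here is a minimal sketch in Python. The column names `CHEBI` and `CHEBI_PH7_3`, and the single sample row, are assumptions for illustration: the row is meant to represent the hexadecanoic acid to hexadecanoate case from the first tutorial, and both the header and the identifiers should be checked against the real file from the download page.

```python
import csv
import io

# Illustrative content only; the header names are assumed, and the row is
# meant to stand for hexadecanoic acid -> hexadecanoate.
SAMPLE_TSV = "CHEBI\tCHEBI_PH7_3\n15756\t7896\n"


def load_mapping(text, source_col="CHEBI", target_col="CHEBI_PH7_3"):
    """Read a tab-separated mapping into a dictionary: any form -> pH 7.3 form."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[source_col]: row[target_col] for row in reader}


def normalize(chebi_id, mapping):
    """Return the pH 7.3 form if one is listed, otherwise the ID unchanged."""
    return mapping.get(chebi_id, chebi_id)


mapping = load_mapping(SAMPLE_TSV)
normalized = [normalize(cid, mapping) for cid in ["15756", "7896", "18059"]]
```

Normalizing a list of identifiers this way before querying avoids the situation seen in the tutorial, where a neutral form such as hexadecanoic acid does not appear directly as a reaction participant.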
For the KEGG users: these files are not advertised on the download page, but they are available on the FTP site. We provide an export of our data in the KEGG format. That means that if you have some software to process KEGG data, you can load Rhea data in this format. To finish, in terms of comparison of the different metabolic resources, you can say that there are three main reaction resources: MetaCyc, Rhea and KEGG. We have nearly the same number of reactions. In terms of compounds, 12,000 is the number of compounds that are effectively used as Rhea reaction participants. But as I showed you, you can search across the full ChEBI database. You can search for any ChEBI ID; you may get an answer or no answer, but you can search the full data set. For the enzymes, that will be the next presentation, from Marie-Claude, who will explain UniProt. KEGG has no release, whereas MetaCyc has four, and for us it is between four and six. And compared to the others, we are freely available. You don't need any subscription to access the data. So please don't hesitate to contact us to help improve our resource for your community. Please click on this feedback form if there is something you don't understand or something is missing. I showed you some videos during this presentation, and they are all available from the homepage, where you have links to these YouTube videos.
Okay, so good afternoon everyone. I'm very happy to be with you for the second part of this SIB training course. This second part will focus on the enzyme sequences, and more particularly on all the information which can be found in the UniProtKB database. As for the first part, please do not hesitate to ask your questions in the Google Doc. We will see that the tight relationships between Rhea, ChEBI and UniProt give access to a lot of information on enzymes, and this is what I will try to share with you now.
So in this part, I will first give a short introduction and overview, and then I will focus on UniProt. As you may know, UniProtKB is composed of two sections: one section, Swiss-Prot, which is manually annotated and reviewed, and another section, called TrEMBL, which is automatically annotated and not reviewed. I will go into detail for both sections to explain where the protein sequences come from and where the functional annotation comes from. At the end, I will try to give some examples of queries you can do using the Rhea and ChEBI information on the UniProt website. So first, a brief introduction. Previously, biochemical reactions in UniProtKB were annotated using the IUBMB enzyme classification, so according to the EC number. Now we use the Rhea identifier, the ChEBI identifiers and the EC number to annotate the same reaction. It means that instead of having a chemical name, you now have the chemical identifier, the Rhea identifier and the EC number to describe a biochemical reaction in UniProt. And thanks to the information which is stored in ChEBI, you have access to the 2D structures and the classical representation of an enzyme reaction. In detail, this was the first way of annotating an enzyme reaction in UniProt, in the first version of this entry here. And now, if you look at the latest version, you see that the reaction is annotated according to the ChEBI identifiers, the Rhea identifier and the EC number as well. You also have information on the source of the information, so a link to the PubMed ID, and also, when it is known, information on the direction of the reaction. If you want additional information on this integration process, do not hesitate to go back to the headline here, published in 2018, or to our paper, which explains this process. So as an overview, this is just to try to sum up what Anne said in the first part.
All small molecules which participate in an enzyme reaction are now named according to the vocabulary and ontology found in ChEBI. And as Anne mentioned, in Rhea and UniProt we use only the major microspecies at pH 7.3. The Rhea reactions, so the biochemical reactions and the transport reactions, are now integrated into UniProt together with the enzyme classification, the EC number. Thanks to this integration, you now have access to the biochemical reaction, to the protein sequence, and to other information related to the enzyme, such as the subcellular location, genetic diseases, taxonomy and 3D structure. So the close links between Rhea, ChEBI and UniProt give access to a wealth of information on each enzyme. This type of integration allows you to ask questions such as: how many human enzymes are associated with a genetic disease? Or, more precisely: how many human enzymes involving, for example, dopamine are associated with a genetic disease? We will see this type of query at the end of my talk.
So this is the UniProt website homepage. UniProt is maintained by the UniProt Consortium, which is composed of EMBL-EBI in the UK, the Protein Information Resource in the US, and the SIB Swiss Institute of Bioinformatics in Switzerland. Together, there are about 110 collaborators working on this resource. I'm quite sorry about this, but there will be a new UniProt website by the end of the year, and that website is not yet finalized enough. So we decided to keep the previous version of the website for this course, and also for the training which will take place next week. But we need your help. Please do not hesitate to have a look at the new beta website and to send us your feedback, because we need users to participate in testing the new functionalities and so on. It's very helpful for us, especially as the new website is coming.
So UniProt, as you may already know, provides several resources. There is UniParc, which is a database of all the protein sequences that have ever existed. It contains only sequences, taxonomy and accession numbers, but you have no annotation in this database. I will not go into further detail; if you have any questions, do not hesitate to ask. UniProt also provides UniRef, which is the database of sequence clusters of 100, 90 or 50% identical sequences. And something maybe more important biologically here is the proteomes: the set of proteins which are expressed by a given organism for which we have a complete genome sequence available. I will give you more information later on this proteome database. But now I will focus on the UniProt Knowledgebase and on its two sections, Swiss-Prot and TrEMBL.
First of all, a few statistics on the number of records. In Swiss-Prot, there are half a million records, which have been carefully selected. Most of the time, we annotate model organisms, or we annotate a protein according to user requests. These half a million entries are very important because they are used to create automated annotation rules, which are then used for the annotation of the TrEMBL section of UniProtKB. You see that in TrEMBL we have 225 million records. But again, beware of the statistics on the number of records, because they are not easily comparable: the redundancy is not at the same level. In Swiss-Prot, you have one record for one gene for a given species. In TrEMBL, you have one record for one protein for one species. So if you look at the number of protein sequences, you have more protein sequences in Swiss-Prot than records, whereas in TrEMBL you have the same number of protein sequences as records. Just for information, in TrEMBL the redundancy is low because 100% identical protein sequences are merged together.
Then the source of annotation. In Swiss-Prot, it's mainly publications. In TrEMBL, it's automated annotation rules, either rules which are manually generated or rules which have been automatically generated according to the content of Swiss-Prot entries. And now, if we focus on enzyme records, about half of the Swiss-Prot records concern enzymes, and about 16% of the TrEMBL records are linked to an enzyme and a biochemical reaction. So today, to explain the difference between Swiss-Prot and TrEMBL, we're going to focus on a given human gene and on this specific reaction. This reaction is involved in sphingolipid metabolism. You have a description here of all the molecules involved in this reaction. The question is: what type of additional information is available in UniProt, and what will be the difference between the Swiss-Prot and the TrEMBL records?
First of all, if you do a query with the gene name and the organism, human, you will see that you get six different records which have the same gene name and which all come from human. In fact, this is a redundancy which is always present between Swiss-Prot and TrEMBL. Given the manual curation effort and the increasing amount of protein sequence data coming from high-throughput genomic sequencing projects, it is not possible to merge everything and ensure non-redundancy at this level. That's why you can find 100% identical sequences for the same species in UniProtKB/Swiss-Prot and in UniProtKB/TrEMBL. What I want to show you now is the difference in content between the entry which has been manually annotated by expert biocurators, so the Swiss-Prot one, and the TrEMBL one, which has been annotated using automated annotation rules. First, I will focus on Swiss-Prot and have a look at this entry here. You know that the entry has been manually reviewed by expert biocurators because there is a "Reviewed" status here.
This really means that the record has been annotated thanks to information which has been extracted from the literature, or from computational analyses which have been evaluated by the biocurators. If you look at the publications which have been used to annotate this entry and you click here, you see that quite a lot of publications have been used by the biocurators to annotate, for example, the function of the protein, the subcellular location, the diseases or the post-translational modifications. In fact, beyond reading publications, the literature triage is an even bigger effort for the biocurators. We have to read publications, but first we have to select the good ones. We estimate that curators evaluate around 50,000 to 70,000 papers per year in the course of the curation work for UniProtKB/Swiss-Prot. If you want additional information on this work of literature triage or on the biocuration process, you can watch this video here or go to the publication which is linked here.
So now we're going to go through the different sections of the UniProt entry. Take the names and taxonomy section, for example. This gene encodes a protein which catalyzes three enzyme reactions: you see three EC numbers here. And you see that the names of the protein can be very different from one another. There is no real standard for protein names. So when you look for a given protein, please query with the gene name. There are nomenclature committees for gene names, so it is much easier to query proteins according to their gene name. You also have information on the taxonomy, so you have the taxonomy ID, for example, and you have information on the proteome. We will talk a little more about this later. So first, the protein sequence.
Where do the protein sequences in UniProt come from? As I said previously, in Swiss-Prot you have one record for one gene, which can give rise to several protein sequences, and for one species. For this specific gene here, we see that there are five different protein sequences which are produced by this gene. If you look carefully, you see that these protein sequences have different lengths, and they may have different functions because the length is not the same. This is due to alternative splicing events or alternative initiation events. That's why, from this given gene, you can have five different protein sequences. This is important information: all the positional information in this entry refers to the canonical sequence. We choose a canonical sequence, a consensus sequence, which is going to be carefully annotated. For example, the positions of the post-translational modifications and the positions of the variants will all refer to this canonical sequence. If you download one of these records, you have the choice to download only the canonical sequence, or the canonical and the isoform sequences. And beware: not all data sets include the isoform sequences. For example, if you do a BLAST against Swiss-Prot at the NCBI, it does not include the isoform sequences. If we go back to the query we did first, you can see that the Swiss-Prot entry actually provides five different splice isoform sequences with different lengths. In fact, the different records in TrEMBL correspond to the different isoforms which are already annotated in Swiss-Prot. But as I said before, it's not possible to merge all these data. That's why you may have a TrEMBL entry which contains exactly the same protein sequence as the Swiss-Prot entry. But what you have to be careful about is that, if you align all these protein sequences, you see that they are quite different.
For example, this protein is a lysosomal enzyme, and proteins are targeted to the lysosome thanks to a signal peptide, which is located at the beginning of the protein. You see that, of the five protein sequences, only two have this signal sequence. It means that only two of them will be localized in the lysosome. So the information about the protein sequence is very, very important. Next time you write a publication, please cite the accession number and, if you can, specify which protein sequence you are working on, so that we can then integrate the information into UniProt. There are already some people who do a great job: when they talk about an enzyme, they specify the corresponding accession number in UniProt. This is really nice and helps a lot. But the next step will be to give information on the sequence, and not only on the gene name and the canonical sequence. For example, we have added some information related to specific protein sequences: we have 18,000 functional annotations which have been linked to isoforms or chains in Swiss-Prot. But this is really the beginning of a big job which is ongoing, and for that we really need your help.
So where do these protein sequences come from? 99% of the protein sequences in UniProt come from the translation of the nucleotide sequences which are found in the public nucleotide sequence databases, so the International Nucleotide Sequence Database Collaboration, which includes EMBL, GenBank and DDBJ. The protein sequences come from the translation of mRNA or of genomic data. The job in UniProt is to try to put all these sequences together, choose the best one and correct the sequence, because we want the protein sequences to match the genome sequence. So sometimes you can see a list of sequence conflicts, which are there because the protein sequences that were the translation of one of these entries did not match the genome sequence.
Sometimes there are even worse problems, such as bad gene predictions or things like that. If we look at the statistics, about 15% of the Swiss-Prot entries required curation effort to correct the protein sequences. And if we focus on human, about 89% of the protein sequences required curation effort to correct or confirm the sequence of the protein. When we know the importance of sequence quality for approaches which now use machine learning or artificial intelligence to learn from the language of proteins, you can guess that the correction of protein sequences is very important. And just to underline this: DeepMind, who did the prediction of 3D structures for all proteins, used the UniProt sequences to predict these 3D structures.
So now, about the source of annotation. As I already mentioned, we go from publications, we could say unstructured information, to structured data. This expert curation process is really essential because it adds value to research data and makes the data usable for machine learning and artificial intelligence. But what you have to remember is that most aspects of biocuration are quite complex and cannot be replaced by machine learning. Here, if you look at the function of the gene GBA, you see that you have a free-text, human-readable section here. You have the catalytic activity section, which is structured, computer- and human-readable, and uses controlled vocabulary thanks to Rhea and ChEBI. And you then have the Gene Ontology part, which is also a structured, computer- and human-readable section using controlled vocabulary. Just this morning, we received a nice email that I want to share with you. It said that you somehow succeed in writing short but complete summaries of the most important things we currently know about one protein. It's quite nice to see that some of the users appreciate the content produced by the biocuration process.
So where does this functional annotation come from? In Swiss-Prot, most of the information comes from publications. Information may also be inferred by similarity with another species, so with an ortholog. It may come from prediction, it may be imported from other databases, or it may come from sequence analysis. Each time, we mention the source using the Evidence and Conclusion Ontology, called ECO. So for each source, for example for PubMed, we have an ECO code with this number; for UniProt, we have an ECO code with this number, and so on. Let's look at the sources coming from sequence analysis, such as Phobius, which predicts transmembrane regions. Phobius is one of the programs used by the biocurators to analyze and predict, for example, subcellular location, transmembrane domains, protein topology and post-translational modifications, as well as domain identification and protein family classification. All these programs have been validated by the experts to be used for the biocuration of protein sequences in UniProt. So when Phobius is used, for example, to predict transmembrane regions and signal peptides, the source given with the ECO code will be Phobius: you can see it in the link, where Phobius is written.
Now, if we go to the publications and look, for example, at this publication here, you see that it is used for the annotation of the function as free text, but also for the annotation of the catalytic activity done by a biocurator. This is the example here. If we read the text, we see: okay, this is an enzyme with this EC number, and these are the substrates used here. And you see that, as Anne mentioned before, we have to map this name here to ChEBI to be sure that we use the correct controlled vocabulary. So by using ChEBI, we map this name here to this name here, which is one of the synonyms used in ChEBI, and we get a ChEBI identifier for each reaction participant.
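To make the idea of evidence codes a little more concrete, here is a small sketch in Python. The talk only refers to "an ECO code with this number", so the specific codes and labels below are added from my understanding of the UniProt evidence documentation and should be treated as assumptions to verify there. The example annotations and the function name are hypothetical.

```python
# ECO codes as I understand them from the UniProt evidence documentation
# (manual assertions); treat these labels as assumptions to verify.
ECO_LABELS = {
    "ECO:0000269": "experimental evidence (from a publication)",
    "ECO:0000250": "sequence similarity (e.g. an ortholog)",
    "ECO:0000255": "sequence model or analysis program",
    "ECO:0000305": "curator inference",
    "ECO:0000312": "imported from another database",
}


def describe(evidence):
    """Render one evidence as 'label [source]'.

    `evidence` is a (code, source) pair, for example a PubMed ID for a
    publication or a program name for sequence analysis.
    """
    code, source = evidence
    label = ECO_LABELS.get(code, "unknown evidence type")
    return f"{label} [{source}]"


# Hypothetical annotations with their evidence, for illustration only.
annotations = [
    ("catalytic activity", ("ECO:0000269", "PubMed:0000000")),
    ("signal peptide", ("ECO:0000255", "Phobius")),
]

experimental_only = [
    name for name, (code, _source) in annotations if code == "ECO:0000269"
]
summary = [describe(evidence) for _name, evidence in annotations]
```

Filtering on the evidence code in this way is the programmatic counterpart of restricting a search to annotations with experimental evidence.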
And this is a tight collaboration between the UniProt and Rhea biocurators. As Anne mentioned before, in UniProt we use one given name for a biochemical entity, but there are other names which can be used by ChEBI. If you want more information on the chemical compound here, you can open the tooltip here and go to the description, which is provided by ChEBI. The same applies to this reaction here: you can get additional information and click on Rhea to see all the biochemical reactions which involve this specific chemical compound. As mentioned previously, reaction direction is supported in Rhea. It means that if we have the information, we specify the reaction direction. So for example, thanks to this publication here, we know that the reaction proceeds in the forward direction. This is also important information, and it is tied to a publication.

In the function section, you can find a lot of additional information related to enzymes, such as the activity regulation, the pH dependence, the temperature dependence, and the pathways the reaction is part of. So here you see that the reaction is part of two different pathways, cholesterol metabolism and sphingolipid metabolism. You also have information on the active sites. And you can have a link, which we will see in a few minutes, to the 3D structure of the protein. In the function section, we also have a list of Gene Ontology terms. The Swiss-Prot biocurators work in tight collaboration with the GO Consortium, and when they read a paper, they annotate the corresponding GO terms. And as Anne mentioned before, even more work has been done on the GO molecular function terms, thanks to the collaboration between the Gene Ontology, Reactome, and the Rhea biocurators. You can also have information on the subcellular location of the protein. So in this case, as I said before, the protein is in the lysosome.
So this allows you to make quite complex queries, such as: find all the human enzymes which are linked to any lipid and found in the Golgi, with experimental evidence. This is the kind of query you can do, and I will give you more information on it later on. We also collect a lot of information about the post-translational modifications and the processing of the protein. This information is in the PTM / Processing section. UniProt is now using about 600 distinct modified residues, and all this information is in this section. One of the current projects is to map all the post-translational modifications to ChEBI. So, as I said, these 600 distinct modified residues have to be mapped to the ChEBI vocabulary. This will then help to make the link between a PTM and the reaction which produces it, and then to find the corresponding enzyme in UniProt which carries out the reaction, adding, for example, a palmitoyl serine on the protein. So the next goal is really to map all the small molecules and derivatives to only one chemical namespace, namely ChEBI. And we are now starting to do that for the post-translational modifications.

Another important piece of information which is stored in Swiss-Prot is the link between the gene and genetic diseases. You see here that this gene is associated with three different genetic diseases. You have the link to all the different variants which have been annotated here and which lead to the corresponding genetic disease. This is done in tight collaboration with ClinVar. The biocurators involved in this annotation use the standard guidelines for the interpretation of sequence variants, and they collaborate tightly with ClinVar to improve the annotation of the pathogenicity of each variant. And of course, we also provide the link to Online Mendelian Inheritance in Man, so the OMIM database. If you want additional information on the disease, you can follow this cross-link here.
So as I said at the beginning, this type of information allows you to do queries such as: how many human enzymes are involved in dopamine metabolism and are associated with a genetic disease? And if you do the query, you see that there are three genes which are involved in dopamine metabolism and associated with a genetic disease.

Last but not least, we also provide the link to the 3D structure. In the structure section, you have the links to the PDB entries. What is done here by Swiss-Prot is to map the positions of the canonical protein sequence to the positions of the protein which is available in the 3D structure. And at the end, we have now integrated all the predictions done by DeepMind and stored in the AlphaFold database, and you have access to the prediction. You see that it covers the entire protein sequence. Another view is available through the feature viewer. If you click here, you have access to all the annotations related to the protein sequence, and you have access to the 3D structure, which appears at the bottom. So you can click on, for example, the active site. You see the location of the active site in the protein sequence, and you can see it on the 3D structure.

So in summary, by linking UniProt with Rhea and ChEBI, you have a wealth of information on enzymes, such as the function, the catalytic activity, and the cofactors, which are directly linked to ChEBI and Rhea. Then you have other information, such as the biophysicochemical properties, the enzyme regulation, the pathways, the active sites, and the binding of ligands. This last topic will be mapped to ChEBI very soon, as will the post-translational modifications, as I said. And all this information is then linked to the 3D structure of the protein, the subcellular location and genetic diseases. This is the coverage of UniProt by Rhea.
So you see here the number of UniProt Swiss-Prot entries linked to Rhea, the number of unique Rhea reactions in Swiss-Prot and the number of unique ChEBI identifiers in Swiss-Prot. And here you have the taxonomic distribution of all the reactions which are available in Swiss-Prot. You see that we cover all sections of the tree of life.

Now I will just say a few words about TrEMBL. If I go back to the query I did at the beginning, I will choose this entry, which has exactly the same number of amino acids as the Swiss-Prot entry. So we will focus on this TrEMBL entry here. When you click on it, you see that you are in a TrEMBL entry because the status is unreviewed. It means that, in this case, it has only been annotated by computer prediction, and these records are waiting for full manual annotation. As I said at the beginning, in TrEMBL you have one record for one protein sequence in one species. So 100% identical sequences are merged together, and the source of the protein sequence is the same as for the Swiss-Prot entries: the public nucleotide sequence databases. But in this case, there is no sequence correction and no sequence validation. We just take the sequence which is provided by the submitter, and this is directly the sequence which appears in the TrEMBL entry.

Now the functional annotation. This is exactly the same gene as the one we saw previously in Swiss-Prot, but in TrEMBL. You see that you have the catalytic activity with the Rhea reactions here. You also have the GO molecular function, and you see that all the sources of information are also indicated. In TrEMBL, the functional information comes from UniRule, which is a set of rules manually created by biocurators. Or entries are annotated by automatically generated rules. So the ARBA annotation is done by rules which are generated automatically from the content of the Swiss-Prot entries.
So there is no review by the expert biocurators. Some of the information is imported from other databases or comes from a prediction, for example the prediction of transmembrane regions, signal peptides, coiled coils or disordered regions. And if we look more carefully at the catalytic activity here, where does this information come from? If we look at the first one here, we see that the information comes from UniRule. So we click here and get information on this rule. This Rhea reaction has been associated with this protein sequence because the protein sequence contains several specific protein signatures, here from Pfam and InterPro. In addition, the taxonomy should be Bacteria, Fungi or Metazoa. If the protein sequence meets these conditions, then all these annotations apply: the protein name, the EC number, the catalytic activity, which here is the Rhea reaction mentioned, the keywords and other annotations.

And if we look at this catalytic activity here, it comes from ARBA in this case. ARBA is built from the annotation found in the Swiss-Prot entries. It is a multi-class learning system which is trained on the Swiss-Prot entries, but its rules are not validated by biocurators. So the rule here is that the protein matches an InterPro signature and that the taxonomy is Chordata. And because these conditions are met, the sequence is annotated and linked to the catalytic activity which is described here. So this is the number of enzymes in TrEMBL, and you see the number of unique Rhea reactions and the number of unique ChEBI identifiers which are available so far in TrEMBL.

Marie-Claude, sorry to interrupt, it's just to say, can you come back to the last slide? It's just to tell people, because we can see here that the number of reactions shown is greater than the Rhea content.
It's just because Marie-Claude took into account the undefined direction as well as the cases where directed reactions have been annotated. So you have the undirected reaction plus the physiological direction, left to right or right to left. This is why the number of Rhea IDs is greater than what we have in the current release. In fact, in terms of unique transformations, it's a little bit less than 10,000 in Swiss-Prot. Sorry, I made a mistake when I gave you the numbers, Marie-Claude. No, it's also my fault. Thank you for the comment.

So thanks to all this information which is stored in UniProt, in Swiss-Prot and TrEMBL, you can do quite complex queries just by using the UniProt website. For example, you can use the simple search. All resources, meaning individual entries as well as sets of entries retrieved by a query, are accessible through simple URLs that can be bookmarked, linked, and used in programs. So here, for example, you have a URL that links to a specific TrEMBL entry. And here you have a more complex URL for all the human proteins which are associated with a disease and which are linked to dopamine. You can, of course, use the advanced search. When you know the different sections of a UniProt entry, you can query the different sections of an entry. If you have no idea where the answer to your query is found, you can just look in the field list here, and if you type Rhea or ChEBI, it will give you the fields in which you can find a Rhea identifier or a ChEBI identifier. Or you can choose the field directly. So for example, if I say ChEBI, you will see that you can find a ChEBI identifier in the function section, under catalytic activity, as I described previously. You can also find ChEBI in the cofactor section. I have not mentioned it, but it's also a section where you can find a ChEBI identifier. And if you query by small molecule, you query the catalytic activity and the cofactor at the same time.
And as we are going to map the post-translational modifications and also the binding sites, the ligands, to ChEBI soon, when you query by small molecule, you will query all these different sections of an entry. One of the examples I showed before was to look at how many human enzymes are involved in dopamine metabolism and are associated with a genetic disease. So if you query UniProt, for example, just for Homo sapiens, you start typing Homo sapiens, and you will see Homo sapiens and the taxonomy ID appearing. And this is the number of entries you get in Swiss-Prot and TrEMBL all together. You also have a mention here of the proteome. As I said before, we provide sets of records for the proteins which are thought to be expressed by a given organism whose genome has been completely sequenced. These sets of selected proteins are called proteomes. For human, we estimate that there are about 79,000 records, which are the products of all the human genes. Then, if you restrict the query to the proteome, you see that you have a number of entries in Swiss-Prot and a number of entries in TrEMBL.

For some organisms, all the genes are in Swiss-Prot. It means that when you look at the proteome, you have a complete set of genes in Swiss-Prot, and this is the case for human. So the number you see here corresponds to the number of genes in the human genome. But this is not the case for all species. For human, it is the case. For Saccharomyces cerevisiae, it is the case: all the genes are in the Swiss-Prot section. But for Drosophila melanogaster, for example, which has about 13,000 genes, only 3,000 genes are in Swiss-Prot. The rest are still in TrEMBL, meaning that they are still only annotated by computer. So if you have any question at that level, never hesitate to contact us. It's very important, because when you want to work on a complete proteome, you have to know which data sets you are going to use.
So for human, as I said, there are 20,360 human genes, and then you can select the ones which are annotated with a disease. You select Pathology and Biotech, then disease, and you see that there are more than 4,000 genes which are involved in a genetic disease in human. Then you can look at the proteins which are enzymes in the human proteome. This is the query: you should find the entries which have at least a catalytic activity or an EC number. And just on this point, it's not possible to know the number of enzymes which are encoded by the human genes. Nobody knows for sure. This is the answer Christian Axelsen gave to a user request. We don't know, because we don't know the roles of a substantial number of human genes. For about 5,000 human genes, for example, we don't know the function yet. Also, as Anne said, some known enzymes are currently missing EC numbers, which is another reason why we cannot have a complete set. And there are a lot of enzymes which are complexes, composed of different subunits. So it depends on what you count as a protein enzyme, and also on whether you count all the isoforms. Christian thinks that about 40% of human genes code for enzymes.

Then, if you combine both queries, you find in the end that about 1,400 genes encode enzymes which are involved in a genetic disease. And if you add dopamine to your query, you get the three genes I mentioned before, which are involved in a genetic disease and which metabolize dopamine. This table of results can be downloaded, customized and shared with your colleagues. Customizing your result table works the same way as on the Rhea website: you can add columns and you can save. Then you will have more specific information, such as the EC number, the Rhea ID, and the different names and descriptions of the genetic diseases corresponding to this human gene.
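Because every query is addressable by URL, the combined search just described (human, reviewed, enzyme, disease, dopamine) can also be assembled in a script. The following is only a minimal sketch: the base address `https://rest.uniprot.org/uniprotkb/search` and the field names `organism_id`, `reviewed`, `cc_catalytic_activity`, `ec`, `cc_disease` and `chebi`, as well as the return columns, are assumptions that should be checked against the current UniProt help pages, and `18243` is assumed here to be the ChEBI identifier for dopamine.

```python
from urllib.parse import urlencode

# Assumed base address of the UniProtKB search service (verify in the UniProt help).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query(organism_id, reviewed=True, enzymes_only=False,
                with_disease=False, chebi_id=None):
    """Combine the filters from the talk into one query string.

    All field names used here are assumptions, not confirmed by the course.
    """
    clauses = [
        f"(organism_id:{organism_id})",
        f"(reviewed:{str(reviewed).lower()})",
    ]
    if enzymes_only:
        # "at least a catalytic activity or an EC number"
        clauses.append("((cc_catalytic_activity:*) OR (ec:*))")
    if with_disease:
        clauses.append("(cc_disease:*)")
    if chebi_id is not None:
        clauses.append(f"(chebi:{chebi_id})")
    return " AND ".join(clauses)


def build_search_url(query, fields, fmt="tsv", size=500):
    """Return a shareable URL; `fields` are the columns of the result table."""
    params = {"query": query, "fields": ",".join(fields),
              "format": fmt, "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    q = build_query(9606, enzymes_only=True, with_disease=True, chebi_id="18243")
    # Column names are illustrative assumptions.
    print(build_search_url(q, ["accession", "gene_names", "ec", "rhea"]))
```

The script only builds the URL and does not contact any server, so the same string can be pasted into a browser, bookmarked, or passed to any HTTP client.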
And you can share the URL with your colleagues by clicking on this button here. And this is the URL. If you want to add columns, you can also do it either manually or by doing the query directly on the website. And if you want additional information, you can go to our help page by following this link. Okay, maybe I will skip this because we are quite short on time. So there are quite a lot of different queries you can do on the UniProt website. You can look at all the enzymes which are linked to a lipid and which have a 3D structure. You can look at all the proteins which are linked to a lipid and which are found in the Golgi. You can look for proteins linked to a lipid which contain at least one transmembrane domain. So there are a lot of different queries you can do. And if you go to this link here, you will see other examples of complex queries which can be done with the UniProt website. You can also download the result of your query in different formats: XML, FASTA, RDF or text. And yeah, that's it.

So as a summary, you can see that UniProt, by having integrated ChEBI and Rhea, now provides machine-readable small molecule data, and you can really have access to a lot of data in a very nice manner. It improves the support for computational studies of metabolic systems, of enzyme function and evolution, and for omics data and resource integration. But as I said from the beginning, beware of the difference between the content of UniProt Swiss-Prot and UniProt TrEMBL entries, because the statistics you compute may be different depending on whether your records come from Swiss-Prot or from TrEMBL. And as mentioned by Anne and me, you can access UniProt through the website, the REST API and SPARQL. Anne is going to give you some examples of SPARQL queries in the next part. So this is the Swiss-Prot team, and I thank you for your attention.

Let's start with this last section about the Semantic Web, RDF and SPARQL.
So I will present a brief introduction to what the Semantic Web is. Then we will see what RDF and SPARQL mean. And I will show you three examples of SPARQL queries using Rhea and UniProt data. So what is the Semantic Web? The Semantic Web, also called Web 3.0, is an extension of the World Wide Web through standards set by the W3C consortium. The goal of the Semantic Web is to make internet data machine-readable. We have presented several ontologies to you this afternoon, so ChEBI, Rhea, UniProt, et cetera, that describe concepts and the relationships between these concepts, these entities and categories of things. The Semantic Web enables the encoding of semantics with the data, using technologies such as RDF, for Resource Description Framework, and OWL, for Web Ontology Language. These are similar technologies, with OWL allowing reasoning over data, and both make it possible to work with heterogeneous data sources. According to the W3C, the Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise (which is not really our scope) and community boundaries. Said differently, the Semantic Web can be regarded as an integrator across different content, information, applications and systems.

The story began quite early in the group, in 2003, in the group of Nicole Redaschi. It started with Eric Jain, and now Jerven Bolleman is really our specialist in Semantic Web technologies. So what is RDF? RDF is a directed, labeled graph data format for representing information on the web. RDF, for Resource Description Framework, is organized in triples. So sometimes you will see the term RDF triple, where you have a subject, a predicate, and an object. It's very simple. Here you have it as text, and you have the same representation as a graph model, where you have two nodes, subject and object, and an edge, which is the predicate. Let's take a simple example. For example, Tom lives in Geneva.
So you can have this graphical representation. We have a node for Tom, which is the subject. The predicate is "lives in" and the object is Geneva. We can also say that Geneva is in Switzerland, and an object can be the subject of another triple. In that case, Geneva becomes the subject, "is in" is the predicate, and Switzerland is the object. For a human being, it's very easy to infer information from this kind of graph. Tom lives in Geneva, Geneva is in Switzerland, so we can infer that Tom lives in Switzerland. It's very complicated for a machine, but thanks to this format, it's now possible to do such inference.

How are the subject, predicate, and object represented, and what do they contain? Most of them contain what is called a URI, for Uniform Resource Identifier. It's a compact sequence of characters that identifies an abstract or physical resource. You have two examples here, one for UniProt and the other for Rhea. We will see some concrete examples later. You can also use a literal, such as a string, an integer, a float, a Boolean with a true or false value, or a date. Why is it important to have this URI? Let's say: "I am 9993. Would you like a flower?" But what is 9993? It can be an NCBI PubMed identifier. It can be the identifier of an NCBI gene. Or it can be the NCBI taxon ID for Marmota, which is certainly what this marmot wanted to say. All these resources use the same number, so it's very important to have the prefix to distinguish between them. The URIs that we use in the RDF representation are a little bit complex to write and a little bit verbose, so we can define prefixes. Let's say that I define the prefix pubmed, which corresponds to the first part of this URI. I do the same for taxon and NCBI gene. Now my three identifiers are associated with a prefix, and they are easier for a human to read. Here is an example of some triples from the UniProt resource. We can express that this UniProt accession has a mnemonic, which is PNCA E2.
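The two ideas above, triples that can be chained for inference and prefixes that disambiguate a bare number, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use an RDF library, the predicate names `livesIn` and `isIn` are made up for the example, and of the three namespace URIs only the UniProt taxonomy one is taken from UniProt's RDF; the other two are illustrative and not necessarily those on the slide.

```python
# Triples are (subject, predicate, object) tuples; predicate names are invented.
TRIPLES = {
    ("Tom", "livesIn", "Geneva"),
    ("Geneva", "isIn", "Switzerland"),
}

# Prefix table: the same number means different things under each prefix.
# Only the taxon namespace is UniProt's; the other two are illustrative.
PREFIXES = {
    "pubmed": "https://pubmed.ncbi.nlm.nih.gov/",
    "taxon": "http://purl.uniprot.org/taxonomy/",
    "ncbigene": "https://www.ncbi.nlm.nih.gov/gene/",
}


def expand(curie):
    """Turn a compact 'prefix:id' form into a full URI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIXES[prefix] + local_id


def infer_lives_in(triples):
    """Apply one rule until nothing new appears:
    (x livesIn y) and (y isIn z)  =>  (x livesIn z).
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(facts):
            if p != "livesIn":
                continue
            for s2, p2, o2 in list(facts):
                if p2 == "isIn" and s2 == o:
                    new_fact = (s, "livesIn", o2)
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


if __name__ == "__main__":
    print(("Tom", "livesIn", "Switzerland") in infer_lives_in(TRIPLES))
    for curie in ("pubmed:9993", "taxon:9993", "ncbigene:9993"):
        print(curie, "->", expand(curie))
```

In real RDF tooling the inference rule would come from an ontology and a reasoner rather than a hand-written loop; the sketch only shows why explicit subjects, predicates and objects make such a rule mechanical.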
This entry is also classified with a given keyword. And this keyword has a label, which is antibiotic resistance, for instance. So all the information that you have seen this afternoon can be translated into this format. It's really for machines, not for humans, but everything can be explicitly defined. There are several formats, like Turtle or RDF/XML. The purpose is not to go into the details of these formats, just to know that they exist. What is important is that in this format we have one URI per resource. So one thing, one identifier, to be sure that we can connect things together. Something else that is important is that there is now a large community using the Semantic Web, so in some projects we can reuse ontologies that have been defined by other groups, and we can work together in order to simplify the work.

So now let's go to the next part, which is SPARQL. SPARQL is the query language for RDF. SPARQL is a recursive acronym. Computer scientists love recursive acronyms. It stands for SPARQL Protocol and RDF Query Language. So SPARQL is a query language used to express queries on RDF data sources, once again using a set of triples, subject, predicate and object, to perform the query. For those who know the SQL language of relational databases, it's very similar. SPARQL supports aggregations, subqueries and negations, so something similar to what you can find in relational databases. A SPARQL endpoint is an HTTP service that is capable of receiving and processing SPARQL protocol requests. The UniProt SPARQL endpoint is available at this address, sparql.uniprot.org. It is also reachable from the UniProt website. So you have an entry point here on the UniProt web page, and it opens this window, which contains several sections: one section to enter your query, some examples to get started with the resource, as well as some documentation.
So all the information in the UniProt project as a whole is represented in RDF, and it corresponds to 49 billion triples, which is a very large amount of data. You can query all this data from the default graph. But if you are interested, there are some named graphs that allow you to query specific data sets. We will show some examples with UniProtKB, which covers Swiss-Prot and TrEMBL, but you can see that in this data set we also have information on the enzyme classification, UniProt enzyme. This section represents the entire data set for the EC numbers and all their associated information, for instance.

In Rhea, we also have a SPARQL endpoint, at this address, sparql.rhea-db.org. The SPARQL endpoint is also available from the Rhea web page, and it opens this window, which has a similar organization, with a place to enter your query, some examples and some documentation. In this case, you can see that it's a very small data set compared to UniProt. The Rhea SPARQL endpoint contains data from the rhea.rdf and the chebi.owl files. So we have Rhea and the snapshot of ChEBI that is synchronized with Rhea. For both together, we only have three million triples, so we are very, very small compared to the UniProt data set. And as I told you before, these data sets are synchronized with the UniProt RDF data releases.

So let's see some examples of the power of this technology, and how we can bridge chemistry and biology with the different resources presented this afternoon. As I told you, all the information encoded and displayed on the Rhea website follows this data model, where you have the main classes of the domain. For instance, a reaction is composed of two reaction sides, and a reaction side contains reaction participants. Reaction participants are linked to ChEBI, but we have seen that we have different kinds of participants. We have small molecules, with a direct link to ChEBI.
We also have macromolecules, and here there is some difference in the naming: for historical reasons, macromolecules are named generic compounds in rhea.rdf. So we have generic polypeptide for the proteins and generic polynucleotide for the nucleic acids. These macromolecules have a reactive part, and this reactive part is linked to ChEBI. And in the case of polymers, the polymer entries are linked to the underlying ChEBI polymer.

So let's start with a simple query. It's one of the example queries provided on the website, but it has been simplified for this afternoon. I would like to retrieve all Rhea reactions, and their equations, that have a specific ChEBI, the ChEBI describing glutamate, as a reaction participant. This is something that we can also do on the website: as we have seen, we just have to search with the ChEBI ID. It is just for illustration. So here on the left, you have the SPARQL query. It has three parts. In the first part we define the prefixes, then we define what we would like to select. Here I want the ChEBI, the Rhea ID and the equation. The question mark in this language means that it's a variable. And then I put the constraints I want to be satisfied. I can show you the same query graphically. I want to retrieve some reactions, so a variable called ?reaction that is a subclass of the Rhea reaction class. Through the predicate rh:equation, I want to retrieve its reaction equation. This reaction has a reaction side. The reaction side contains a reaction participant. The reaction participant contains a compound, and this compound is linked to ChEBI. And I want this ChEBI to be restricted to one specific value. So you just have to put the SPARQL query into the text box. Then you click on the submit query button, and you get a table with the results that correspond to your query and satisfy the constraints.
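The query just walked through can be written out roughly as follows, wrapped in a small Python helper so that the ChEBI identifier is easy to change. This is a sketch rather than the exact query shown on the slide: the predicate names (`rh:equation`, `rh:side`, `rh:contains`, `rh:compound`, `rh:chebi`) follow the public Rhea SPARQL examples, the endpoint address is assumed to be `https://sparql.rhea-db.org/sparql`, and `29985` is assumed to be the ChEBI identifier for L-glutamate. The `run_select` helper is hypothetical and uses only the standard SPARQL protocol (a GET request with a `query` parameter).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

RHEA_ENDPOINT = "https://sparql.rhea-db.org/sparql"  # assumed endpoint address

# Three parts, as in the talk: prefixes, the SELECT clause, the constraints.
QUERY_TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>

SELECT ?chebi ?rhea ?equation
WHERE {
  ?rhea rdfs:subClassOf rh:Reaction .
  ?rhea rh:equation ?equation .
  ?rhea rh:side/rh:contains/rh:compound ?compound .
  ?compound rh:chebi ?chebi .
  VALUES (?chebi) { (CHEBI:__CHEBI__) }
}
"""


def reactions_with_chebi_query(chebi_number):
    """Return the SPARQL text for one ChEBI identifier (digits only)."""
    chebi_number = str(chebi_number)
    if not chebi_number.isdigit():
        raise ValueError("expected the numeric part of a ChEBI ID, e.g. '29985'")
    return QUERY_TEMPLATE.replace("__CHEBI__", chebi_number)


def run_select(endpoint, query, timeout=60):
    """Send a SELECT query and return the list of result rows (needs network)."""
    url = endpoint + "?" + urlencode({"query": query})
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    print(reactions_with_chebi_query("29985"))  # 29985: assumed L-glutamate ID
```

Printing the query and pasting it into the endpoint's text box is equivalent to calling `run_select(RHEA_ENDPOINT, ...)`. Note that this pattern only follows the direct link used for small molecules, not the reactive parts of macromolecules or polymers.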
So in fact, in the case of this query, doing a SPARQL query is just like selecting a subgraph of the whole Rhea graph. We start from the reaction and we go to ChEBI, and this is what we did with this query. Once you have designed one query, you can obviously just replace the ChEBI ID with another ChEBI of interest and reuse your query. So it's very easy. It's true that the learning curve is a little bit steep with RDF and SPARQL, but once you get started, it's a very powerful tool. And we are happy to help you design your queries if you need it.

Let's see a second example. I would like to retrieve the reviewed human enzymes metabolizing N-acylsphingosine. N-acylsphingosine is a class of compounds with a non-defined group, an R group, and I would like to retrieve all the children or descendants of this compound that can be participants in Rhea reactions. The first part of the query, retrieving reviewed human enzymes, cannot be done in Rhea because it's in the scope of UniProt, and we don't duplicate the information. So we can write the query that corresponds to this part in UniProt. I want to retrieve proteins that are reviewed, with the value true, so this retrieves Swiss-Prot entries. I want these proteins to be linked to taxonomy ID 9606, which is the taxon ID for Homo sapiens. And I would like these proteins to have a catalytic activity annotation associated with a Rhea reaction. The UniProt RDF only contains the Rhea reaction URIs, and no more information on the reaction. Now we go to the Rhea side. In Rhea, we can write the same kind of query as the one we performed previously. The difference is that, instead of giving one specific ChEBI ID here, I want to retrieve all the ChEBI that are subclasses of, so all the descendants of, the ChEBI that encodes N-acylsphingosine, or N-acylsphing-4-enine as we call it in Rhea and UniProt.
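Since both halves refer to the same Rhea reaction URIs, they can be written as a single query that is sent to the Rhea endpoint and delegates the protein part to UniProt through a `SERVICE` block. The sketch below is an assumption-laden reconstruction, not the query from the slide: the UniProt predicates (`up:reviewed`, `up:organism`, `up:annotation`, `up:catalyticActivity`, `up:catalyzedReaction`) follow UniProt's published SPARQL examples, and the parent ChEBI identifier is left as a parameter because the exact ID for N-acylsphing-4-enine is not given in the talk (`52639` is used below as an unverified placeholder).

```python
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"  # assumed endpoint address

FEDERATED_TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

SELECT DISTINCT ?chebi ?rhea ?protein
WHERE {
  # Rhea side: reactions whose participant is the parent class or a descendant
  ?rhea rdfs:subClassOf rh:Reaction .
  ?rhea rh:side/rh:contains/rh:compound/rh:chebi ?chebi .
  ?chebi rdfs:subClassOf* CHEBI:__PARENT__ .
  # UniProt side: reviewed proteins of one organism catalyzing that reaction
  SERVICE <__UNIPROT__> {
    ?protein up:reviewed true .
    ?protein up:organism taxon:__TAXON__ .
    ?protein up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea .
  }
}
"""


def enzymes_for_chebi_class_query(parent_chebi, taxon_id=9606,
                                  uniprot_endpoint=UNIPROT_ENDPOINT):
    """Build the federated query text; both IDs must be plain numbers."""
    parent_chebi, taxon_id = str(parent_chebi), str(taxon_id)
    if not (parent_chebi.isdigit() and taxon_id.isdigit()):
        raise ValueError("ChEBI and taxon identifiers must be numeric")
    return (FEDERATED_TEMPLATE
            .replace("__PARENT__", parent_chebi)
            .replace("__TAXON__", taxon_id)
            .replace("__UNIPROT__", uniprot_endpoint))


if __name__ == "__main__":
    # 52639 is an unverified placeholder for the N-acylsphing-4-enine class.
    print(enzymes_for_chebi_class_query("52639"))
```

The shared variable `?rhea` appears both outside and inside the `SERVICE` block, and that is what performs the join between the two resources.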
As the reaction URIs are common to both UniProt and Rhea, we are now able to join on this identical URI, and we can perform what we call a federated query. So here is the query that answers our question. It will be sent from the Rhea endpoint. You remember the prefixes, the select and the conditions, and here we have an additional block, which is a service call to the UniProt endpoint with the query for that part. So, said differently, our query is to retrieve the compounds, the reactions involving N-acylsphingosine and the human enzymes catalyzing them. The user sends the SPARQL query to the Rhea SPARQL endpoint, with the constraints on ChEBI and Rhea. The subset of results is sent to the UniProt SPARQL endpoint. It filters the UniProt proteins that correspond to the constraints on the NCBI taxon ID and the corresponding reaction. It returns the protein results to the Rhea endpoint, which returns the result to the user. So it's an example of a federated query using one service.

Here is the result of our query in Rhea. We have a table with our results, and in this table you have some links. For example, this is the URI of the UniProt entry, but we provide a redirection to the UniProt website so that you can access the full content of the resource. Similarly, the Rhea URIs are redirected to the Rhea URL for the specific reaction. I showed you an example starting from the Rhea endpoint, but we can do exactly the same query from the UniProt SPARQL endpoint. So here I send the query with a service call to Rhea, and we get the same result, with links to the different resources, as I showed you just before.

Now let me show you a more complex example, which is a spatial metabolomics use case where, for example, we would like to create a mapping of small molecules, let's say sterols, to anatomical structures. We don't have this information in Rhea, and we don't have this information in UniProt.
So we need to use a third resource: the curated gene expression data from Bgee, which is another SIB resource. Now I can phrase my question as: where are the human genes encoding enzymes that metabolize sterol derivatives expressed? So this query will map all the members of the sterol class to anatomical structures in human. Here it's a query executed at the UniProt endpoint with two services. One is for Rhea, to retrieve the Rhea reactions, so you see the ChEBI that are subclasses of this specific compound. In UniProt, we constrain on the taxonomy, the taxon ID of human. We want Swiss-Prot entries. We want a catalytic activity annotation to get the Rhea reaction. And additionally, what we add is the link between the UniProt entry and the Ensembl gene, because Bgee also has links to Ensembl genes. So we can join all the data on these Ensembl genes. From Bgee, we can retrieve the corresponding Bgee gene and retrieve the anatomical entity IDs, which are encoded in different ontologies, one of them being Uberon, and also the names of these entities. So this is a schematic representation of the same process. We send the query, we go to Rhea, we come back to UniProt, we go to Bgee, we come back to UniProt, and the UniProt endpoint sends the result to the end user.

So once you have designed your query, you put it in the SPARQL endpoint, you submit, and you get the result as a table, as I showed you previously. You have several export formats, the UniProt format, XML, JSON, CSV, so comma-separated data, and you can also share the URL if you want to redo this query later. Here it's an export of the result, and from this query we get more than 45,000 results. For each human enzyme, we have the chemical entities, the sterol derivatives, and their location in leukocytes, liver or whatever we have. So this is a really, really powerful tool.
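With tens of thousands of rows, an exported result table is usually summarized in a script rather than read by eye. As a minimal sketch, assuming the results were saved in the standard SPARQL JSON results layout (a `results.bindings` list in which each variable maps to a dictionary with a `value` key), the rows can be tallied per anatomical entity as follows. The variable names `protein` and `anatName` and the three sample rows are invented for illustration and are not real query output.

```python
from collections import Counter

# Made-up rows in the SPARQL JSON results layout, for illustration only.
SAMPLE_BINDINGS = [
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/EXAMPLE1"},
     "anatName": {"type": "literal", "value": "liver"}},
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/EXAMPLE2"},
     "anatName": {"type": "literal", "value": "liver"}},
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/EXAMPLE1"},
     "anatName": {"type": "literal", "value": "brain"}},
]


def binding_values(bindings, variable):
    """Collect the plain values of one variable, skipping unbound rows."""
    return [row[variable]["value"] for row in bindings if variable in row]


def count_by(bindings, variable):
    """Count how many result rows carry each value of `variable`."""
    return Counter(binding_values(bindings, variable))


def local_id(uri):
    """Keep only the last path segment of a URI, e.g. the accession."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


if __name__ == "__main__":
    for name, n in count_by(SAMPLE_BINDINGS, "anatName").most_common():
        print(name, n)
    print(sorted({local_id(u) for u in binding_values(SAMPLE_BINDINGS, "protein")}))
```

The same counting applies to any other column, for example a taxon variable when computing the taxonomic distribution of reactions.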
I will stop showing other examples; this was just to give you the flavor, so that you can explore and exploit the enormous richness of UniProt with this kind of query. With SPARQL queries you can perform structural searches like IDSM Sachem. IDSM Sachem is used in the Rhea website, but behind it there is a SPARQL query, so you can do exactly the same directly in SPARQL. You can link the data to diseases, as Marie-Claude showed, to cofactors, PTMs, protein structures, similar sequences with UniRef clusters, or homologous sequences with other resources like OrthoDB and OMA, which are other SIB resources referenced in UniProt. You can play with taxonomy. For instance, I want to retrieve all the reactions that are only found in bacteria or archaea, whatever you want. And we can also work on a complete proteome. So during the practicals, we will do some of these queries. And SPARQL can also be used for programmatic access. I showed you the SPARQL endpoint interface, but you can perform SPARQL queries from different programming languages. For my part, I use a Jupyter notebook and Python to process the data and check data consistency. So here you have an example of a SPARQL query that you can process, and then you can use any Python library to draw whatever diagram you want. Here it's an example where we compute the taxonomic distribution of the Rhea reactions according to the UniProt annotation. If you want to learn more about RDF and SPARQL, you can access this course at this URL, where you will find other courses related to SPARQL and RDF for SIB resources like neXtProt, Bgee, as I said, OMA, OrthoDB and all those resources. So accessing this course will give you all the material available. And I think now we are near the end of this training course. So I would like to acknowledge all the people involved in the Rhea project.
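For the programmatic access mentioned above, a minimal Python sketch using only the standard library could look like this. It is not the notebook used in the course. It assumes the endpoint follows the standard SPARQL protocol (query passed as a `query` parameter, results requested as SPARQL JSON), and the helper names `run_sparql`, `parse_bindings` and `count_by` are invented for this example.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def parse_bindings(results: dict) -> list[dict]:
    """Flatten the W3C SPARQL JSON results format into {variable: value} rows."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_sparql(endpoint: str, query: str) -> list[dict]:
    """Send a SELECT query with HTTP GET and return the flattened rows."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))


def count_by(rows: list[dict], variable: str) -> Counter:
    """Count rows per value of one variable, e.g. reactions per taxon."""
    return Counter(row[variable] for row in rows if variable in row)


if __name__ == "__main__":
    # Offline demonstration on a hand-written result document (placeholder values).
    sample = {
        "head": {"vars": ["reaction", "taxon"]},
        "results": {"bindings": [
            {"reaction": {"type": "uri", "value": "r1"},
             "taxon": {"type": "literal", "value": "Bacteria"}},
            {"reaction": {"type": "uri", "value": "r2"},
             "taxon": {"type": "literal", "value": "Archaea"}},
        ]},
    }
    print(count_by(parse_bindings(sample), "taxon"))
```

With a real query, `run_sparql("https://sparql.uniprot.org/sparql", query)` would return the rows, and the resulting `Counter` can be handed to any plotting library to draw the distribution.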
So the PI is Alan Bridge, and I will show you the Rhea biocurators and the Rhea developers, who are doing tremendous work. I would also like to thank the ChEBI curators, Adnan and Gareth, directed by Andrew Leach. We also told you about the collaborations we have with GO and Reactome: the GO Consortium in the persons of Harold Drabkin, Chris Mungall, Jim Balhoff, Paul Thomas and David Hill, and Pascale Gaudet, whom I forgot on this slide. I'm sorry for her. Peter D'Eustachio for Reactome, and we also collaborate with the IDSM guys, Jakub and Jiří. And we gratefully acknowledge the software contribution of ChemAxon. And acknowledgements for UniProt too. So as Marie-Claude said, it's a consortium. The people from EMBL-EBI are in green, the American ones from the PIR team are in blue, and in red are the SIB ones, with Alan Bridge's group. Voilà. So I'd like to thank you for your attention. And if you have additional questions, we can go to the question session, or maybe you are totally dead.