So, hello everybody. Today we have the privilege of having Pascale Gaudet for the Computational Biology seminar series. I will try to give a short bio of Pascale. She first got a PhD from Concordia University in Montreal, Canada, in 2001, and then started working in the field of biocuration. Before joining the Swiss Institute of Bioinformatics, she was a research assistant professor at Northwestern University in Chicago, working on projects involving model organism databases and the Gene Ontology Consortium. She joined the SIB in 2010 as scientific manager of neXtProt, a new and innovative knowledge resource on human proteins, and she now leads a team of experts who integrate data from other databases, published studies, and data submissions from researchers. Briefly, she is also an editor of the journal Database, and one of the founding members and chairperson of the International Society for Biocuration, a non-profit organization aiming to facilitate interaction between biologists and computer scientists developing databases. In this framework, Pascale will talk today about the curation activities going on at neXtProt, activities that have become essential to the life science research community. So, the stage is yours.

Thank you, Diana, for the introduction and for the invitation. Just one correction: in fact, I'm not chair of the Biocuration Society anymore; Alex Bateman is now the chair. Today I'm going to talk about the curation activities at neXtProt. My talk is divided into three parts: the status and new features of neXtProt; then the manual annotation projects that are going on; and a little bit on the work we're doing on text mining to help us do biocuration.

neXtProt is a human-centric protein knowledge base that aims to capture, like Diana said, as much information as we can on human proteins. It is a resource that complements UniProt, a resource for which the Swiss-Prot group of the SIB is a member of the consortium. In neXtProt, the protein pages are organized into different views describing the function, expression, interactions, localization, sequence, and proteomics data relevant to each entry.

Our main data source is UniProt, so we have data from UniProt in all those different areas, but we also integrate data from other sources in each one of those aspects. For function, we also integrate data directly from the GOA group at the EBI, which aggregates all the Gene Ontology annotations. For expression, we integrate data from the Human Protein Atlas, a group in Sweden that has designed antibodies against essentially all human proteins; we integrate the tissue expression data they have generated, as well as the subcellular localization data they have, based on those antibodies. For expression, we also load Bgee data; Bgee is a resource that you probably know, developed here in Lausanne. For interactions, in addition to the interactions curated at the SIB that we load from UniProt, we also load additional interactions from IntAct. Localization I already mentioned. For sequence and medical data, in addition to UniProt we load variants from COSMIC as well as dbSNP. And for proteomics, finally, we load data from SRMAtlas and PeptideAtlas, and we also load data from the Human Protein Atlas, this group in Sweden that makes the antibodies: we load the sequences of their PrESTs.
These are the protein fragments that they use to immunize the rabbits to make the antibodies. So, these are the numbers for each of these data types in the different areas.

One important thing for neXtProt is to integrate data from many different sources. One of the important things we wanted to do when the database was started was to create a database more specialized than UniProt, focused on human, but with extra data, that is, to add more data sets and data from other databases; as you know, there is a lot of data out there. And one important thing for us is to only load data that is deemed of high quality, so we spend quite a bit of time curating those data sets and selecting what we want to load versus what we don't want to load. To do this, we have a data quality assessment ranking in which we divide the data into three ratings: gold, silver, and bronze. Gold is the data that we deem of very high quality; whenever we have some kind of quantitative measurement of quality, we aim for an error level of less than 1%. Silver is between 1% and 5% error, and bronze is any data that we think is probably less than 95% reliable; we don't load that data into neXtProt. There is lots of data out there that we just don't feel is of good enough quality to be loaded. Like everybody else, we have limited resources, so the data that we load is really determined by the needs that we have. Right now we focus on two aspects in particular: proteomics data and variants.

The reason we're interested in proteomics is that neXtProt is the knowledge resource for a project called the Human Proteome Project, an initiative within HUPO, the Human Proteome Organization. This is a group of people worldwide, different labs doing proteomics, whose goal is to identify all the proteins in the human body. They want to identify all proteins, all isoforms, all post-translationally modified versions of the proteins, all variants and so on, and to identify in which tissues they're expressed, in which organelles they're expressed, whether they're expressed in healthy people, in different diseases, or during disease progression, and so on; the scope is very large. We're part of this HPP as the knowledge pillar. Our role as part of the HPP is to integrate the results of the mass spectrometry identification studies from the different contributing groups, as well as to provide metrics to assess the progress of the project, as they try to determine whether they have identified all the proteins, or what fraction of the proteins they've identified. Also, to expand the representation of the functional knowledge on proteins: essentially, the more data they have on what a protein does, the more interesting for them. And, very importantly, to validate protein existence based on these proteomics studies.

In terms of proteomics data in neXtProt, we have right now about 400,000 different peptides that we've loaded, and these map to 84% of our entries. We've also integrated the SRM Atlas. SRM, for those of you who don't do proteomics, stands for Selected Reaction Monitoring; in two words, it's a way to do quantitative proteomics, and this is really the next big trend in proteomics, so people are really interested in those SRM peptides, and we've loaded them. There are fewer SRM peptides.
There are 144,000, but they map to almost all entries, because they're designed artificially as reagents to identify proteins; that's why they can have such a large coverage. We also work to integrate high-quality post-translational modifications (PTMs) from high-throughput papers. We have almost 50,000 different sites on 14,000 entries, so that's about three-quarters of our entries, in addition to the data that we already have from UniProt; in total we have about 80,000 PTMs.

This is the proteomics view of neXtProt. All our views are organized in a similar way, and you can go from one view to the next using this left-hand menu. Here I'm showing the proteomics view with some of the things I just mentioned. You can see the PrESTs; this is, again, like I said, the protein fragment used to design antibodies by the Human Protein Atlas group. This is where that PrEST aligns on the sequence, so their antibody binds an epitope somewhere around there on this sequence. We also have all the different peptides and all the different SRM peptides. All the pages in neXtProt that have sequences are divided into these three views: the graphical view on top; on the left, a feature table; and here, the sequence. Those views are linked together, so that if you click on a peptide, it will go to that row in the table and highlight the sequence in the sequence view here. And here you have the ability to open the references and see what evidence the data is based on.

So, as I said, it's very important to... that's an interesting color... it's very important to validate the existence of proteins. Based on the proteomics data, we have managed to validate, so far, a total of 16,520 proteins at the protein level in neXtProt. That's about 2,500 more than what's in UniProt, and the difference is these proteomics data sets that are used for validation.

The other area where we put a lot of focus is sequence variation, and I'm going to talk about this in the part on our manual annotation projects: we have some projects in which we annotate the functional impact of variants, so we've been putting a lot of effort into loading as many variants as we can so that we can more easily annotate their phenotypes. Our data sources right now are UniProt, dbSNP, which is a database of small genetic variations, and COSMIC, a database of somatic variants in cancer. We have 450,000 variants from dbSNP and 1 million from COSMIC; some of them overlap, so we have a bit over 1 million, 1.1 to 1.2 million variants in total. And we're in the process of loading ClinVar, which is a database of clinically relevant variants from the NCBI in the US.

When we load the data from COSMIC and from ClinVar, we're not only interested in recording the fact that a variant exists; we also want to load which disease that variant was associated with. And this proved to be quite challenging, because there is no single vocabulary for diseases. If you have looked at this, it's very difficult. Different resources, of course, use different vocabularies, but here we have the problem that ClinVar uses many different vocabularies, depending on who submits the data; they're open to any vocabulary, including free-text descriptions. And COSMIC has its own, what I call, annotation model; it's not exactly a controlled vocabulary. Also, COSMIC sequences cell lines directly.
So we needed to integrate all this onto a single vocabulary, if possible. What we did is develop two resources that we called the Cosmosaurus and the Cellosaurus. For the Cosmosaurus, the idea is to take the COSMIC terminology and map it onto NCI Thesaurus terms. The NCI Thesaurus is developed in the US by the National Cancer Institute; it's one of the many disease vocabularies out there, and the reason we chose it is that it's very comprehensive in terms of cancers, which is what we were interested in, at least at first.

COSMIC describes patient samples using four fields. They have two fields about the tumor site, which they call the primary tumor site and the site subtype, which is an anatomical part; and then they have something that describes the type of tumor, the primary histology and the histology subtype. Just to give you an idea of the process, I have three examples here. The first one is a simple one, choroid plexus papilloma, where the primary site is in the central nervous system, on the choroid plexus. They don't mention exactly what the primary histology is, but the histology subtype is choroid plexus papilloma. At first we thought this might be nice, because these histology subtypes looked like very specific terms that would be easily mappable to a nice vocabulary; in this case, it matches perfectly. However, that's not always the case. The other two examples show this: for simple endometrial hyperplasia, you need to use all the different fields in order to reconstruct the term; otherwise, if you were to pick only the subtype, you would end up with "simple", which is not very useful. They also have multiple different ways to describe similar things. For example, for skin cancer, they describe not only the primary site as the skin, but also, in the subtype, on which body part the melanoma was found. For us this was not very relevant, and it's not in the NCI Thesaurus, so we grouped all of these under a single term, cutaneous melanoma. And sometimes they're a little inconsistent, in that all of these are malignant melanomas, but only some of them have "in situ melanocytic neoplasm", and some of them have "not specified" in the subtype. So, just to reconcile all that, we went manually through all the different combinations of these four fields in COSMIC, and from almost 2,000 COSMIC diseases we ended up mapping onto 775 NCI Thesaurus terms.

COSMIC also sequences cell lines. Cell lines, if any of you are familiar with them, are widely used in a lot of papers doing human research. However, they're shared around in different ways: some people acquire them directly from cell line collections, while others get them from the lab next door or from their colleagues. And there's a huge mess in terms of standardization; the names are very confusing. A cell line name is usually letters and numbers: sometimes written all together, sometimes with a hyphen, sometimes with a space. Sometimes it's a clone, so there's a dot-one at the end. Some people drop the hyphen and it's the same cell line, or sometimes the versions with and without the hyphen are different cell lines from different origins. So you cannot just use the cell line name that you find in a paper and be sure that you're looking at the same cell line as in another paper, even if it's written exactly the same way.
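To make this concrete, here is a minimal sketch of what this kind of terminology reconciliation can look like in practice. The dictionary entries, the placeholder values and the normalisation rule below are illustrative assumptions, not the actual neXtProt/Cosmosaurus mapping tables.

```python
# Illustrative sketch only: the mappings and the normalisation rule are made up
# for this example and are not the actual neXtProt/Cosmosaurus tables.

import re

# Map a COSMIC sample description (four fields) onto a single disease term.
# Keys: (primary_site, site_subtype, primary_histology, histology_subtype)
COSMIC_TO_NCIT = {
    ("central_nervous_system", "choroid_plexus", "NS", "choroid_plexus_papilloma"):
        "Choroid Plexus Papilloma",
    ("endometrium", "NS", "hyperplasia", "simple"):
        "Simple Endometrial Hyperplasia",   # needs all four fields, not just "simple"
    ("skin", "arm", "malignant_melanoma", "NS"):
        "Cutaneous Melanoma",               # body part judged not relevant, so grouped
    ("skin", "leg", "malignant_melanoma", "NS"):
        "Cutaneous Melanoma",
}

def map_cosmic_sample(primary_site, site_subtype, primary_histology, histology_subtype):
    """Return the mapped disease term, or None if a curator still has to look at it."""
    return COSMIC_TO_NCIT.get(
        (primary_site, site_subtype, primary_histology, histology_subtype)
    )

# Cell line names: naive normalisation collapses spelling variants of one line...
def normalise_name(name):
    return re.sub(r"[\s\-]", "", name).upper()

print(map_cosmic_sample("skin", "leg", "malignant_melanoma", "NS"))  # Cutaneous Melanoma
print({normalise_name(n) for n in ["BrCa-5", "BrCa 5", "brca5"]})    # one key: {'BRCA5'}
# ...but it cannot tell apart two genuinely different lines whose names differ only
# by a hyphen, which is why curated cell line identifiers are needed, not raw names.
```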
In terms of resources available, there was nothing complete. There were obviously the catalogs of the ATCC and the other cell line collections, but those only contain what they sell. There was one small ontology, but again it was very incomplete. So we developed the Cellosaurus, which is a thesaurus of cell lines, and Amos really did essentially all the work here. It now contains 30,000 different cell lines from 255 different species, most of them from human and mouse, with thousands of synonyms, references, and cross-references, and we use these terms to annotate the cell line origin of the COSMIC data and of all the other data where we have cell lines. If you're interested, you can download it from our FTP site, shown there, but it's also loaded in neXtProt, so you can do a search in neXtProt for the cell lines and find them. Here is an example: this is a breast cancer cell line called BrCa 5, written with or without a space. However, it has been shown in papers to be simply a HeLa derivative, which can confound research if you don't take that into account.

The other thing I would like to mention about neXtProt is that in the past year we changed our architecture quite a bit. The goal was to implement advanced search functionality, to manage protein lists, and to build analysis tools on top of neXtProt, and the way neXtProt was designed before, we weren't able to do all these things very easily. I don't want to spend any time explaining the technical details; the slide is really just to say that changes in infrastructure, if any of you have done this, are very difficult and not very rewarding work, because you don't have much to show at the end. So the slide is really to thank our software team, which spent a lot of effort last year redesigning the back end of neXtProt. What's important, at least for me, is that the API is now decoupled from the other services, and what that means in practice is that it gives us more flexibility to design other tools later.

The first tool that we designed on top of the newly engineered neXtProt is our search. We now have two different search modes: simple search and advanced search. This will come up again later, but I want to make sure I tell you that the new search is not yet available on the neXtProt production site; you can go to search.nextprot.org and play around with it. The simple search works very similarly to our current search; the ranking is a little different, but otherwise it's the same. We do have new functionality for sorting and for creating lists here: you can select all the entries or some of the entries, it will put these entries in a little basket there, and then you can go and manage your list in some of the tools that we have. You can also filter the entries based on whether they have any data for, for example, disease, expression, mutagenesis, and so on. This filtering functionality already exists, but I thought I would point it out again in case you haven't seen it.

As for the advanced search: once you're on search.nextprot.org, you can access it through a very tiny link just below the search box; I wanted to make sure I showed it, because even for me it took a while to find. The advanced search is based on an RDF representation of our data.
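To give a flavour of what querying such an RDF representation looks like, here is a minimal Python sketch using the SPARQLWrapper library. The endpoint URL, the prefix and all predicate names are placeholders invented for illustration; they are not the actual neXtProt schema or endpoint.

```python
# Minimal sketch of an RDF/SPARQL query from Python. Everything in the query body
# (prefix, classes, predicates) and the endpoint URL are illustrative placeholders,
# NOT the real neXtProt vocabulary.

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX ex: <http://example.org/proteindb#>
SELECT DISTINCT ?entry WHERE {
    ?entry ex:classifiedAs ex:Enzyme .          # proteins that are enzymes
    ?entry ex:hasAnnotation ?site .
    ?site  a ex:MutagenesisSite ;               # with at least one mutagenesis site
           ex:effect ex:DecreasesEnzymeActivity .
}
"""

endpoint = SPARQLWrapper("https://sparql.example.org/endpoint")  # placeholder URL
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

for row in results["results"]["bindings"]:
    print(row["entry"]["value"])
```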
So we also remodeled all the data into this subject-predicate-object format in order to be able to do these advanced searches. The advanced search is based on SPARQL, the RDF query language. Whether or not you have used it, I can tell you that the syntax is not super user-friendly, and on top of that, even if you know the language, you need to know the data model in order to write a query; even if you know how to do it in one place, you might not be able to do it easily on another site. So, to give users a kickstart, we pre-wrote queries; right now there are 112 different queries, each explained in free text, and you can click on any one of them to launch it and then modify the query more easily. We also have extensive help on the data model and so on, and of course we're available if you want help designing a query.

This is one of these queries: proteins that are enzymes and that have at least one mutagenesis site that decreases or abolishes the enzyme activity. This is the kind of thing we couldn't do with the simple search, which is just text-based. The results of the advanced search go to the same results page as the simple search, with the same functionality of sorting, filtering, making lists, and so on. And this is one data point from that result: in this case it's an enzyme, and it has one mutated residue here that is shown to strongly decrease the enzyme activity, so you can follow through and see the data. I should also mention that on the advanced search, people can create and save their own queries; you need to create a login so that we know the query belongs to you, but you can definitely do that. You can also do federated queries with other resources (this is one of the advantages of SPARQL endpoints), and we have examples of that in the list of pre-designed queries. And, like I said, there's extensive help.

Okay, so the second part of my talk is about our manual annotation projects. This is really where I've been spending most of my time since I joined the group. We have undertaken three different manual annotation projects: one on protein kinases, one on cancer variants, and one on sodium channels. In order to do annotations in neXtProt, we designed a whole new tool that we call the BioEditor; it's a web-based tool that we use to do annotation. I'm going to first describe how the BioEditor is designed. We do annotations directly as triplets, again with this subject-relation (or predicate)-object model. Because we annotate proteins, the subject is what we call a bio-object: either a protein, an isoform, a protein with a PTM on it, a variant, or a complex of two or more proteins. The relation is taken from a vocabulary we developed in-house, because there wasn't anything out there that really met our needs; we could contribute to the relation ontologies that exist, but at the time we needed to get going really fast, so we just went ahead and created what we needed. The relations we have are, for example, binds, phosphorylates, is involved in, regulates, causes phenotype, causes disease; there are currently 82 relations in that vocabulary. And then the object can be, again, a bio-object.
So you can say "my protein binds another protein"; it can be a chemical, "my protein binds calcium"; it can be a Gene Ontology term, "my protein is involved in Wnt signaling"; or a disease, "my protein, or a variant of that protein, is found in a certain disease". Sometimes we also felt the need to create extended triplets, because the plain triplet format was a little restrictive. An example of an extended triplet is "LYN phosphorylates BTK producing...", where we then give the exact protein form with the modification produced by this phosphorylation event.

The annotation scope of the BioEditor is pretty vast. It's divided roughly into two aspects: the function of the protein and the role of the protein in disease. In terms of function, we capture what we call molecular processes, which are the direct interactions and substrates of proteins: what a protein binds, what it phosphorylates, or other post-translational modifications. We also capture Gene Ontology biological processes. I put variant phenotypes here under function because they help us understand the function of a protein, although they can also be considered a little bit as the role in disease; we annotate directly whether a variant has a phenotype, for example decreased binding to another protein. I'm going to talk about this more when I give you a specific example. In terms of roles in disease, we capture misexpression of proteins in diseases, that is, whether the protein or RNA level is increased or decreased in a disease; any epigenetic effects; and whether the protein serves as a biomarker, for example anything to do with the protein, either a variant or an expression level and so on, that can be associated with the establishment of a diagnosis, with prediction of disease progression, or with response to different treatments. We also annotate variants that are found in patients and associated with disease; some of these associations of sequence variation with diseases we annotate directly from COSMIC, so those are automated, but we also annotate from the literature when we find extra ones. And we annotate whether proteins are relevant for animal models of diseases.

These are a few examples of annotation statements that we make with the BioEditor. We can have the simple "ROCK2 binds NPM1". We can have a biological process, "ROCK2 positively regulates centrosome duplication". And here in green (is it green? in yellow) we have the ability to write some extra free text if we feel that the triplet is a little dry. This is an example of misexpression in liver carcinoma, and "ROCK2 serves as a prognostic biomarker for bladder carcinoma".

Each of our annotations is supported by evidence, and we can have one or more pieces of evidence supporting each statement. The evidence is divided into different parts. One is the evidence code, which comes from the Evidence Code Ontology, to which we've contributed about 150 additional terms; the Evidence Code Ontology contains things like Western blot, knockout, mutagenesis, and so on, that is, the experiments that were done. We can also provide a little bit of experimental detail if we feel it's necessary. We also capture in which biological model, organ, or cell line the experiment was done, as well as the protein origin. The reason we make this distinction is that sometimes they will transfect, let's say, a mouse cDNA into a human cell line, and we want to capture both of these things separately.
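As a rough illustration of the kind of data structure just described, here is a minimal Python sketch of an annotation triplet with its evidence. The class and field names are invented for this example and are not the actual BioEditor schema; the evidence values in the example are placeholders.

```python
# Illustrative sketch of the triplet + evidence model described above;
# class and field names are invented for this example, not the BioEditor schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class BioObject:
    """Subject of a triplet: protein, isoform, PTM form, variant or complex."""
    kind: str          # "protein" | "isoform" | "ptm_form" | "variant" | "complex"
    identifier: str    # e.g. an accession or a description of the modified form

@dataclass
class Evidence:
    eco_code: str                  # term from the Evidence Code Ontology
    experimental_detail: str = ""  # optional free-text detail
    biological_model: str = ""     # organism / organ / cell line the experiment used
    protein_origin: str = ""       # species the protein (e.g. transfected cDNA) comes from
    reference: str = ""            # PubMed ID, database record or website

@dataclass
class Annotation:
    subject: BioObject
    relation: str                  # one of the ~82 in-house relations ("binds", ...)
    obj: str                       # bio-object, chemical, GO term or disease
    note: str = ""                 # optional free text when the triplet feels too dry
    evidences: List[Evidence] = field(default_factory=list)

# Example statement from the talk, with placeholder evidence values
ann = Annotation(
    subject=BioObject("protein", "ROCK2"),
    relation="positively regulates",
    obj="GO: centrosome duplication",
    evidences=[Evidence(
        eco_code="<ECO term, e.g. a mutagenesis or assay term>",
        biological_model="<organism or cell line used>",
        protein_origin="<species the protein comes from>",
        reference="<PubMed ID>",
    )],
)
```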
And obviously we capture a reference for each of the annotations. The reference can be a publication in PubMed, a website, or a database; the vast majority are from publications. Here is an example of evidence, for an annotation I showed in the examples before: ROCK2 positively regulates centrosome duplication. In this case it's a constitutively active mutant, and they did immunocytochemistry, where they transfected ROCK2 and labeled the cells by gamma-tubulin immunostaining; and in this case, they transfected bovine ROCK2 into mouse cell lines.

Another thing that we can do with the BioEditor is connect annotations together, because otherwise the triplets are floating around and it's difficult to understand the biology. This is an example of how we capture the fact that BTK, BLNK, SYK, and PLC-gamma-2 play a role in the B-cell receptor signaling pathway. These different annotations are all connected together via another relation; we connect those triplets. So we have the different steps: BTK binds BLNK, producing a BTK-BLNK complex; that complex translocates to the plasma membrane; SYK, another protein, comes in and phosphorylates the BTK-BLNK complex on BTK, producing BTK phosphorylated at Tyr-551; BTK phosphorylated at Tyr-551 autophosphorylates on Tyr-223; and then that doubly phosphorylated form phosphorylates PLC-gamma-2. So we have the whole chain of events, and all these individual triplets are also part of the B-cell receptor signaling pathway.

This is our BioEditor annotation workflow. Our team of curators reads papers and enters the data in the BioEditor. The BioEditor has a number of validation rules that must be met for an annotation to be valid: you need to have evidence, you need to have a species, the protein being annotated needs to exist, and a number of things like that. As long as the annotation is not valid, the curator fixes it. We usually annotate entry by entry, so we do one protein, possibly covering various aspects, but we annotate something that makes sense biologically. Once that entry is finished being annotated, we send it to QC, which is supervised by Monique, and she does a manual review of the annotation. It's fairly global, but I don't think it has ever happened that there was no feedback; usually there's feedback sent to the curators with questions and suggestions for improvements, and a round of corrections. After that, when everybody's happy, we can export the data.

Next I'm going to describe the three projects that we're working on with the BioEditor. The first one is what we call the Kinase Knowledge Platform, or KKP. This was a contract with Merck Serono, back when they were in Geneva, aimed at providing annotations to support their drug screening platform. The contract ran for two years, in 2012 and 2013, during which we annotated 300 of the 500 human protein kinases. We read 13,000 different papers and produced 30,000 different annotations. That was a huge amount of work, done by 11 people and a subcontract with Molecular Connections; these 11 curators were curators from our group as well as some people we subcontracted. This is just a little Cytoscape illustration of the data for four of the proteins, the ones that are involved in the B-cell signaling pathway, and this is just in terms of function.
So, what substrates do those proteins have? The four targets are in red, at least on my screen, and these are the substrates of these four targets only. You can see how we rapidly get a lot of annotations doing this binary annotation. We would love to finish the kinase project, that is, to do all 500 kinases, but right now we don't have any funding for this, so we'll see what happens.

The second project I'd like to describe is called Caviar, for the annotation of cancer variants. This is a collaboration with Sophia Genetics here in Lausanne, and it's funded by the Swiss League against Cancer. The idea is to create a corpus of manually curated protein variants in order to improve the interpretation of variations in patients. Sophia Genetics is developing software to help clinicians analyze their next-generation sequencing data from patients, and they do integrate a lot of data, like SIFT and PolyPhen and so on, but we also want to enhance this, as much as we can, with the direct functional impact of the mutations in those different variants. Right now we're focusing on two main diseases: Lynch syndrome, which is caused by defects in mismatch repair genes, and hereditary breast and ovarian cancer, which is usually caused by defects in BRCA1 and BRCA2. What we capture in terms of the functional classification of variants is any defect that has to do with the enzyme activity, changes in binding partners, impacts on cellular processes like DNA damage or cell cycle progression, and any other phenotypes; so we go pretty broad. Even if there's a mouse knock-in where they made a mutation similar to a human mutation, and we don't know how relevant the phenotypes are (the cells might die earlier, for example), we still capture as broadly as we can, so that if we find human mutations at the same position we know there is an effect in a model system.

Again, this is based on the triplet annotation model. In this case, the subject is a variant. The relation is something like increases, decreases, has phenotype, or has normal, because we also capture variants that have a normal phenotype; it's also very important to know that something has been tested and found to be normal. And the object is, again, either a bio-object, so we can say this variant increases or decreases binding to a certain substrate; a Gene Ontology term, for example this variant decreases DNA mismatch repair; or a term from a phenotype ontology. We also capture the intensity of the phenotype, because we felt it was necessary: sometimes you have a very small effect, and you don't want to say it's normal, you want to indicate there's an effect; but some mutants cause really big effects on the protein, and we wanted to be able to distinguish between those cases. So we capture the intensity of the phenotype as a qualitative label: mild, moderate, or severe.

Examples of these variant annotations (oops!) are: BRCA1 with a mutation of leucine 63 to phenylalanine decreases binding to UBE2D1. And these two phenotypes are related, right? Since it doesn't bind UBE2D1, which is the E2 ubiquitin-conjugating enzyme, that BRCA1 mutant doesn't have ubiquitin-protein transferase activity: it doesn't bind the E2, so it cannot transfer the ubiquitin. Oh, there's an animation; you cannot see this at the bottom. Okay.
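Here is a minimal sketch of how this variant-phenotype annotation model could be represented, using the BRCA1 example above. The type names are my own, and the intensity values assigned below are illustrative placeholders, not curated conclusions from the Caviar corpus.

```python
# Illustrative sketch of the variant annotation model (not the actual Caviar schema);
# the intensity values below are placeholders, not curated conclusions.

from enum import Enum
from typing import NamedTuple, Optional

class Intensity(Enum):
    MILD = "mild"
    MODERATE = "moderate"
    SEVERE = "severe"

class VariantAnnotation(NamedTuple):
    variant: str                     # subject: a protein variant
    relation: str                    # "increases" | "decreases" | "has phenotype" | "has normal"
    obj: str                         # bio-object, GO term, or phenotype ontology term
    intensity: Optional[Intensity]   # None when the phenotype is normal

annotations = [
    VariantAnnotation("BRCA1 p.Leu63Phe", "decreases",
                      "binding to UBE2D1", Intensity.SEVERE),                 # placeholder intensity
    VariantAnnotation("BRCA1 p.Leu63Phe", "decreases",
                      "GO: ubiquitin-protein transferase activity", Intensity.SEVERE),
    VariantAnnotation("BRCA1 p.<other variant>", "has normal",
                      "GO: DNA repair", None),                                # tested and found normal
]
```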
This shows the positions of all our annotations on BRCA1 versus the positions of all the variants. The bottom line shows all the variants that we've integrated in neXtProt, from COSMIC and everything else; they're distributed almost evenly throughout the sequence. But the variants that have phenotypes are mostly in the domains of BRCA1 that are known to have a catalytic activity: the ubiquitin ligase domain at the N-terminus, and, at the C-terminus, regions that are important for binding to important interaction partners for their role in homologous recombination. It's interesting, though not extremely surprising, that the mutations with the most impact are in these important regions of the protein. There are a few mutations throughout that do cause phenotypes; I don't understand exactly why yet and need to look at that in more detail. So it's not completely restricted to these regions, but it's certainly more concentrated there. The other thing I should say is that when you read these papers, most people test what they believe they can test: the ubiquitin ligase activity will not be tested for mutations that are completely outside the ubiquitin ligase region.

These next couple of slides show how these annotations could be integrated into the Sophia Genetics software to present the information to clinicians. This is a list of variant mutations on the BRCA1 or BRCA2 proteins, and if there's a variant with an impact on function, the person can click and open a pop-up that shows all the different phenotypes associated with this one mutation. The relative intensity here is shown by the circles: the full red circle is severe, and the white circles are normal. So you get an idea that this one impacts a lot of different aspects of BRCA1 function, and pretty severely; if a patient has a mutation there, some action should be taken. And if you click through on any one of the annotations, you can see the evidence and where the information is coming from. This is something that, when we talk to Sophia Genetics and they talk to their customers, people are interested in: they don't just want the conclusion, they also want to be able to go back to the paper and see for themselves whether they agree with the conclusion that we've drawn.

The third project that we're working on is called NAVMID Predict. This one is again on variants, but on sodium channels: we would like to predict the pathogenicity of variants in channelopathies. This project is a collaboration with Hugues Abriel at the University of Bern, funded by the SNF. The idea is again similar to the cancer project: to assess the severity of mutations. This is a completely different area of biology, but sodium channels regulate the nerve impulse, and mutations in them can cause different effects such as epilepsy, migraine, neuropathies, paralysis, or cardiac arrhythmia, depending on which protein is mutated and also on which residue of the protein is mutated; that can also change the disease. The project has three parts. The first is to produce a corpus of annotations, the same thing we're doing for the other project. The second is to use this information to produce a tool that predicts the pathogenicity of newly discovered variants, based on the knowledge we're gaining by curating the literature. And finally, some of these predictions will be validated in the lab by Dr. Abriel's group.
What we did for this project: there was no ontology to describe the successive steps of sodium channel function, so we developed a simple, not very big ontology that allows us to describe precisely the phenotypes that we observe. The channel is normally in a closed state, and then (I don't know why this is all animated, sorry) it can go to an open state through a step called activation. While it's open, the ions go through. After that, the channel is inactivated; during this stage it cannot reopen and it doesn't transmit current. Then it needs to go through a recovery step to be closed again, and then it can start the cycle again. Mostly it goes in this one direction; some steps can also go in the reverse direction, and some mutants have increased levels of these wrong-direction transitions. So we've captured in this electrophysiology ontology all these different steps in all the directions. The NAVMID Predict tool that we're going to develop (we haven't started yet) will take into account all aspects of the channel sequence and function: all the sites, all the domains, all the interaction sites, and all the functional impacts of the variants that we are now curating, and then use this to calculate a score predicting the pathogenicity of new variants based on the knowledge that we have.

The last little part is about text mining approaches to help us do biocuration. This project is called Nexpresso; it's a collaboration with Patrick Ruch in Geneva, again funded by the SNF since last year. The goal of this project is to integrate text mining tools into our manual curation workflow. We want to help curators find papers, or rather, not just find papers, but find data that is supported by experiments. We don't want just statements in the abstracts or general things; we want to find specific information. For example, sometimes you just get "cancer", and we would like to know exactly which cancer is described in the paper. We would also like to have non-redundant information, because in my experience the text mining tools are very good at telling you what you already know, but they're not as good at telling you new things, because they need to be trained, and they haven't been trained on the things they don't know. So we're trying to gear the system so that it leans a little more heavily on the non-redundant material.

What the software needs to do to help the curators extract information is to recognize entities. Entities are, for example, protein names or variants, and by recognizing entities I mean not just recognizing a protein name, but trying to map it to a database object: it's not good enough to know that it's p53, I would like to know whether it's the mouse or the human p53. It's also nice if it can recognize concepts, either Gene Ontology terms or diseases, and show us where in the paper it found the assertion, so that we can go back and verify whether or not we agree with it. This is really more what Patrick's group is doing, and right now the project is at a stage where it's mostly on their side to do most of the work, but this is an idea of where we would like it to go. The curation workflow using Nexpresso would start from a huge corpus of literature; we all know there are a lot of papers out there and that the curators cannot read them all. The text mining tool comes in and extracts concepts and entities.
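As a toy illustration of the entity-grounding step just described (mapping a recognized protein mention to a specific database object, including the species), here is a minimal sketch. The tiny lookup table, the placeholder accessions and the disambiguation rule are invented for this example and do not reflect the actual Nexpresso pipeline.

```python
# Toy illustration of entity grounding: map a protein mention to a database object,
# disambiguating the species from context. The lookup table and the rule are made up
# for this example and are not the actual Nexpresso pipeline.

# Mention -> candidate database records, one per species (accessions are placeholders)
CANDIDATES = {
    "p53": {"human": "EXAMPLE:P53_HUMAN", "mouse": "EXAMPLE:P53_MOUSE"},
}

SPECIES_CUES = {
    "human": {"human", "patient", "patients"},
    "mouse": {"mouse", "murine"},
}

def ground_mention(mention, sentence):
    """Return (accession, species) for a protein mention, or None if it cannot be grounded."""
    candidates = CANDIDATES.get(mention.lower())
    if not candidates:
        return None
    words = set(sentence.lower().split())
    for species, cues in SPECIES_CUES.items():
        if words & cues and species in candidates:
            return candidates[species], species
    return None  # ambiguous: leave it to the curator

print(ground_mention("p53", "We transfected mouse p53 into the cells"))
# ('EXAMPLE:P53_MOUSE', 'mouse')
```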
And then some magic happens and it generates what I call proto-annotations, and the idea is to present them to the curators and say: in this paper, I think they show evidence that this protein is located in the nucleus. We can then go back and verify the assertion, and if the curator agrees, they accept the annotation, and somehow it would be pre-populated in, for example, the BioEditor. If the curator doesn't agree, they reject the statement, and it gets fed back into the system to train it better, so that hopefully it doesn't show up again.

As a conclusion, I just want to say that what I presented today covers many different projects in many different areas. We're doing quite a lot of diverse things at neXtProt, but the underlying theme is really biocuration. And I strongly believe, feeling also very strongly about the Biocuration Society and so on, that biocuration has a lot of potential to be integrated at many different steps within what I call the research cycle. There's curation and analysis needed for data to be loaded into repositories, either before or after they're loaded; we know there's a lot of curation going on there. When papers get published, it would also be nice to already link the papers more with databases and with ontologies and so on. Right now, most of the curation is done post-publication, but all this information is really being used by researchers on a daily basis to generate new hypotheses. So the curation work of integrating, analyzing, and formalizing data is, I believe, extremely important to help researchers discover the data in all those publications, because, just like the curators, they cannot read all the papers and find everything.

This is the acknowledgement slide of everybody who has contributed to this work. In terms of content, this is the team of biocurators who work with me: Aurore, Jonas, Isabelle, Paula, and Valérie. There are also former members and contributors, mostly from UniProt, who worked on the kinase project. The software team: Alain, Anne, Frédéric, Valentin, Daniel, and Pierre-André, who directs our software group. Quality control is done by Monique, and the directors of the group are Lydie and Amos. So, I'm available for questions.