So, welcome all to this SIB Virtual Computational Biology Seminar Series. Today we have the pleasure to host Fabio Rinaldi, who is a senior researcher at the Institute of Computational Linguistics of the University of Zurich, where he coordinates the OntoGene project, a text mining project for the biomedical literature. Fabio holds a bachelor's and a master's degree in computer science, obtained at the University of Udine in Italy, and he then worked in major research centers, such as the Institute for Scientific and Technological Research in Italy, which is now the Fondazione Bruno Kessler, the Institute for Integrated Publication and Information Systems in Germany, and the University of Manchester Institute of Science and Technology in the UK. In 2008 he completed his PhD in Computational Linguistics at the University of Zurich, and during his career Fabio has also visited different universities as a research visitor, such as the University of Colorado, the University of Tokyo, and the University of Mexico, to name a few. Fabio has been working at the Institute of Computational Linguistics of the University of Zurich since 2005 and has also been affiliated to the SIB since 2016. His group, which is called the BioMeXT group, specializes in information extraction from the biomedical literature, as well as from other textual sources. Information extraction is an important component of text mining systems, and the group specializes in the extraction of domain-specific entities, such as genes, proteins, drugs, and diseases, and of their semantic relations, such as protein-protein interactions and gene-disease associations. Additionally, the group provides an environment for assisted curation, which is currently being used in the curation pipeline of the RegulonDB database, in a project funded by the US NIH. So today Fabio will tell us more about information extraction and its applications for biomedical knowledge discovery.
So thank you again, Fabio, for accepting to come over here, and the floor is yours. Thank you. Thank you very much for inviting me, and thank you very much for this introduction. I'm surprised that you could find so much information about me, even things that I had completely forgotten. So let me start. I have structured the presentation in four parts: a rather large introduction, to introduce the problems that we are dealing with and some example applications; then a smaller part about the material we are dealing with, which is mainly the biomedical literature, although the techniques that we use can potentially be applied to other forms of textual data; then a section about our own research in this area, especially recent research — this part might be a bit less exciting for those who are not in this domain of text mining, but I hope to keep your attention anyway; and finally a section about our recent and current projects, where I hope again to be able to show you some exciting material. So let's start with the introduction. We can compare text mining, or information extraction, with conventional document search. This is done often in this kind of presentation. In conventional document search, basically, you enter keywords, and the system finds documents that contain those keywords, or some of them, or that are relevant to those keywords. You are still left with the task of going through those documents and finding the information you were originally looking for. Suppose you were interested in causes of diabetes or treatments of diabetes. You will get documents talking about those things, but you still have to go through these documents.
The idea of text mining, which is a more general term covering information extraction, is that starting from various forms of text — books, publications, other sources as well — we are able to create structured information, in the form of entities like proteins, genes, and diseases, and interactions among them. By the way, this picture comes from an online tool which has been running for several years, I think since 2004, and which attempts to create exactly this kind of structure. Text mining has been defined several times. For example, this definition by Marti Hearst is often cited: the discovery by computer of new, previously unknown information, by automatically extracting information from different written sources. A key element is linking together the extracted information to form new facts or new hypotheses to be explored further. It's a very ambitious goal: how do we find new, unknown information? That citation comes from an article by Hearst in a magazine, where she was briefly describing results that she had previously presented in a paper at the ACL, the conference of the Association for Computational Linguistics, in 1999. And interestingly, the application she was dealing with in that paper is an application in the biomedical domain. Basically, she was saying that what we want to do is solve some problems in molecular biology — for example, find the functions of novel genes by finding relationships with other known genes. Say we have some features of a newly discovered gene which are similar to features of known genes; we look for information about those known genes and find potential suggestions for the function of the unknown gene, by association with the known genes. And I think this is very important.
The emphasis of the system is to automate the tedious parts of text manipulation and to integrate computationally driven analysis with human-guided decision-making. And this is important: the human-guided decision-making. The system is not making decisions; the human is making decisions. The system is providing suggestions. That was an experimental application. I think the definition of Hearst is based on an earlier idea, already published in the late 80s and early 90s by Don Swanson, which he called literature-based discovery. These were the days before PubMed, so they didn't have a lot of literature available, and most of the work by Swanson was basically manual. The fundamental idea was this: suppose you are interested in a connection between an entity A, say an illness, and an entity C, for example a drug, and we don't have any explicit evidence of this connection. But maybe we have evidence of a connection of illness A with chemical B, with enzyme M, and with gene Y, and at the same time we find connections of drug C with the same entities. Then we could hypothesize that there is a connection between the two. That was the idea of Don Swanson, and in fact he tested this method in a number of cases, with the few publications available at the time. A famous example here is the connection between migraine and magnesium. Basically, he took facts from the titles of papers: stress is associated with migraine; stress leads to loss of magnesium — so here we have a connection through stress. Calcium channel blockers prevent migraine, and magnesium is a natural calcium channel blocker — another connection, through calcium channel blockers. And so on with that example. Through these connections, he was able to hypothesize potential, not yet proven relationships between different entities. And in fact, I think a couple of those hypotheses were later proven experimentally.
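Swanson's ABC pattern is simple enough to sketch in a few lines of code. The following is a minimal illustration of the idea, not his actual method — the entities and direct links are hard-coded from the migraine example just discussed:

```python
from collections import defaultdict

# Direct links, as they might be mined from paper titles; these pairs
# are taken from the migraine example above.
links = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
]

# Build an undirected adjacency map: entity -> set of directly linked entities.
neighbours = defaultdict(set)
for a, b in links:
    neighbours[a].add(b)
    neighbours[b].add(a)

def abc_candidates(a, c):
    """B terms that connect A and C indirectly, when no direct A-C link exists."""
    return sorted(neighbours[a] & neighbours[c])

print(abc_candidates("migraine", "magnesium"))
# -> ['calcium channel blockers', 'stress']
```

No direct migraine–magnesium link appears in the input, yet the two shared intermediates surface exactly the hidden connection Swanson hypothesized.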
I prefer to call this kind of approach literature-based discovery, and I take it as going beyond text mining. For me, text mining really means the techniques behind it — the techniques of text processing that enable this kind of approach. My preferred definition is this one: text mining is the ability to process unstructured text, typically a large set of documents, interpret its meaning, and automatically extract concepts as well as relationships among those concepts. So: find entities, objects that are relevant in the particular domain — in biology these could be genes, diseases, experimental methods — connect them through relationships, and use this information extracted from the literature for applications. Applications could be assisted curation, knowledge base creation or update, question answering; I will discuss a couple of examples later. Here we have a vision paper by Andrey Rzhetsky from 2009, where he basically says that starting from the literature and applying information extraction, you get these objects — the entities and the relationships among them — and you can use them in different types of applications, like question answering, or studying temporal trends, the evolution of a field over time, by seeing which things have been talked about more or less over time. The information extraction step, at the top, delivers phrases which link entities to predicates, like an enzyme activates something, or an RNA up-regulates or down-regulates a protein. These facts can then be stored in a database, and this is the material on which data mining techniques can work. We could even find conflicting facts, because some experiments in the literature might not be reliable. And by seeing that some relation between A and B is reported as positive in the majority of cases and negative in only a few cases, we could hypothesize that the cases where it is negative might be errors.
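That last idea — flagging probable errors by majority — can be sketched very simply. The fact tuples below are invented for illustration; a real system would fill them from the extraction step:

```python
from collections import defaultdict, Counter

# Invented extracted facts: (subject, predicate, object, reported polarity).
facts = [
    ("enzyme X", "regulates", "protein Y", "positive"),
    ("enzyme X", "regulates", "protein Y", "positive"),
    ("enzyme X", "regulates", "protein Y", "positive"),
    ("enzyme X", "regulates", "protein Y", "negative"),
]

# Count how often each polarity is reported for each relation.
polarity_counts = defaultdict(Counter)
for subj, pred, obj, polarity in facts:
    polarity_counts[(subj, pred, obj)][polarity] += 1

# The majority polarity wins; minority reports are flagged as suspect.
for relation, counts in polarity_counts.items():
    majority, n = counts.most_common(1)[0]
    suspect = sum(counts.values()) - n
    print(relation, "->", majority, f"({suspect} suspect report(s))")
```

With three positive reports against one negative, the single negative report is the one worth double-checking.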
And the potential long-term goal would be to construct a map of science, a global description of the relationships between all those entities. Now, I've talked about information extraction; let me explain with examples from another domain what information extraction is. This is a classical example from a series of lectures. Information extraction is an approach which was born about 20 or 25 years ago, with competitions called the Message Understanding Conferences, where the goal was to obtain from text a structured representation of some essential content of that text. Here you see the essential steps: recognize the fact that there are names of people here, recognize the fact that there are names of organizations, and find words that express a relationship between those persons and those organizations. We collect these facts without doing any kind of deep processing, without trying to understand the text in detail — just getting this elementary information from it and trying to combine it to create a database that shows the relationships between these objects. This is the idea of information extraction: the first step is named entity recognition, then relation extraction, and then the construction of the database. The same idea can be applied in several domains, and my work focuses on the biomedical domain. Where does this lead us? What can we get with these techniques? This is an example of a widely used commercial application of biomedical text mining. I've blanked out the name of the product because it's a commercial product, but anyone in the audience who has worked at a pharmaceutical company will be able to recognize it, because it's been sold to many pharmaceutical companies, and it costs a lot of money to be able to mine the literature in this way. This tool recognizes different types of entities, it can be customized in several ways, and it is rule-based.
We can write specific rules to recognize relationships among the entities. In this case, you see an application to chemical substances that have a medical relevance — medicines, I think — and you see quantities that have been recognized. The tool is also able to recognize synonyms; here, for example, one name has been recognized as a synonym of another. This is an important problem: there are often different names for the same things, for the same objects. Another example is again a commercial tool — I believe it comes from a research institute, but they sell it commercially, so again I omit the name. The idea here is to be able to interactively recognize different types of entities, which can be important for understanding quickly what a document is about. There are some diseases here, breast cancer, an LDL protein, BCFP — sorry, the picture is not at full resolution — and the tool tries to associate these fragments of text that have been recognized with some reference, and the reference is an entry in a database. For example, when we recognize a gene, we could go to Entrez Gene and provide a link to the Entrez Gene entry that corresponds to that piece of text. And for each entity we would probably find some reference. Okay, so what do we need in order to implement a solution like that? Now comes the part which is a bit more technical: we need to use techniques from my field of origin, which is computational linguistics and natural language processing. What you see might appear easy at first sight, but it is not. There are different levels of analysis that we might need to perform in order to find what we need to find, and here I have listed some. First of all, we need to identify the sentences. Again, this seems trivial, but sometimes it's not. A dot does not always mean the end of a sentence: it could be an abbreviation, it could be something else.
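To make the abbreviation problem concrete, here is a toy sentence splitter — my own sketch, not any production system:

```python
import re

text = "The E. coli strain was grown overnight. Samples were collected."

# Naive rule: a sentence ends at any dot followed by whitespace.
naive = re.split(r"\.\s+", text)
print(len(naive))  # 3 -- the abbreviation "E." wrongly ends a "sentence"

# Slightly better: the dot must not follow a single capital letter
# (a crude abbreviation test), and the next sentence must start with
# an upper-case letter. Note that this heuristic itself fails on papers
# where a sentence legitimately starts with a lower-case gene name.
better = re.split(r"(?<![A-Z])\.\s+(?=[A-Z])", text)
print(better)
# -> ['The E. coli strain was grown overnight', 'Samples were collected.']
```

Each added heuristic fixes some cases and introduces new failure modes, which is exactly why this "trivial" step needs care.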
And I've read that in some cases, in some papers, there are sentences which start with a lower-case letter, for example because they start with the name of a gene that is conventionally written in lower case — then you don't even have the clue that a sentence should start with an upper-case letter. So: easy, but not trivial. Then tokenization. Tokenization means finding the words, splitting the text into words. Again, you might think it is trivial, but that's not always the case. What do we do with a hyphen — should we split at the hyphen or not? In some cases we probably don't want to split the token in two parts, but there are cases where the hyphen separates two syntactic items that we do need to split. Again, it seems easy, but there are a lot of tricky cases, so a lot of attention needs to be paid to do it appropriately. We need to recognize abbreviations. Then coreference: we may be talking in a sentence about something we mentioned before. For example, we mentioned a protein, and then we say "this protein" does something. If we are looking for a connection between the protein that was mentioned in the previous sentence and, in the current sentence, the function of the protein or some properties of the protein, we need to be able to make this connection. And again, it's not obvious, because there is not always a single possible antecedent, a single possible referent — other entities may have been mentioned just before that one. It's a well-known problem in natural language processing. So, you see different steps of text analysis, and here at the top you see the steps of conceptual analysis: recognizing the fact that some of these words represent entities of the domain, objects that are of interest to us — this one represents something of interest, and this one, and this one. But then, in this case, it's not even clear what we want to pick.
"The blood cell", "the white blood cell", "white blood cell death" — all of them could be relevant concepts, and for each of them we could probably find an entry in some database. And once we have found a segment that represents an entity of interest, like this one, the problem is not finished, because if we look up BCL2 in a database — in UniProt, say — we will probably find several identifiers that correspond to it. So which is the correct one, the one meant at this position in the text? This is a problem. And the most complex problem, which is not shown here, would be that of relation extraction: finding the fact that this sentence expresses some kind of relation between these three entities — BCL2, p53, and white blood cell death — and exactly what relationship is being expressed. An approach, a possible approach that has been tested several times, is to use some syntactic analysis. This part of the figure represents the syntactic analysis of this sentence, and I'll show you another type of syntactic analysis later. So these are all techniques that can be used towards building information extraction systems. We don't always need to use all of them. For example, if you're only interested in finding the entities and not in relation extraction, you would probably not do all of this part — it would be overkill. Okay. Information extraction and text mining have been a popular topic for applications in the biomedical domain for several years, and given the relevance, given the fact that there is so much information to be found in the literature, there have been attempts to evaluate how well information extraction and text mining technologies actually do. These evaluations are typically organized in the form of competitions, where organizers provide some tasks for the participants to solve — for example, find all the gene names in these papers.
The organizers provide a set of papers, and the participants have to find all the gene names mentioned in those papers. The first of those competitions, to my knowledge, was an information extraction task organized in 2002 as part of the KDD Cup, where the participants had to build systems to help the FlyBase curation process. Then there has been the TREC Genomics track, which is more about information retrieval, that is, finding documents that contain information relevant to a given problem; the BioNLP competitions, which aim at finding complex relationships between entities; and the BioCreative competitions, which have focused more on finding entities, finding some types of relations, and on the relevance of these techniques to database curation. Okay, now let me mention — a historical overview, if you want — some well-known biomedical text mining systems that are still up and running. Textpresso is used in the curation pipeline of some databases. ChiliBot is an application which allows you to mine the literature for relationships between biomedical entities. iHOP is similar, focusing more on proteins, and Whatizit is another one, which is provided by the EBI in Cambridge.
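Tools like ChiliBot rest on a simple core: counting sentence-level co-occurrences of entity names and turning the counts into a graph. A minimal sketch, with invented sentences and a fixed entity list standing in for a real literature search:

```python
from collections import Counter
from itertools import combinations

# Invented example sentences; a real system would pull these from Medline.
sentences = [
    "CASP3 is activated during apoptosis.",
    "BCL2 inhibits apoptosis in many cell types.",
    "CASP3 cleaves BCL2 during apoptosis.",
]
entities = ["CASP3", "BCL2", "apoptosis"]

# Count each unordered pair of entities appearing in the same sentence.
cooccurrence = Counter()
for sentence in sentences:
    found = sorted(e for e in entities if e.lower() in sentence.lower())
    for pair in combinations(found, 2):
        cooccurrence[pair] += 1

for (a, b), n in cooccurrence.most_common():
    print(f"{a} -- {b}: {n} sentence(s)")
```

The pair counts become the edge weights of a co-occurrence graph; a real system would of course need entity normalization — synonyms, identifiers — before counting, which is exactly the hard part discussed later in this talk.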
As an example — I don't have the time to go through all of them — this is ChiliBot. In ChiliBot you can enter some names of entities — one, two, three, any number of entities. In this case, this is an example I tested yesterday: I entered a process name, apoptosis, and some protein names. What the system has done is gone through the literature, found sentences that mention those entities, computed how often a relationship between each pair of entities is present in the literature, and created a graph which represents the co-occurrence of those entities in the literature. There is a limit on how many papers the system goes through, but here, for example, we see a connection between apoptosis and Casp8. If you click on one of the relationships, the system shows you the sentences that mention that relationship. In this case I clicked on the edge between two of the proteins, and the system showed me some sentences that mention the two, expressing some relationship. The system also tries to determine what type of relationship is present — a positive or negative relationship between the two entities. Okay. Now I've shown you the vision, and I've shown you some applications and some tools as examples. Let me tell you what the starting material of all this work is, and the starting material is the biomedical literature. We can do, or try to do, all this literature-based discovery and text mining only because we have access to a rich source of biomedical literature. The well-known repository of biomedical literature references is Medline, which currently contains 27 million references, and it grows very fast: at the moment there are about two new publications per minute. So it is often argued that keeping track of this amount of publications is impossible for humans, and that is presented as a reason to use text mining technologies, because humans will never be able to keep up with this
amount of information. Of those 27 million, about — I think it's 4.5 million now, I give the updated number on another slide — are available as full text in PubMed Central. PubMed Central is where you get the articles in full text. We take this for granted now, that we have access to the literature, but it's actually not that long that we have had that access. Let me tell you a bit of history, just out of curiosity. Medline originated from the National Library of Medicine, and the first version, if you want, was a printed version: in 1879 they started to print an index of the biomedical literature called Index Medicus, and they kept publishing it until 2004, when they stopped for obvious reasons. In 1964 they started creating a digital index, and in 1971 this index was made available through network connections for remote access, from libraries for example — so from a university library you could access this index information about publications — and it could support as many as 25 users at the same time. Of course, that was another age. Only in 1997, with the internet, did PubMed start, with the idea of making the literature accessible to everyone. But mind you, PubMed makes available only abstracts, and abstracts contain only a limited amount of information; we would like to have access to the full papers. Before I get to the full papers: it is often said that PubMed grows at an exponential rate. I normally take that to mean that it grows very fast, but somebody has actually tried to fit an exponential curve to the growth of PubMed. In this paper from 2006, they took the total entries in PubMed over time and the new entries per year, and they fitted two exponential curves. I'm not sure the fit holds for the entire duration of PubMed — I personally believe it would be impossible to sustain exponential growth forever — but certainly it is very intense growth in publications. Coming to open access: the open access movement is more recent. In 2003 there was a
statement on open access publishing, which basically said that the literature coming out of research funded by public money should be open to the public. It seems only logical, but it wasn't the case for a long time. It was followed by the Berlin Declaration later the same year. PubMed Central was started with the aim of making full-text papers available. Currently there are 4.6 million articles — so the previous number I gave was outdated — and there are about 2,000 full-participation journals, journals that agree with PubMed Central to make all their publications available in full text, and 4,800 partial-participation journals, journals that make available some of their articles, or that make their articles available with a delay — for example, after six months, or after one year, they go to PubMed Central. So, 4.6 million articles is quite a lot — almost 10% of PubMed — and you would think we might be happy with that, but in fact open access doesn't necessarily mean completely open. Within PubMed Central there is a subset of articles called the open access subset, which contains about 1.8 million articles at the moment: articles which are made available under a Creative Commons license, which means they are free to use for several purposes. What is the difference? The other articles you can read — you are free to read them. With the articles in the open access subset you can do more: for example, you can process them automatically. Now you might ask: if the others are available, why can't you process them? The reason is that the publishers often have little clauses in their contracts so that, even if your library pays a lot of money to get the articles, you are not free to process them automatically. I think this kind of limitation is restricting the potential of applications to discover knowledge in the literature, and I think it would be a good initiative for the Swiss Institute of Bioinformatics to push for a change in the law in Switzerland that would make such kind of
clauses illegal. I think this happened in the UK, for example: a couple of years ago some people working in text mining petitioned the parliament, and the law was changed. And it is actually logical: the research is produced by projects funded by public money, we pay the publishers to publish it, and then we are not even free to do what we want with the articles. So this would be an important initiative, to change this state of affairs so that all the literature is free and accessible for whatever scientific purpose we might want. Okay. Now, PubMed Central was created with the goal to integrate the literature with a variety of other information sources, such as databases: "the unanticipated and significant discoveries that such links might foster excite us and stimulate us to move forward" — this was the statement when PubMed Central was created. So the initiators of PubMed Central had in mind the idea that we could link the literature and databases and create rich knowledge sources from this combination. From 2005, the NIH required all projects funded by them to produce open access versions of their scientific output, that is, to submit it to PubMed Central. And, by the way, just today the Swiss National Science Foundation sent out an announcement saying that they are introducing the same rule: all research funded by the SNF will have to be published in open access form. I think this was long overdue. It's a good initiative, but still many publications are not open access, and a lot of publishers have the kind of clauses that I mentioned before, which restrict the possibility to process the literature. I think it's time for a change, and I hope that the Swiss Institute of Bioinformatics does something to change the situation — I would be happy to coordinate such an initiative. Now let me tell you something about my research. In order to build this kind of system, a system that finds entities and relationships in the literature or in other textual sources, we
need to find the entities, the items of interest in the domain. Our approach is based on sourcing information from several databases — many of the SIB resources, and several NIH and EBI resources — and basically fetching from those databases the terminology that defines the entities that we want. So for every entry in UniProt we take the names of the proteins; we take those names and the corresponding identifiers, and we put all of them together. We have created the interface that you see here, where a user can go and say: I want terms from the Cell Ontology, from the EBI, from Entrez Gene, and so on. The system, when you open the page, also checks whether there has been an update of any of those resources since the last time. You select what you want, and then you can download a simplified file, in a very simple format, which contains the terminology from those databases — because there is a lot of other information in those databases which we don't need for text processing. What we want to keep are the names and aliases of the terms, and their IDs; the original IDs allow us to go back to the original entry if we want to find more information. We have also studied the properties of the terminology that we obtained from these databases. For example, this is a study of the ambiguity of terms in different categories — these two plots are, for example, cell lines and chemicals, and the axes are in logarithmic scale. What these graphs indicate is that, for example, for cell lines, almost 50,000 of them are unique: there is a unique correspondence between name and identifier, no ambiguity. This other one shows that chemicals are a lot more ambiguous: there are a lot of cases where the same name corresponds to two identifiers, or more than two. So the first category would be easy to process for a text mining system, while the second would be difficult. This table shows the same problem, and the problem is particularly big in genes and proteins
where, on average, every term can correspond to up to 1.33 identifiers. There are, as you know, genes and proteins where for a given name you find hundreds of identifiers. So this ambiguity is, in those cases, very difficult. The cases shown here, with diseases, are ones where for one identifier there are several synonyms — different names for the same object. This is not a problem for a text mining system; the previous one is, instead, a big problem. This table shows another type of problem: ambiguity across different terminologies. The previous problem was ambiguity within a single terminology; this is the problem of ambiguity between different terminologies. We see, for example, that cellular components overlap up to 55% with genes and proteins. This means that, of the names that are used in papers to mention cellular components, half could also be the name of a protein or a gene. For cell lines, up to 67% could be names of proteins or genes. So when we see such a name, it will not be obvious whether it is really the cell line being mentioned, and this is the biggest problem we have to solve: the ambiguity of those names. There are also some cases which are very easy. So, what do we do with all these terminologies? Apart from studying the properties of the terms, we have also implemented an annotation system. It is called OGER, the OntoGene Entity Recognizer. Basically, you can go here, you will find a web interface, and you can enter a PubMed ID or some free text and select some terminologies, and the tool will annotate the document you present. You can use this just as a demo, to show what the system does on a specific case, but it can also be used as a web service; on the same page you will find indications on how to use it as a web service. As a web service, it means that you could build an application that just sends batches of documents to this server and gets back annotated versions of those documents. And we have different
input and output formats. We have implemented this with RESTful web interfaces, and we have also evaluated this service in text mining competitions, some of those that I mentioned before. These are the input and output formats that we support. The input formats are plain text, BioC — which is a format that has been defined for biomedical text mining — and the XML format of PubMed Central; the output formats include BioC, TSV (tab-separated values), and formats for various visualization tools. We also have a disambiguation component, which I will mention in a moment, but the disambiguation service is not yet integrated into the web service, so what you get from the web service at the moment is only the plain terminology lookup — basically a marking up of all the entities in the text. As I said, we tested this in a competition last April, where the goal was to provide a named entity recognition service for several different types of entities, and the organizers evaluated the different systems. In this case the evaluation was about efficiency, and as you see we did quite well: on several metrics we were the best. The way we implemented this is that the system receives a request from the remote client and can then distribute the requests to different workers; it then combines the results and sends the annotation back. Okay. As I said, the first part of the system, OGER, does the entity recognition, which is what you see here. But the next problem is to assign an identifier to each span of text. You have seen a moment ago that entity names are very ambiguous, both within a single class and across different classes. For example, if we look at HSC, we can find that it corresponds to a Gene Ontology term, or to one protein or another protein, or to a term in yet another knowledge base. That name could mean different things; of course, in a particular context it means one of them — but how would a machine recognize automatically what is the correct
Well, we have tried to implement systems that solve that problem, but before I get to the system, let me show you some examples of why this problem is so big. These are names of genes, and if we continued, we could probably find for almost every English word a gene or a protein that uses that word; if we did the lookup in a naive way, it wouldn't work. And some funny gene names; probably you have seen some of these lists. I could tell you some stories, but in the interest of time I will skip the stories about the funny gene names. Another example of ambiguity is "cat": a cat is basically an animal, but it could be an experimental method, it could mean computerized axial tomography, and if you search for it in Entrez Gene you find several genes that use this abbreviation. So recognizing entities, and interpreting them correctly, is not that easy.

What we have tried to do in my group is to implement a machine learning approach that disambiguates those annotations, that is, tries to find the correct interpretation. We have tested this on a manually annotated corpus created by a group at the University of Colorado, a corpus called CRAFT. They manually annotated 67 articles: they went through every single mention of genes, diseases, and so on, and marked it up with the correct interpretation. What we tried to do is replicate that kind of work automatically: could we do it? We have a recently published paper showing that, for the types of terms we deal with, we are able to achieve state-of-the-art results. State of the art basically means that, on this particular annotated corpus, we have the best results published so far: a precision of 86% overall on the terms that we deal with, a best recall of 66%, and a best F-score of 70%. That is the first problem, finding entities: deciding, for each span of text, is this a gene, is this a disease, is this whatever it is. The next problem, as explained, is: once you have that span of text, what is the correct ID?
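For readers unfamiliar with the metrics just quoted: precision, recall, and F-score are computed from true-positive, false-positive, and false-negative counts. The counts below are invented for illustration. Note also that the talk quotes best precision, best recall, and best F-score separately, so the three figures need not come from a single run; the F1 of 86% precision with 66% recall would in fact be about 75%, not 70%.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard evaluation metrics for entity recognition."""
    precision = tp / (tp + fp)          # fraction of predicted spans that are correct
    recall = tp / (tp + fn)             # fraction of gold spans that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: 86 correct spans, 14 spurious, 44 missed.
p, r, f = precision_recall_f1(tp=86, fp=14, fn=44)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.86 0.66 0.75
```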
For this second problem, which is much harder because there are a lot more IDs, we get these results: about 50% in precision, recall, and F-score. It doesn't seem much, but it's the best available so far. In current experiments we are trying to improve on that by using information that is available in context: there are several cases which are unambiguous, and maybe we can use those cases to help solve the cases which are ambiguous. This is current work.

And this was just the latest result; we have been working in this area for more than a decade now, and we have had several such results. We have participated in several competitions and obtained best results: for example, we had the best system for finding experimental methods in one of them, and in 2009 the best system for other types of entities.

To conclude this part, and then if you give me another five minutes I will show you my current projects: from this part of my research we have systems for named entity recognition and concept identification which are state of the art and publicly accessible (OGER and the Bio Term Hub). You are welcome to test them and send us feedback. The only warning: the disambiguation that I mentioned is not yet included in the web service; I hope we will have it included very soon.

OK, now a quick overview of the projects where we apply those things. That was the basic, fundamental research that we do; where do we apply it? We had a project in collaboration with the veterinary faculty of the University of Bern, where we looked at pathology reports. Farmers bring dead animals to the veterinary clinic to be examined, to find out what the animal had, and the veterinary doctor writes a report in free text, in German in this case, and this free
text can contain all sorts of things. This is one disease, intestinal torsion, and you see there can be misspellings, there can be spelling variants of the same term, there can be synonyms and inflections of the term. So again we are confronted with a problem: if we want to find all the instances of this disease, we have to be able to say automatically that all of these are the same. In order to do that, we have to examine the terminology and recognize the similarity, and we do it, to some extent, automatically, by connecting those terminologies to standard resources. Here we have terms that we extracted automatically from the pathology reports, and here we have a standard resource, the Unified Medical Language System (UMLS) of the National Library of Medicine; we basically try to find the connections between the terms that we find and the concepts that are represented in the UMLS. This way we can build bridges between the terminology and the conceptual representation of the content of the text, and all these analyses are used by our partners at the veterinary faculty to study the diseases of those animals.

We have also built a classifier that classifies each report according to a set of disease categories. Our partners created a set of categories for the two animal species that were studied, and we automatically classify the reports: a disease of the intestinal system, a disease of the reproductive system, and so on. They then use the results of this classification for epidemiological studies, to see where some diseases are appearing, geographically and temporally, and whether there are anomalies in the seasonal trend. If you notice an anomaly in the trend, you might hypothesize that there is some kind of problem, some kind of epidemic going on. So this is one project.
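The variant-matching idea described here, linking misspelled or inflected report terms to a standard resource, can be sketched with simple string similarity. The concept inventory and IDs below are invented placeholders, not UMLS data, and real systems combine much richer evidence than a single edit-distance ratio.

```python
from difflib import SequenceMatcher

# Invented mini concept inventory; a real system would link against UMLS.
CONCEPTS = {
    "intestinal torsion": "concept-1",
    "pneumonia": "concept-2",
}

def link_term(term, concepts=CONCEPTS, threshold=0.8):
    """Link a (possibly misspelled) extracted term to the closest concept name."""
    best_name = max(concepts,
                    key=lambda name: SequenceMatcher(None, term.lower(), name).ratio())
    score = SequenceMatcher(None, term.lower(), best_name).ratio()
    return (best_name, concepts[best_name]) if score >= threshold else None

print(link_term("Intestinal torson"))  # misspelling still links to concept-1
```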
Another project involves mining the literature to find information about causes of psychological disorders. There can be a lot of causes of psychological disorders, and the challenge here is that the possibilities are so varied. Basically, we have created a small corpus where we manually annotated the relationships between disorders and potential causes: say, alcohol abuse and smoking as potential causes, or significant predictors, of panic attacks, and so on. The information from this corpus is then used to train a system that should recognize similar cases automatically, and this again can be used to study temporal trends, like here: mentions of different disorders in the literature over time. Such a trend could also reflect a change in the terminology that is used, rather than a trend in the real occurrence of the diseases. The most complex type of problem to solve is to find the relation between the causes and the diseases, and here we can use the kind of deep analysis that I mentioned before. This was not actually used in this project; I mention it as an example of a potential deep application of natural language processing to the literature. This is a syntactic tree that represents the syntactic structure of the sentence, and through that structure we are able to recognize that the sentence expresses a particular type of relation involving the two entities.

Another project I am involved in is a collaboration with an industrial partner, a big pharma company, so I cannot say too much, but basically the problem we deal with is finding out whether two author names refer to the same person. Is this Sherlock Holmes the same as that Holmes? Sometimes it's obvious, but sometimes it's not: the affiliation information may have changed, and so it's very difficult. We are trying to develop an automatic way to decide, on the basis of what the authors work on, what they publish, the content of the text, whether the two names might be the same person.
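The content-based comparison just described can be sketched as cosine similarity between the word bags of two authors' publications. The publication keywords below are invented; a real system would use far richer features (affiliations, co-authors, topics) and a trained model.

```python
from collections import Counter
from math import sqrt

def cosine(a_terms, b_terms):
    """Cosine similarity between two bags of words."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

pubs_a = "protein interaction text mining corpus".split()
pubs_b = "text mining biomedical corpus annotation".split()
pubs_c = "soil erosion crop rotation yield".split()

print(round(cosine(pubs_a, pubs_b), 2))  # high overlap: possibly the same person
print(round(cosine(pubs_a, pubs_c), 2))  # no overlap: probably different people
```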
Another project, which has already started but where our own contribution will start in a couple of months, is in the framework of NRP 74, Smart Health Care. It's a partnership with a number of partners, including Lausanne and Geneva and Swiss hospitals, where we will analyze electronic health records and try to find information about reactions to antithrombotic drugs; this is the limited use case studied to show the potential of analyzing electronic health records. And, talking of electronic health records, I want to mention this as an example: a paper published by researchers in Denmark, who had access to all electronic records, 6 million records of all patients in Denmark over a number of years. By studying those electronic health records, they were able to make predictions about the evolution of diseases over time: you start from some kind of disease, you get some kind of treatment, then something else happens, and we see this happening over and over again across patients; they get medicine, they get something else as a result of the problem, they go to the doctor again, and it gets worse and worse. This kind of fact is very important for improving the treatment of the patient, but by looking at only a few cases the medical doctor might not have the full picture; we would have it if we had all the cases, with a lot more information, if we had the same accessibility in Switzerland that there is in Denmark. Unfortunately, at the moment, for obvious reasons, there are a lot of restrictions. One way to make records more available for research is de-identification: removing all personal information from medical records so that they can be made accessible for research. That is another project in which I am involved, trying to de-identify electronic health records.

Finally, I want to conclude with this application: assisted curation. This is a collaboration with the RegulonDB group in Mexico, in a project funded by the NIH. We have
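The de-identification step mentioned a moment ago can be caricatured with a few substitution patterns. The patterns, names, and ID scheme below are invented examples; real de-identification of clinical text is far harder and typically combines machine learning with curated dictionaries.

```python
import re

# Simplistic de-identification sketch. All patterns and examples are invented;
# real clinical de-identification needs far richer models.
PATTERNS = [
    (re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"), "[DATE]"),           # e.g. 03.05.2017
    (re.compile(r"\bAHV-\d{3}\b"), "[ID]"),                       # invented ID scheme
    (re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+[A-Z][a-z]+\b"), "[NAME]"),
]

def deidentify(text, patterns=PATTERNS):
    """Replace personal information with placeholders."""
    for pattern, placeholder in patterns:
        text = pattern.sub(placeholder, text)
    return text

record = "Dr. Meier saw the patient (AHV-123) on 03.05.2017."
print(deidentify(record))  # [NAME] saw the patient ([ID]) on [DATE].
```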
been proposing for many years the idea that text mining technology can help the process of database curation. Personally, I have had a few negative experiences when I tried to promote this idea; the reaction from database curators was: we don't need it, we are doing fine. Of course, the quality of manual curation should not be touched, and we don't want to touch it, but I believe that text mining technology can help make the process more efficient. In this collaboration with the RegulonDB group, what we do is try to help them find the information that they need more efficiently. RegulonDB is a database that curates regulation in E. coli. There are cases like this, where the literature states something like: "we additionally found that the expression of the mntP gene is upregulated by manganese through MntR". What do we have here? We have MntR, a transcription factor I think it is, which promotes the expression of mntP. This was a fact that was manually curated: they have read 6,000 papers, found facts like these, and stored them in the database. Now, in the database they don't keep track of where the information was found, which sentence; they often, but not always, record in which paper it was found, but not exactly where in the paper. The next problem they were facing was that it would be useful to also record the experimental condition under which each fact was observed. But for all the interactions they have, how do they find the experimental condition? Do they have to read those 6,000 papers again? Together with them, we developed an alternative solution based on text mining, and the idea is shown here: I have this interface, called ODIN, which shows an article annotated with terminology, the same kind of ideas that I have been discussing up to now, except that the specific terminology used here is the one coming from the database.
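The retrieval idea behind this assisted-curation setup, finding the sentences that mention both partners of a curated interaction so the curator can read the surrounding experimental conditions, can be sketched naively as follows. The sentence splitting is simplistic and the example text is invented for illustration.

```python
# Naive sketch of assisted curation: given a curated pair (regulator, target),
# find candidate sentences that mention both, so a curator can inspect the
# surrounding text (e.g. for the experimental condition). Real pipelines use
# proper sentence splitting and the database's own terminology.
def find_evidence(text, entity_a, entity_b):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences
            if entity_a.lower() in s.lower() and entity_b.lower() in s.lower()]

abstract = ("MntR binds manganese. The expression of mntP is upregulated "
            "through MntR. Growth was measured in minimal medium.")
print(find_evidence(abstract, "MntR", "mntP"))
```

Once such a sentence is located, the experimental condition is typically described nearby, which is exactly what the curators exploit in the workflow described next.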
We take the terminology used in the database, annotate the 6,000 papers that they had already read, and then try to find automatically, in those 6,000 papers, the sentences that express the relationships that they now want to enrich with information about the experimental condition. In this experiment, which was published a couple of years ago, in 2014, we were able to do that: we showed that this information can be found efficiently and reliably, and once the curators have found such a sentence, they can very easily locate the experimental condition next to it. This helped them increase the quality of the database in a much more efficient way; now the database contains growth conditions, which were not there before, associated with the different types of interactions, the types of conditions that are shown here. Basically, the conclusion of this experiment is that technologies like those I have shown can help make the process more efficient without having any impact on quality. Again, it is always the database curator who takes the decision; the system does not take the decision, it provides information that guides the curator to what he needs to find. We are now continuing this work, trying to integrate these different technologies into the curation pipeline, in the scope of the NIH-funded project.

And now, finally, I come to a conclusion, a bit late. Working with me are Lenz Furrer, Tilia Ellendorff and Nico Colic, and I also want to thank the people of the RegulonDB group, who have been very important: Yalbi Balderas, Socorro Gama-Castro and, most importantly, last on the list, Julio Collado-Vides, the leader of the group in Cuernavaca, in Mexico, who basically is the leader of the RegulonDB group. I met them at one of the biocuration conferences, and meeting them has been very important for me, because, as you have seen, my background is entirely NLP, computational
linguistics, computer science; I don't have a background in biology. I had been trying to promote my approaches to biologists and database curators, without much success, for a long time; Julio welcomed me, and we worked together. Probably the reason why he was interested in my work is also his strong interest in languages and in language processing, some kind of shared background; that was an important thing. So let me now conclude with the logos, which I normally put at the beginning. These are the institutes that I work for or collaborate with, and these are the institutions that fund my research. So, thanks a lot to you here in the room and to all those who have been listening remotely.