Hello, I'm Nancy Ide from Vassar College in Poughkeepsie, New York. My co-author is Dan Blankenberg from the Cleveland Clinic's Genomic Medicine Institute, and our presentation concerns a project to bridge natural language processing and biomedical data analysis. The basis of our work is scientific publication mining. The idea is that there are many ways to use the scientific literature not only to extract information you are interested in, but also to discover new information, such as gene-cancer and gene-drug interactions, drug repurposing, and so on. But there are serious obstacles for scientists who might want to pursue this. In principle, just the difficulty of homing in on relevant publications is sometimes insurmountable, and furthermore, scientists usually lack expertise in extracting and exploiting information from unstructured textual data. Finding relevant publications, problem number one, is really the main obstacle because of the sheer number of publications that exist these days: PubMed alone now contains over 32 million references, and thousands are added every day. Even when you do manage to get some results, they are not always very helpful. For example, a search for "does p53 bind to the MYC promoter?" might do okay, but a search for "what transcription factors bind to the MYC promoter?" will return thousands of documents, and sifting through those is prohibitive. Thirdly, there are multiple search engines and multiple repositories, and you must either choose among them or go back and forth between them; that is a problem too. Natural language processing, which we call NLP, can help with this by applying, say, a preconfigured workflow that performs a deep linguistic analysis going beyond what keyword-based search engines can provide.
And the same tools can take you even farther once you have the relevant documents, helping you find and extract explicit and implicit information from the retrieved publications. There are several NLP tasks relevant to exploring biomedical texts. Information retrieval we have already talked about; you know what that is. Information extraction is extracting facts and events of interest to the user: the user provides an instance of what they are looking for, and extraction can find all the occurrences, or perhaps the relationships among occurrences. Thirdly, and perhaps most exciting, is text mining (data mining over text), which discovers unsuspected associations; it can uncover new knowledge and find associations that were not obvious or explicitly stated. NLP works by combining linguistic information, such as grammatical relations, with semantic resources like ontologies and controlled vocabularies. The major technologies are named entity recognition, relation extraction, and event extraction, all supported by statistical analysis and, more recently, almost exclusively by machine learning. Named entity recognition (NER) is really the most fundamental task. It is just what it says: it recognizes named entities, such as proteins or species, and it will recognize variants of the same name and so on. But for biomedical texts, NER is difficult because of the heavy use of domain-specific terminology, because people constantly introduce new terms, and because people use abbreviated short forms and alternative forms, among other complications. Relation extraction is the task of extracting semantic relations from a text, usually between two (or sometimes more) entities.
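As a concrete illustration of the NER task just described, here is a minimal, purely illustrative sketch of dictionary-based entity recognition in Python. The lexicon, the canonical IDs, and the example sentence are invented for this example; real biomedical NER systems, including those on our platform, are far more sophisticated and typically machine-learned.

```python
# Toy dictionary-based NER: map surface forms (including short-form
# variants like "p53" for TP53) to canonical IDs. Lexicon entries and
# ID strings are invented for illustration.
import re

LEXICON = {
    "tp53": "GENE:TP53",   # canonical symbol
    "p53": "GENE:TP53",    # common short form of the same gene
    "myc": "GENE:MYC",
}

def tag_entities(text):
    """Return (surface form, canonical ID, character offset) for each hit."""
    hits = []
    for m in re.finditer(r"[A-Za-z0-9-]+", text):
        key = m.group().lower()
        if key in LEXICON:
            hits.append((m.group(), LEXICON[key], m.start()))
    return hits

print(tag_entities("Does p53 bind to the MYC promoter?"))
# finds both p53 and MYC, mapped to the same canonical-ID scheme
```

The variant problem shows up immediately: "p53" and "TP53" must resolve to the same entity, which a plain keyword search would treat as different strings.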
In general domains, people look for things like person-organization or person-location pairs with relations like "married to" and so on, but in biomedical texts you are usually looking for interactions between biomolecules, or for events occurring over time, that is, temporal or causal relationships. Here is an example of relation extraction. In the first little text there are two entities, Doppler echocardiography and artery stenosis, and the verb is "diagnose," so we get the triple (Doppler echocardiography, diagnose, artery stenosis). The example below it is a bit more complicated, involving binding to GTP: we get that GTP binds to the RAF protein, and that in turn is said to bind to the RAF protein kinase. Text mining, as stated, is finding new information. The classic example is Swanson's work, where he discovered a connection between fish oils and Raynaud's disease; blood viscosity was the link. Documents about fish oil referred to its effect on blood viscosity, and documents about Raynaud's disease noted a correlation between blood viscosity and the disease. So he basically said, hey, it looks like fish oil treats Raynaud's disease, and it was later clinically corroborated. Text mining is not just for full text; it can also be applied to other kinds of textual data that science researchers often use, like annotation metadata from DNA sequence repositories. Now, all this is great, but scientists do not usually have the expertise or the time to build an appropriate NLP workflow or to find the tools. And even if you can manage to use NLP, it is not easy, in fact near impossible, to use the results in your familiar biological analytical tools. What we need is a platform that provides access to a broad range of publication data, access to a wide array of NLP software, adequate computing resources for large-scale textual analysis, and ways to ingest extracted data into familiar biological analytical tools.
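Swanson's "ABC" discovery pattern described above can be sketched in a few lines of Python. The toy document term sets below are stand-ins for real co-occurrence data; the point is only the logic: find bridge terms B that co-occur with A in one literature and with C in another, even though no single document links A and C directly.

```python
# Sketch of Swanson-style literature-based discovery (the "ABC model"):
# the A-literature links fish oil to blood viscosity, the C-literature
# links blood viscosity to Raynaud's disease, so we infer a candidate
# A-C association. Document contents are toy stand-ins.
docs = [
    ("doc1", {"fish oil", "blood viscosity"}),
    ("doc2", {"blood viscosity", "raynaud's disease"}),
    ("doc3", {"fish oil", "platelet aggregation"}),
    ("doc4", {"platelet aggregation", "raynaud's disease"}),
]

def cooccurring(term):
    """All terms that appear in some document together with `term`."""
    linked = set()
    for _, terms in docs:
        if term in terms:
            linked |= terms - {term}
    return linked

def bridge_terms(a, c):
    """B-terms connecting A and C, although no document mentions both A and C."""
    return cooccurring(a) & cooccurring(c)

print(bridge_terms("fish oil", "raynaud's disease"))
# → {'blood viscosity', 'platelet aggregation'}
```

Note that no single document in the toy corpus contains both "fish oil" and "raynaud's disease": the association is discovered, not retrieved.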
And we need all of this in one place. We have a proposal for a solution: the Language Applications Grid, known as the LAPPS Grid, together with Galaxy. The LAPPS Grid is an existing platform for NLP, and it includes popular open NLP tools such as the Stanford tools, spaCy, and so on. Importantly for the non-computer-scientist, the tools are all interoperable, overcoming an obstacle a lot of people encounter: the output of a tool from one place does not work as the input to a tool from another place. The good news is that the LAPPS Grid already uses the Galaxy framework as a way to combine its NLP services. All of our tools are implemented as web services, and we provide some visualization, means for evaluation, and ways to save and share workflows, which is basically part of Galaxy. We have been using Galaxy since 2012. Here is a screenshot of the LAPPS Galaxy instance, so it should look familiar, at least in format. On the left-hand side you can see a number of different kinds of tools, and down at the bottom you can see that they are also grouped by developer. And here is a screenshot showing the biomedical named entity recognizer, clicked on there underneath the biomedical tools. All our tools, as mentioned, are interoperable. Here we have an OpenNLP tokenizer fed into a Stanford sentence splitter and a LingPipe part-of-speech tagger, and the gene tagger is actually a tool from yet another platform; you do not have to worry about that kind of thing. We have some visualization: in this particular case you can see annotations for genes on top of the part-of-speech annotations, and down below is an example where entities have been annotated with their IDs in the Gene Ontology. The LAPPS Grid also has the AskMe search and query engine; this slide shows a query and its response.
One thing that does not show on this particular screenshot is that once you get results, there is a button to export them and import them into Galaxy, so that you can use them directly in an NLP workflow. Another thing we support, which is extremely important for biomedical text mining, is domain adaptation. When you are working in a specialized domain, the likelihood that there are resources, like a vocabulary list, containing all the entities you are interested in is minimal. We support an iterative process: the scientist can run, say, an entity recognizer, look at the output manually, identify missing or incorrectly identified entities, adjust the lexicon or ontology, retrain the model, rerun the pipeline, and then look at the new performance evaluation statistics. So where do the Galaxy tools come in? What you would like is to transduce these results into formats that standard Galaxy analytic tools can use. Here you see that the results of your (possibly iterative) improvement go into various different Galaxy tools. Our goal, then, is to provide researchers with the ability to go from a publication database query all the way through an NLP analysis and on to ingestion of results into your analytic tools, without leaving the Galaxy platform. Here is a visual of how we envision this happening, in three steps: search and retrieval, where AskMe queries numerous archives; an NLP workflow; and a format converter feeding into Galaxy analytic tools, which produce whatever results are relevant. To accomplish all this, we have to do some extending of both the LAPPS Grid and Galaxy. For the LAPPS Grid, first off, we want to extend AskMe to provide access to a wide range of public repositories like PubMed, PubMed Central, and other archives.
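The domain-adaptation loop just mentioned (run a recognizer, inspect the output, extend the lexicon, rerun, re-evaluate) can be caricatured as follows. This is a hedged sketch with a trivial substring recognizer and invented entity names, not the actual LAPPS tooling.

```python
# Caricature of the domain-adaptation cycle: measure recall, let a
# curator add a missing term to the lexicon, and rerun. The corpus,
# gold standard, and term names are invented for illustration.
def recognize(text, lexicon):
    """Trivial recognizer: report every lexicon term found in the text."""
    return {term for term in lexicon if term in text.lower()}

def evaluate(found, gold):
    """Recall: fraction of gold-standard entities the recognizer found."""
    return len(found & gold) / len(gold)

corpus = "BRCA1 and the novel marker XYZ42 were both upregulated."
gold = {"brca1", "xyz42"}

lexicon = {"brca1"}                                  # initial vocabulary
print(evaluate(recognize(corpus, lexicon), gold))    # 0.5: XYZ42 is missed

lexicon |= {"xyz42"}                                 # curator adds the missing term
print(evaluate(recognize(corpus, lexicon), gold))    # 1.0 after the rerun
```

In practice the "adjust" step feeds a retrained statistical model rather than a substring matcher, but the evaluate-adjust-rerun cycle is the same.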
This would in itself be new: you would have one-stop access to a substantial portion of the literature, and you would not have to switch among repositories. We also want to beef up our retrieval performance; we have already published a paper, referenced below, showing that AskMe is on a par with similar platforms. We also, obviously, want to augment our NLP services to better serve Galaxy users. One thing would be including more tools specifically tuned for biomedical analysis. We want to implement configurable workflows for basic document preprocessing, so that if people want to do entity extraction, they have a workflow already there, and likewise for relation extraction or dependency parsing, which can also provide a lot of relational information. We want to provide tools that extract relevant information from NLP tool results: we saw the visualization of the gene annotations, but you might just want the list of genes, or the relation triples, and so on. And we want to integrate LAPPS tools as native Galaxy tools, which would enable people to make use of their own resources. For Galaxy, we have to enhance interoperability, especially by including support for web services, which is not currently natively supported. Then you would have full support of the LAPPS Grid, but there are also many, many bioinformatics web services, and you could have access to all of these without their having to be specifically wrapped or annotated or in any way jury-rigged to serve as input to Galaxy. We also want to enhance Galaxy to directly support linguistic data set constructors, annotators, and visualizers; in other words, we are integrating the LAPPS Grid much more fully into Galaxy than it had been before. Another extension is implementing means to curate and save a customized document set for your specific research needs.
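To make the "extract just the list of genes or relation triples" idea concrete, here is one plausible shape for such a transduction step: flattening NLP annotations into a tab-separated table that downstream Galaxy tools can ingest. The record structure and field names are assumptions for illustration, not the actual LAPPS output format.

```python
# Flatten NLP annotation records into TSV. The annotation records here
# (field names, ID values, offsets) are invented placeholders standing
# in for real NLP tool output.
import csv
import io

annotations = [
    {"text": "TP53", "type": "gene", "id": "GENE:TP53", "start": 12},
    {"text": "MYC",  "type": "gene", "id": "GENE:MYC",  "start": 40},
]

def to_tsv(records):
    """Serialize annotation records as a tab-separated table with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "type", "id", "start"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

print(to_tsv(annotations))
```

Tabular output like this is the lingua franca of Galaxy tools, which is what makes a converter of this general shape the natural bridge between the two worlds.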
Very often, you know, you are going to do a query and get a corpus, maybe a domain-specific one, that you want to reuse and share with other people. Right now there are ways to share workflows, but not resources like this, so that is another extension to Galaxy. We also need to implement transduction tools that enable researchers to transition between linguistic results and bioinformatics tools. Here is an example: a researcher may extract sets of protein IDs from linguistic queries over their custom literature-based corpora and then retrieve the corresponding Protein Data Bank files. Along the same lines, we want a collection of tools covering standard use cases for this transduction, like protein interactions, gene-disease association, and gene cluster identification. We want the transduction to be bi-directional, so that you can also go backwards from bioinformatics tools to linguistic analysis tools. For example, if a researcher conducts a single-cell omics analysis and obtains some differentially expressed proteins or pathways, that output can serve as the basis for further linguistic analysis, and maybe even surface new information. The products we aim to deliver are, first, a collection of Galaxy workflows that leverage linguistic analysis, transduction facilities, and standard bioinformatics tools to perform meaningful analyses. We are going to look at things people have commonly done in the literature and start with an initial set of workflows for various tasks, protein docking, gene-disease associations, et cetera, and engage the community to ensure that these modules are generalizable and not tied to a single use case. Finally, documentation and interactive hands-on training will be provided, all of it submitted to the Galaxy Training Network, so we will have tutorials as well as documentation. Stay tuned.
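The protein-ID-to-PDB-file use case might look something like the sketch below: given PDB identifiers extracted from a literature corpus, construct download URLs against the RCSB file server. The extracted IDs are arbitrary examples, and no network access is performed here.

```python
# Build RCSB download URLs for PDB IDs extracted from the literature.
# The example IDs are arbitrary (1TUP is a p53-DNA complex structure);
# a real transduction tool would also fetch and hand the files to a
# downstream Galaxy tool.
def pdb_download_urls(pdb_ids):
    """Map 4-character PDB IDs to download URLs on the RCSB file server."""
    base = "https://files.rcsb.org/download"
    return [f"{base}/{pid.upper()}.pdb" for pid in pdb_ids]

extracted = ["1tup", "4hhb"]   # IDs as they might come out of a text-mining step
for url in pdb_download_urls(extracted):
    print(url)
```

The same pattern, extracted-identifier in, repository artifact out, covers the other standard use cases mentioned above, with the URL template swapped per repository.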
We are hoping to begin implementation in the next several months. Thank you for your attention.