will be given by Horacio Saggion, who is in the Natural Language Processing group, and he will present his project.

Thank you very much, Xavier. I am very happy to present the project, the collaboration that we are carrying out in the context of the María de Maeztu program. It is a project to mine the knowledge contained in scientific publications. I am Horacio Saggion, from the Natural Language Processing group in the department.

As we probably all know, we are overwhelmed with the amount of scientific information available; as researchers we face a growing number of publications. Take, for example, the ISI Web of Knowledge: it indexes over 90 million records. Take PubMed, the reference database for research in the biomedical sciences: it contains over 24 million publications. And if you are interested in the field of patents, the European Patent Office has received over 2.5 million applications since 2006. Not only do these legacy databases contain huge numbers of scientific documents, but recent estimates indicate that a new research paper is published every 20 seconds, which means the volume will continue to grow at an exponential rate.

This obviously poses challenges for scientists worldwide, because we need to keep up to date with new developments and with new fields that are appearing. We need to review papers, review project proposals, write our own proposals, and write our state-of-the-art reports. But with these challenges come nice opportunities: for example, the need to develop advanced text mining and summarization tools that can transform this unstructured textual information into rich knowledge structures. By doing that, we can help humans obtain answers to very challenging questions, provided the knowledge is structured in an appropriate way. We can create improved document summaries that are different from what the author says about his or her own paper. And we can gather relevant information from here and there in order, for example, to write a state-of-the-art report. Also, if we transform textual information into actual knowledge, machines will be able to reason over that knowledge, to make connections between otherwise unconnected pieces of information, and probably to find synergies between researchers and research areas, things that we have not thought of before.

The outline of the presentation is as follows. First, I will talk about the collaboration that we are carrying out in the context of the María de Maeztu initiative. The work does not start with this initiative, so I will present some background for this project, which is the Dr. Inventor project. I will talk about an annotated dataset that was created in that project, which is very useful for developing and training machine learning algorithms, for example. And I will talk about a text mining and summarization tool that has been created there; the objective is to make it more widely available and richer in the course of the María de Maeztu collaboration. Then I would like to present two case studies that we have carried out in the past six months.
One case study concerns the exploration of semantically annotated publications, and the other concerns the summarization of scientific articles, in particular computer graphics articles. Then I will briefly comment on related projects that we believe are similar to what we are doing here, but actually a little bit different. And then I will close with our contributions so far, a summary, and an outlook for the future.

Okay, so we decided to set up a collaboration between two complementary but related research groups, the Natural Language Processing group and the Web Research group. These are two groups whose expertise has some intersection, but in fact we deal with different topics. The Web Research group provides expertise in manipulating huge volumes of textual information: analyzing it, indexing it, and developing tools for effective search. The TALN research group provides expertise in transforming unstructured information in textual format into rich semantic representations. Some members of the collaboration are here: Francesco Ronzano, Ana Freire, Diego Sáez, Ricardo Baeza-Yates, Beatriz Fisas, Luis Espinosa, Francesco Barbieri, Ahmed AbuRa'ed, and myself. If you see them around, you can ask them particular questions about the details of the project.

Okay, so the background of this work is a European project that started in 2014 and will end in six months. The María de Maeztu framework allows us to continue developing the tools and the datasets that we created in that project. It is a project in the area of scientific creativity, where the main objective is to develop a support tool that will allow scientists to discover analogies between otherwise unconnected pieces of research. But in order to create this tool, you need to work with semantic representations of documents: documents that carry not only the text but also semantic information.

We know that most papers are available either on paper or in PDF. Although there are XML formats currently exploited by some publishers such as Elsevier or Springer, most publications are available to us in PDF format, and PDF is just a layout format that provides no semantics about the content of the document. In a research paper you will find things such as the title, the authors, the abstract, the different sections, tables, images, references, and acknowledgments. All of these are very important parts of a research paper that we want to exploit, but they are only the format of the paper. We are actually very interested in the content of the paper, and also in the argumentative structure of the scientific article. In a scientific article, the author will try to convince you that he or she is addressing a very important challenge. The author will also try to convince you that some past research has addressed this problem, but not in an appropriate way, and that in this paper the problem has been solved. The author will also point to some limitations. So understanding this argumentative structure is very important: it lets us identify what in the paper is important, what is past information, and what is new information, and that can contribute to creating, for example, better summarization tools.
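To make this concrete, here is a minimal sketch of the kind of sentence-level categories involved. The category names are my paraphrase of the scheme described in this talk, not necessarily the exact labels used in the corpus.

```java
/** Rhetorical categories a sentence of a scientific article can be
 *  labeled with, paraphrasing the scheme described in the talk. */
public enum RhetoricalCategory {
    BACKGROUND,   // established context and previously published work
    CHALLENGE,    // the open problem the paper claims to address
    APPROACH,     // what the authors actually do to solve it
    OUTCOME,      // the main results of the investigation
    FUTURE_WORK   // acknowledged limitations and next steps
}
```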
Also, a paper is not an isolated unit: it has a network of connections with past research, and in a research paper scientists not only cite other papers, they also have strong feelings about what they cite. They may praise a particular work, they may criticize a particular work, or they may mention a work simply because it is an established paper in the field. All of these things are also important in order to understand the work that has been done by other people.

Okay, so in order to understand how to deal with these complexities, what we did in the Dr. Inventor project was to create an annotated dataset. The particular field that we work on in this project is computer graphics. We started with a set of computer graphics articles and with an annotation tool developed by Francesco Ronzano, a collaborative annotation tool that can be used over the web, and we had experts from different sites annotate the documents. What types of annotations did we add to these documents? On the one hand, summaries: we asked annotators to go through each sentence of each paper and tell us how important each sentence is for a summary. So they would say, for instance, that a sentence is graded five because it is very important to include in a summary. We then asked them also to annotate rhetorical information for each sentence: does this sentence state the objective of the paper? Does it mention the hypothesis that the author wants to prove or disprove? Does it state the outcome, the main result of the investigation? And then we went to the citations and asked the annotators to annotate the reasons why each citation was made: was it made to criticize, was it made to point to an advantage of the approach presented by the author, and so on. Since we have several annotators per paper, we then reconciled the annotations to create a gold standard: the Dr. Inventor Annotated Corpus, which is available for research and which we have exploited in order to develop some of our tools.

So, based on the needs of processing scientific documents, and based also on available software and peer-reviewed research, we developed a tool that performs extensive processing and enrichment of scientific publications. I will not go into the whole detail, but it is a pipeline of natural language processing components that is able to identify citations and links between citations, to annotate sentences with rhetorical categories, to parse sentences in order to create rich semantic representations, and to produce summaries at different compression rates. I will give some details on some of the components.

As I said, most of the publications that are available to us come in PDF format. In order to transform the PDF into a rich XML representation, we rely on a service provided by the University of Manchester called PDFX. This tool is able, for example, to give us the title, the abstract, the captions, the different sections, and the bibliographic section of the paper. We selected this tool because it was the one that best fitted our purposes: it was the best-performing tool, and it was available to be used for free.
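As an illustration, here is a minimal sketch of how a PDF could be sent to such a service from Java. The endpoint URL and calling convention follow the PDFX usage documented at the time of the project, but treat them as assumptions and check the service's documentation for the real interface.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Posts a PDF to the PDFX service and prints the structured XML
 *  (title, abstract, sections, bibliography) that comes back. */
public class PdfxClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://pdfx.cs.man.ac.uk"))   // assumed endpoint
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("paper.pdf")))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // XML with the recovered structure
    }
}
```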
In order to deal with citations, we have several processes. In the body of the document, we identify all the citation markers, and this is carried out using rules. We also identify each entry in the bibliography, using information provided by the PDFX output or by XML formats that our tools can also read. In order to link the in-text citations to the entries in the bibliography, we use some statistics. Another thing we do is recognize that some citations, like this one, play a syntactic role in the sentence. That is particularly important, because you might write "this author did this thing", so the citation acts as a grammatical constituent, and that matters for us. In order to identify sentences and words, we rely on the GATE framework: we use available tools for sentence splitting and tokenization, and we have adapted them to the particularities of scientific documents. And in order to enrich bibliographic entries with semantics, we make use of several available web services, together with some web scraping techniques, to collect rich information about each of the references.

In order to create semantic structures out of the paper, we make use of a dependency parser, that is, a parser that creates specific dependency relations between words. We use the Mate dependency parser, which has been used in several projects in our laboratory; it is a trainable parser, and we have modified it so that it can deal with citations. In the first example that you see here, the citation plays no syntactic role in the sentence; in the second case, the citation does play a syntactic role: it is the subject of a particular verb. So it is important that we handle this when we analyze scientific documents.

Then, we are interested in annotating sentences with specific rhetorical categories, and for that we use the annotated corpus that we have to train a classifier that predicts, for each sentence, the appropriate rhetorical category, or a default category when the sentence belongs to none of the five categories we are interested in.

Okay, we also deal with coreference: we try to identify which words in a paper refer to the same entity. This is particularly important in order to create knowledge that is connected. We have implemented the techniques reported in the Stanford deterministic coreference resolution approach, so we have two steps: first, rules and some heuristics identify entity mentions in the text, using linguistic information from the parser; then a set of filters creates links between the mentions in the document. In this way, we create connected knowledge in the document. And by doing all this, we are able to extract triples that form our semantic representation; you can think of them as open information extraction triples of the type that is very popular now (a small sketch of this kind of extraction is given below). And because we also have coreference information, we are able to link different entities so as to create connected graphs.

In order to create summaries, we have a tool available in our group called SUMMA. It provides functionalities to score sentences and to rank sentences by score, and it implements a number of classical summarization features. I will tell you a little bit more about that in the case studies.
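Returning to the triples for a moment: here is a minimal sketch of reading subject-verb-object triples off a dependency parse. The Dependency and Triple types are simplified stand-ins invented for illustration, not the actual Dr. Inventor data model, although the relation names (nsubj, dobj) follow common dependency labels; the citation marker in the example is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** A minimal sketch of subject-verb-object triple extraction from
 *  dependency relations. Note that a citation marker that acts as the
 *  grammatical subject simply becomes the subject of the triple. */
public class TripleExtractor {

    record Dependency(String relation, String head, String dependent) {}
    record Triple(String subject, String predicate, String object) {}

    static List<Triple> extract(List<Dependency> deps) {
        List<Triple> triples = new ArrayList<>();
        for (Dependency subj : deps) {
            if (!subj.relation().equals("nsubj")) continue;   // subject edge
            for (Dependency obj : deps) {
                if (obj.relation().equals("dobj")             // object edge
                        && obj.head().equals(subj.head())) {  // same verb
                    triples.add(new Triple(subj.dependent(), subj.head(),
                                           obj.dependent()));
                }
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        // "[Smith 2010] extends the model": the citation is the subject.
        List<Dependency> deps = List.of(
                new Dependency("nsubj", "extends", "[Smith 2010]"),
                new Dependency("dobj", "extends", "model"));
        extract(deps).forEach(t -> System.out.println(
                t.subject() + " --" + t.predicate() + "--> " + t.object()));
    }
}
```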
Okay, so the Dr. Inventor text mining library is available. It contains extensive documentation, provides an API to access the document model and the different modules, and incorporates a number of services that are invoked in order to enrich the documents. It is pure Java, it has been implemented following the GATE framework, and it also incorporates summarization capabilities from the SUMMA library.

Okay, so in the María de Maeztu project, we take Dr. Inventor and its supporting tools, and we add layers that allow us to do more intelligent things. We add a process to crawl information from the web, to bring in documents, and to store them locally. We process those documents to create semantic representations, we index them, and once we have indexed them, we can perform interesting analyses, so that scientists can ask questions, see visualizations of the information that has been gathered, and write reports, for example.

So I will present two case studies. Case study number one is the exploration of semantically annotated documents. In this case, what we have done is to crawl the ACL Anthology. The ACL is the Association for Computational Linguistics, and the Anthology contains documents from over ten years of conferences, not only the ACL conference itself but also associated conferences. We have crawled all the documents from that site, together with the metadata available in BibTeX format, stored everything locally, and passed it through our text mining tools. Now we have documents that are enriched with structural and semantic information that we can access. I will move over here to show you a demonstration; I hope it will work.

So this is a document that has been processed by our tool. It is a recent document, on embeddings, which is a very recent topic in computational linguistics, and you can already see some noise in the extracted data. You can look at the structure of the document: all the sections that have been identified. You can look at the different bibliographic entries, which have been enriched with additional information from available services. You can look at the rhetorical classes that the system has identified: for example, if I click "challenge", you will see the sentence "However, this approach is prone to overfitting when training is performed with scarce and noisy data." This is a challenge that the author presumably addresses in the paper. We also have sentences that have been highlighted as relevant for this paper; these are sentences drawn from the body of the document, not just from the abstract. In addition to that, we create visual representations of the extracted information. This, for example, is a representation of the abstract of this paper: nodes represent concepts or relations, and arcs link the nodes through subject and object relations. You can also see some nodes that carry coreference information, which is actually how the links between the different nodes can be made. Okay, for the sake of the visualization, I will close this now and continue with the presentation.
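To give an idea of how such a graph could be assembled from the extracted triples and coreference chains, here is a minimal sketch. The data model is invented for illustration and is much simpler than what the actual pipeline produces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A minimal sketch of building a concept graph: subject-object triples
 *  become labeled arcs, and mentions grouped by a coreference chain
 *  collapse into a single node. */
public class ConceptGraph {
    public static void main(String[] args) {
        // Each inner list is one coreference chain: "the model" and "it" co-refer.
        List<List<String>> chains = List.of(List.of("the model", "it"));

        // Map every mention to a canonical node: the first mention of its chain.
        Map<String, String> canonical = new HashMap<>();
        for (List<String> chain : chains)
            for (String mention : chain)
                canonical.put(mention, chain.get(0));

        String[][] triples = {
                {"the model", "outperforms", "the baseline"},
                {"it", "requires", "less training data"}};

        // Adjacency list keyed by canonical node; arcs carry the predicate.
        Map<String, List<String>> graph = new LinkedHashMap<>();
        for (String[] t : triples) {
            String subj = canonical.getOrDefault(t[0], t[0]);
            String obj = canonical.getOrDefault(t[2], t[2]);
            graph.computeIfAbsent(subj, k -> new ArrayList<>())
                 .add("--" + t[1] + "--> " + obj);
        }
        // Both arcs end up on the single node "the model".
        graph.forEach((node, arcs) ->
                arcs.forEach(arc -> System.out.println(node + " " + arc)));
    }
}
```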
A second case study that we have developed within this project is the summarization of scientific documents. In particular, we have experimented with the summarization of computer graphics documents, and for this we take advantage of the availability of the Dr. Inventor corpus that I referred to before. In the Dr. Inventor corpus, each sentence has been graded according to what experts believe is important for a summary, and in addition, once they had graded the sentences, three experts each wrote a human summary from the information that they found relevant.

Okay, so what we did then was to design, based on the peer-reviewed literature and also on some intuitions about what information might be important for a summary, a set of features: seven sets of features, where each set contains one, two, or more individual features. We have features for document structure: the position of a sentence in the structure of the document can indicate its relevance, so a sentence that appears in the conclusion is probably more important than a sentence that appears in a related-work section. We have cue vocabulary: there are specific phrases that researchers use in order to highlight the information they present, so they will say, for example, "our technique outperforms such-and-such approach", and this can be a clue pointing to important information. We use term distribution measures that are usually computed over paragraphs or documents in a collection, but we apply them at the level of the sentence. We use rhetorical information: the class or type of information that the sentence is reporting. We use centrality information: for example, we use an implementation of the LexRank algorithm, which is a very powerful summarization algorithm, in order to measure the centrality of a sentence in a document. And we use coreference information: the fact that a sentence mentions an entity that has a long life in the document.

In order to learn to summarize, we use a formula in which we combine the features with learned weights, score(s) = w1*f1(s) + w2*f2(s) + ... + wn*fn(s), and we use linear regression to learn these weights (a sketch follows below). During training, the target scores for the sentences are the values that the informants have given us; during testing, obviously, these scores are produced by the system. Then, having scored the sentences, you can select the top n sentences, the top 25% of the article, or the top 10% of the article, or you can try to select as many sentences as the gold standard tells you to select. There are a number of evaluation metrics in text summarization that are used to compare automatic summaries to gold-standard summaries, and the approach that we have developed is very competitive: it outperforms centrality-based approaches, which in general work quite well, and also well-known baselines that are used in the evaluation of text summarization systems.
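Here is a minimal sketch of that scoring scheme. The feature values and weights below are made-up numbers for illustration; the real weights come from fitting the linear regression against the relevance grades in the annotated corpus.

```java
import java.util.Arrays;
import java.util.Comparator;

/** A minimal sketch of linear sentence scoring and ranking:
 *  score(s) = sum_i w_i * f_i(s), then keep the top-scoring sentences. */
public class SentenceRanker {

    static double score(double[] features, double[] weights) {
        double s = 0.0;
        for (int i = 0; i < features.length; i++)
            s += weights[i] * features[i];
        return s;
    }

    public static void main(String[] args) {
        // Illustrative weights for (position, cue-phrase, centrality) features.
        double[] w = {0.40, 0.25, 0.35};
        double[][] sentences = {
                {0.9, 0.0, 0.6},   // sentence 0
                {0.2, 1.0, 0.8},   // sentence 1
                {0.5, 0.0, 0.3}};  // sentence 2

        // Rank sentence indices by descending score; a summary would keep
        // the top n, the top 25%, and so on.
        Integer[] order = {0, 1, 2};
        Arrays.sort(order,
                Comparator.comparingDouble(i -> -score(sentences[i], w)));
        System.out.println("Ranking: " + Arrays.toString(order));  // [1, 0, 2]
    }
}
```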
Okay, so, related projects. There are many projects that deal with scientific documents. Probably most of you will know DBLP, a reference bibliography for information science and computer science; Google Scholar, which indexes essentially every paper out there; and BASE, CiteSeer, ArnetMiner, and other portals that provide information on scientific documents. All these tools are very useful: they aggregate information about authors, and they have to deal with the problem of author identification and of linking records together when they belong to the same author. They also provide information on citations and citation counts, but what many or most of them do not provide is semantics. They do not provide semantics for the information that they have gathered: they do not provide semantic representations of the information, they do not provide the internal structure of the document, and they do not provide different types of summaries tailored to the particular reader of the summary. Probably the closest work to ours is the series of Semantic Publishing Challenges, an initiative run in conjunction with the Extended Semantic Web Conference, where the objective is, given the proceedings of a conference, to annotate them with semantic information so that you can then ask and answer important questions about the evolution of a certain conference or a certain workshop. And if you use the ACM Digital Library, you will notice when you open a document that it says "powered by IBM Watson"; IBM Watson is a tool that performs text mining over massive amounts of information, so I think we are on the right track to have such a project here.

So, the contributions that we have obtained so far. In terms of publications, we have released the Dr. Inventor corpus, which appeared at the Language Resources and Evaluation Conference. Last week, Ana Freire presented two papers at the workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries: one is an overview paper of this project, and the other describes our participation in a text summarization challenge, where we obtained quite good results; in one of the tasks we were second, and in another task we were first. I was also presenting some work on summarization at NLDB last week. With respect to software and datasets, the Dr. Inventor library is available, along with the summarization software, and the Dr. Inventor corpus is available for download. We also have semantically organized and enriched collections, such as SIGGRAPH and ACL, and they are available for visualization. As for outreach activities, part of our plan when we presented the project was to organize a tutorial or a workshop somewhere, and we have a tutorial accepted at COLING, which is one of the main conferences in our field of expertise and will be held at the end of this year. Also, Ana and Diego are supervising a Master's student whose thesis is exactly on the topic of this project and will be defended in September.

So, we work with text, and we go from the text to the knowledge, but text is not the only thing that we should look at. As many of you have mentioned before, in a paper there will be references to software; the software will have documentation; the documentation will be in natural language, so we can use the natural language to interpret the software. Papers also refer to datasets; those datasets will be linked to their own documentation, and that documentation can be analyzed and enriched, so you can provide a semantic layer over the datasets. We only look at the text for the moment, but it would be very useful to explore the information that complements the text of the papers. From tables and figures we should be able to extract knowledge, because they may contain the key facts of the paper, the key comparisons between different approaches, so this is very important.
Also, if I present a paper, maybe tomorrow I will go and give a related talk, and it will be featured on video lectures, where I will talk in part about my work. So it is important that we also analyze the presentations in order to enrich the publications. And obviously there is more than the formal citations that we make in papers, because researchers are very active in promoting their research on social networks such as ResearchGate and Academia.edu, and also on Twitter, so it is important that we look at ways in which we can exploit this information in order to enrich the publications. Okay, that is all for the moment. Thank you very much for your attention, and if you have questions, I will be happy to answer.

Question: Very interesting work; I think it is really important, especially if you take, as you said, the medical field, for example, where there is a huge amount of publications. But I think the most challenging, and the most interesting, aspect is that in a lot of publications you see tables with measurements of a certain parameter in a certain patient group, and it would be extremely useful to be able to pull that data out; I guess it is also very challenging to work out which data belongs to what. Do you think you are close to being able to do that, or is this something where maybe in a decade we will be able to do it?

Answer: I hope we won't have to wait a decade; I hope that in the next round of projects we can continue this work by incorporating visual and tabular information into the research. It is quite challenging. We did some research on this before, although it was quite preliminary, and before I was here I did some research on that as well. It is very difficult because of the characteristics of tables: there are very different types of tables, and understanding their semantics is unclear; you may recover the structure but not the semantics. But by complementing the text where the figure or table is described, the caption, and all the text that appears inside the tables, something could be done. Depending on what you want to do, whether you need high precision or high recall, the techniques may vary, but it is a very important thing that we need to address.

Question: Related to that, looking at figures is really important, because if you go to Google Images now and type something, you already see a lot of figures from papers, but of course they are not useful out of context. How is your approach different from what Google is doing now, or how can it be complementary?

Answer: I suppose that we are very different from Google; they have all the resources in their hands to do this kind of thing. We concentrate on analyzing the whole structure, not isolated images. We are not doing image analysis, so we would not be able to analyze isolated items such as images; an image has to be taken in the context of the paper, so that we can take advantage of what is being said about the image in the paper in order to enable its interpretation.

Question: But do you see a possibility of leveraging both directions? This image-based approach is extremely interesting in some ways, and on the other hand your approach provides more information, so it would be interesting to reuse some of these things. Do you see any way of doing that, or would you say these are totally different ways of approaching it?

Answer: I think that reusing things is what we do: in our approaches, we look first at whether there is a tool that is able
to do something that is reasonable for us, something we can take as it is, or take and improve. So if a tool already provides a deep analysis of the images that appear in a paper, and the paper gives us some clues and explanations about those images, then making the link between the two would be very useful. Thank you.

Question: I have a quick question; it is actually somewhat related. One of the things I am interested in is whether we can query the scholarly record and ask a question such as: find all the publications that use the famous dataset X, Y, Z. Should that be possible?

Answer: This project started six months ago, but with the whole structure that we have, that would be possible. It would be possible to say: show me papers that compare part-of-speech tagger X with part-of-speech tagger Z, and tell me how they compare. That is interesting to do. Thank you.

Question: What about paper recommendation? It would be nice if the system could do that.

Answer: This is one of the case studies that we want to address: paper recommendation, and also author recommendation for possible collaborations. If you have a European project that you want to set up, and you have your topic and you need expertise in some field, it would be useful for the system to tell you whom to talk to. So we are very interested in those case studies.