Okay, good morning, everybody. Bon dia to Tom. Today we have Horacio Saggion, who has been an associate professor in our department since 2014, I guess. Right? 2015. Yeah, so he got tenure two years ago. Horacio got his PhD in Montreal a long time ago, in 2010. He then moved to Sheffield; seven years is a lot in our world, I think. So he moved from Montreal to Sheffield, and he also spent some time at Johns Hopkins, in the US. And now he is with us, integrated in the TALN group, but also creating and leading a new lab, and leading some European initiatives like Dr. Inventor. And I think he is now quite active in the María de Maeztu programme, leading some of the projects there. Horacio, please go ahead, you have 45 minutes and then we'll have some questions.

Okay. So thank you very much for the invitation, for giving this seminar. I will explain to you the problem that I have. A few years back, Francesco Barbieri came to me and told me that he wanted to study irony detection, and I don't know anything about irony detection, nor anything about figurative language, nor anything about how to process social network texts. And this is not the only problem that I have. I have Luis Espinosa, who came to me and told me that he wanted to automatically identify definitions in large text collections, and although I have some experience in that, it was ages ago, and I would have to become more of an expert in this field again. And this is not my only problem. Now I have Francesco Ronzano, who comes to me and tells me that we need to study deep learning, because our algorithms are not improving so much, and I don't know anything about deep learning, I know very little about machine learning, and neural networks have been in the literature for many, many years. So what I decide to do is to go to these wonderful, curated databases of scientific information in order to find answers to the problems they bring to me. But the problem is that these datasets contain millions and millions of curated records, and even for the domains that we work in, there are still too many records out there. So suddenly I become overwhelmed by this overload of scientific information. But instead of becoming overwhelmed, we should become happy, because this brings us opportunities for research. In fact, if we are able to transform this textual information into structured knowledge, you can imagine that we could devise better question-answering systems: you put a question to the system and you get the exact answer, a scientific answer. That is very good. Then you could improve, for example, the creation of scientific summaries, not only the abstract that the author gives you, but more complete summaries of the paper. And we could also imagine that the system would go to those collections and extract useful information with which we could write a state-of-the-art report directly, without having to read all the papers. If we transform the text into structured knowledge of some kind, we can imagine that machines will go to that knowledge and maybe draw connections between otherwise unrelated papers.
Or, if the knowledge is well compiled, we could think that machines could do some reasoning over the knowledge bases, even propose problems for us to solve, so we could give new problems to new PhD students to study, and the machines could also propose hypotheses for us to test. Not the machine testing the hypotheses, but us testing the hypotheses that the machine has proposed to us. So we were extremely lucky, because a few years ago we had the chance to participate in this project, Dr. Inventor, funded by the European Commission in the area of scientific creativity. The project will end in a few months. The idea of this project is to create a support tool that will make comparisons between otherwise unrelated papers and try to come up with useful, not far-fetched, comparisons between them. So you can imagine that you may have a paper, for example, that studies a particular problem, and for that problem the paper has proposed a technique to solve it; now you have another problem, and you may think: okay, this problem is analogous to the one described in this paper, so maybe an adaptation of the technique proposed there can work for my problem. So the idea is not that the system will solve the problems of scientific research, but that by looking at the analogies the system proposes, maybe you get new ideas to explore. The point is that we work in natural language processing, text processing, and we needed to support this project with our expertise in several areas of natural language processing, in particular transforming text into structured knowledge representations. What happens is that most scientific papers are available in PDF format, so that is one of the problems we have to solve, but it is not the complicated one, because there are many tools that can deal with PDF and transform it into a text format. So that is not the problem. The interesting thing is that the scientific paper is a complex record of knowledge: in a scientific paper you may find information on how to cure a particular disease, how to create a new medicine, or the fact that a new planet was discovered. These are very important things for humanity. And a scientific paper will usually follow an argumentative discourse. In a scientific paper someone will tell you there is this problem that is very important to investigate; then it will tell you that some people have attempted to solve this problem before, but their solutions have some limitations; and then you say: here I have a better solution that overcomes these limitations. This pattern of argumentation is very common in scientific papers. We should also acknowledge that the scientific paper is not an isolated unit of knowledge, but is connected to the past through citations and will also be connected to the future through further citations. And these citations are not just mentions of a related paper: they are made with a purpose. You cite a paper in order to praise the solution of a colleague, or to criticize the solution of a colleague, or to compare a paper with your own solution. So these are very important things in the scientific document. In order to investigate what information to extract, how to extract it, and what type of information to extract from a document, we have created and annotated a dataset in the project. This is work done with Beatriz Fisas and Francesco Ronzano. We had an annotation tool over the web.
It's a collaborative annotation tool, so people from different places can annotate documents. And we were interested in several aspects of the scientific document. One aspect that was very interesting is rhetorical information. We asked annotators, in this case three annotators for each document, to go to each sentence in each document and tell us whether the sentence reflected one of our main rhetorical categories. So, for example, is this sentence talking about the challenge that the author is facing, or is this sentence referring to the outcome of a particular investigation? And we had three annotators for each document. Then we had annotators who are experts in the field of computer graphics, which was our application domain, go through each paper and, whenever there was a reference, tell us why the reference was made in that particular part of the paper. Was it to criticize a particular work, or to say that you, in your paper, are using the technology developed in that paper for your own research? And we also had three annotators here. Finally, we were interested in the relevance of the information in the paper. We wanted to know which sentences an expert in the field would consider more important to include in a summary. So the experts went to each sentence in each document and told us whether the sentence was very relevant and should be included in the summary, or completely irrelevant and should not be included in the summary. Okay, because we have several annotators for each of the tasks, what we did then was to harmonize the annotations and create an annotated corpus. This annotated corpus is available for research. You can imagine that with this corpus you can do many different things. And based on this corpus, and also based on the needs for analysis of the scientific document, we developed a tool that performs deep analysis of scientific papers. This tool has many components; I will go very briefly through them without entering into too much detail on any of them. The input to this system is a PDF file, or it can also be some XML-compliant publication format, and we need to transform this into a rich representation. So our first step is to transform the PDF into our document format. Here what we need is a system that is able to identify the title of the document, the abstract, the different sections and, very importantly, the bibliographic references and all the individual reference lines. We are lucky because there are many tools available out there that you can use. We tested several tools and decided to go for one called PDFX, a tool developed at the University of Manchester that is available as a service. But a service can be complicated to integrate into a system: you have to call the service, sometimes the service is down, or sometimes it takes too long to process a document. So we also have other options, for example libraries that can deal with PDF, within our own library. Then we have a process of word and sentence identification. With our NLP tools we need at least to have sentences identified and tokens within those sentences. In order to do that, we use tools that come directly with the GATE system. But the problem is that these tools were trained, or developed, for newspaper articles in general.
So you can imagine that tools for sentence identification and tokenization will fail on scientific documents. What we did was to see how this system performs on our corpus of documents and then correct the tokenizer and sentence splitter so that they do not commit those errors on our texts. So now we have an adaptation of sentence splitting and tokenization for scientific documents. Another important part that we have to deal with is the identification of citations. We need to identify in the body of the paper when a reference is made, and this can be done pretty much with pattern matching. So we have a series of patterns that cover different citation styles; we fire all these patterns over the document, then we have a voting mechanism to decide which style was used in the particular document, and then we can annotate the body of the document with the citations. We also need to make the link between the citation in the body and the reference in the paper. We are quite lucky because PDFX and the other tools already recognize the reference lines, but we still need to make the link. In order to make the link, we use heuristics. For example, your citation format may mention the author of the article and the year, or the initials of the author and the year, and you can use this information to go and retrieve the correct reference in the bibliography section. We are also interested in identifying the role of a particular citation marker in the body. If I say "John Smith 1996 developed the best algorithm for part-of-speech tagging", then John Smith in that particular sentence is the subject. It is very important to recognize that John Smith is the one who developed this thing, and that in this case John Smith is playing the role of the subject in the sentence. It is important to recognize this. We have heuristics and a set of rules to recognize whether a particular citation is playing a syntactic role in the sentence. In addition, we want to enrich the citations with more information. There are several services that provide such information, so we use many of them in order to collect information about bibliography entries. We query them, these are services that we have to call, and by using them we can obtain information about the authors, the title of the publication, where it was published and the year of publication. Okay, so now, in order to support information extraction and other activities that come down the line, we need linguistic information. What we do is rely on a dependency parser developed by Bohnet, which gives us the lemma for each word, part-of-speech information for each token in the sentence, morphological information and also dependencies. Dependencies are relations between pairs of words, and these are very important for semantic processing. Then we have a process of coreference resolution. You can imagine that sentences in a text are not isolated; they are connected by several relationships, and one important relationship is coreference. You introduce an entity in your paper by means of a noun phrase, for example, and later on you refer back to that particular entity by means of another noun phrase or by means of a pronoun. So we have implemented parts of the coreference resolution system by Stanford; that state-of-the-art coreference resolution system is deterministic and uses rules in order to carry out the work.
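Going back to the citation markers for a moment: here is a minimal sketch of the pattern-matching and voting idea described above. The patterns and style names are my own simplified assumptions, not the actual rule set used in the project.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns for a few common citation styles.
CITATION_PATTERNS = {
    "numeric":     re.compile(r"\[\d+(?:\s*,\s*\d+)*\]"),                    # [12], [3, 7]
    "author_year": re.compile(r"\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}\)"),  # (Smith et al., 2010)
    "superscript": re.compile(r"(?<=\w)\^\{?\d+\}?"),                        # word^{12}
}

def detect_citations(text):
    """Fire every pattern over the text, then vote for the dominant style."""
    hits = {style: pattern.findall(text) for style, pattern in CITATION_PATTERNS.items()}
    votes = Counter({style: len(found) for style, found in hits.items()})
    dominant_style, _ = votes.most_common(1)[0]
    return dominant_style, hits[dominant_style]

style, markers = detect_citations(
    "Dependency parsing [3] improved tagging (Smith et al., 2010) considerably [3, 7]."
)
print(style, markers)  # numeric ['[3]', '[3, 7]']
```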
So we have implemented some of the rules of that coreference system in order to apply them to our documents, and then we have evaluated it on abstracts of these papers and improved it. Then we have a process of causality identification. You may think that causality is something very important in scientific documents; I may say, for example, that by using this particular feature, my system led to some improvement in part-of-speech tagging accuracy. This cause-effect relationship is very important. I may be interested in finding, for example, papers or sentences where there is evidence that some improvement has been made to a particular part-of-speech tagger, and it would be very interesting to recover that from textual data. For causality identification, we use a linguistically motivated approach that is able to identify, within sentences and across sentences, the cause and the effect of the causality. In order to do that, we use lexicons of verbs that imply causality, and also keywords that are usually used to express causality, as in this case, where you have the phrase "as a result". This information is also used later on for information extraction. Then we have a process of rhetorical annotation: based on our corpus annotated with rhetorical information, we can train a machine learning algorithm. The approach is that we test different sentence classification algorithms, trying to optimize the parameters and to devise good features for those particular algorithms. The idea is, for example, that the sentence in red will be identified as a challenge for the investigator, because it says that there are some limitations in previous work, and the sentence in blue will be the outcome of the investigation, where the author states the type of algorithm that has been developed and that the algorithm has acceptable levels of complexity. The features that we use to train the algorithms vary. We use, for example, word-based information; we use part-of-speech information; we use dependencies; we use the position of the sentence in the different parts of the document; and we use vocabulary specifically designed for the scientific document to mark specific moves that the author may make, such as introducing something, talking about his own work, or talking about the work of other investigators. Then we have a process of citation purpose identification. I put it in a dotted line because it is still not performing at the level that we need. But basically this component should be able to identify sentences that, for example, point to criticism or point to some advantages of a particular method. The approach that we take here is similar to the previous one: we use machine learning and we design a set of features. Some features are similar to those used in rhetorical classification, but some of them are specific to the citation purpose classification task; an opinion lexicon, for example, may be useful here. Then we have a process of information extraction that uses many pieces of information from the previous modules. The idea here is to create abstract triple representations that will usually contain subject-predicate-object triples. These will be connected if the same elements are mentioned in different triples, and they will also be connected if, for example, some coreference relation exists between different noun entities, or between a noun phrase and a pronoun.
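A minimal sketch of the subject-predicate-object triples just described is given below, extracting them from a dependency parse. The talk mentions Bohnet's parser; spaCy is used here purely as a stand-in, so the dependency labels are spaCy's and the example sentence is invented.

```python
# Sketch of subject-predicate-object triple extraction over a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(sentence):
    triples = []
    for token in nlp(sentence):
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr", "dative")]
        for s in subjects:
            for o in objects:
                # Keep the full noun phrases (subtrees), not just the head words.
                triples.append((" ".join(t.text for t in s.subtree),
                                token.lemma_,
                                " ".join(t.text for t in o.subtree)))
    return triples

print(extract_triples("The authors introduce a neural model that extracts events."))
# e.g. [('The authors', 'introduce', 'a neural model that extracts events'), ...]
```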
And we will also create links between nodes that are involved in causality relationships. Then we perform a process of summarization using, in part, one tool available in our department called SUMA. The idea here is to highlight or identify the key information, the key sentences of the paper; in this case, for example, that the paper introduces some method and that the method has certain advantages, okay? We use some of the features from this system, but we also devise our own features for the scientific document. And because people are usually interested in this particular aspect, summarization, I will tell you a little bit more about how we do it. In the case of Dr. Inventor, we have the annotated dataset where each sentence has received a human score from three different annotators. We aggregate these scores and we get a number between one and five for each sentence, which can be used as the relevance of the sentence. Then what we do is design a set of features, based on previous literature or on features that are already available and implemented, that have to do with, for example: the document structure, that is, where the sentence is located in the document; the vocabulary being used, that is, whether it contains vocabulary marking specific relations in the scientific domain; term distribution, that is, whether the sentence contains terms which are key in the document; and rhetorical information, that is, whether the sentence is more likely to contain a statement of the challenge the author faces, the outcome of the investigation, or a reference to future or previous work. We also use some text similarity computations, for example whether the sentence is similar to the title of the document or to the abstract provided by the author; some centrality information, for example we run the TextRank algorithm on the document, and also on parts of the document, rank the sentences and create features from that; or, for example, information about the centroid, how central a sentence is with respect to the overall information in the document. And having those manually provided scores, we can train, for example, a linear regression model, learning weights for those features, and then apply it to new, unseen documents. In general, for several scientific summarization tasks, such as the summarization of patents, the summarization of documents in the Dr. Inventor corpus, or the summarization of papers in the ACL Anthology, we are obtaining acceptable results. Okay, so now, what can we do with this rich information that our tool is producing? Francesco Ronzano has prepared some visualizations. These are visualizations based on the analysis of the ACL Anthology, from the Association for Computational Linguistics. We have analyzed huge volumes of this anthology, and this is a visualization that has been produced. Obviously you have the title of the document and the authors, but you can also access, for example, each of the sections of the paper; obviously many tools can do that. You can go to the citations and see the metadata that has been incorporated into the document. You can go to the rhetorical classes and, for example, quickly ask what challenge this author is facing. For example, in this case, the fact that traditional approaches lack generalization is a challenge that they have to face in this investigation.
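Going back to the relevance-scoring model mentioned a moment ago, it can be pictured with a small sketch: sentence feature vectors regressed against the aggregated one-to-five annotator scores. The feature names and numbers below are invented for illustration, assuming scikit-learn; the real system uses a much richer feature set.

```python
# Sketch of sentence relevance scoring via linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per sentence:
# [position_in_document, similarity_to_abstract, is_outcome_sentence, key_term_count]
X_train = np.array([
    [0.05, 0.80, 1, 6],
    [0.10, 0.20, 0, 1],
    [0.55, 0.65, 1, 4],
    [0.90, 0.10, 0, 0],
])
y_train = np.array([4.7, 2.0, 4.0, 1.3])  # aggregated human relevance scores (1-5)

model = LinearRegression().fit(X_train, y_train)

# Score the sentences of a new, unseen document and rank them.
X_new = np.array([[0.02, 0.75, 1, 5],
                  [0.60, 0.15, 0, 1]])
scores = model.predict(X_new)
ranking = np.argsort(scores)[::-1]  # sentence indices, most relevant first
print(scores, ranking)
```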
You can also try to see what the system thinks the best sentences for a summary are. Here, in the introduction, the system picks up three sentences: one referring to the problem being investigated, one referring to problems with previous work, and one referring to the specific investigation carried out in this paper. Then you can analyze, for example, the citations, how they are distributed across years and what these citations talk about. And you can also access the context of the citations made in the paper. Finally, you can go to the semantic representation that we produce, in the form of connected triples. This is a representation of the abstract of the document, where you can get an idea of what the paper is about. For example, you can see that this paper has proposed some event extraction method that aims to extract some features. You can see that the authors have introduced a model; this does not tell you much, but it tells you that the authors are proposing some particular neural network architecture. And this is the information that our partners in Dr. Inventor are using to find analogies between different papers. But obviously this is just one paper, so you are not overwhelmed here, because it is just one paper; what happens if you have to analyze huge volumes of papers? Well, you can use the Dr. Inventor Text Mining Library, which is available for download. It comes with documentation and examples of use, and it exposes an API so you can process your own documents. It relies on several services. If you want to analyze volumes of data, our idea is that you go to a scientific publication website where PDFs are available for free, along with metadata; we can bring all that information here to UPF, process it with Dr. Inventor, then semantically index it and allow people to do analysis and visualization. And these are some visualizations that Pablo Acosta has produced for us using the Microsoft Academic Graph and documents from the ACL processed by our tools. So, for example, this is a panorama of the authors and co-authors that participate in ACL-related events, and you can also see the authors on a map, with the publications geographically located. That is the first thing you can do; then you can go a little more in depth and ask, for example, about a particular author, in this case Chris Dyer, and you can see where this person has been. In many places in the US; you can see this person geographically located with his number of publications in ACL-related events, and then you can see his collaborators and where they have been. So you can see the internationalization of this particular researcher. Then you can wonder: what do they study in these ACL-related events? You can take the whole set of ACL-related events and get a cloud of concepts to know what we investigate in this particular field. Obviously "sentences", "words" and "information" rank very highly. You can also see the number of publications that exist for each of the events and the geographical location of the papers on the map. You can then also be interested, for example, in examining how the topics have varied in this conference.
So you can see, for example, that natural language runs across the whole span of years, but you can see that information extraction and question answering started a little later, maybe in the 90s, when some related events were organized in the US to fuel research in information extraction and then in question answering. You can also see that machine learning and language models, that is, statistical techniques, came a little bit late in our domain. Okay, so finally, you can go to a particular conference, for example, in this case, the main conference, and see what people are studying there and also the evolution of the number of papers that have been published. And you could go to a particular topic, such as word embeddings, and see how many publications are about word embeddings and how the popularity of this topic is rising. And you can see all the papers that have been published in this field and the topics that are related to word embeddings. Okay, so this is what you can do with massive amounts of information: you can go down and explore the universe that you have. And this was about the scientific paper. The scientific paper is a very complex record of knowledge. Obviously, it is a particular type of document meant for a specific audience, but there are also documents made available for a general audience which also have their complexities. And this leads me to the second topic I would like to address, another line of research that we carry out in our laboratory: text simplification. There are texts, like newspaper articles and texts that appear in encyclopedias, that are meant for everybody, but which are actually very complicated to read and understand for certain people. For example, if I go to the definition of opera in Wikipedia, you will find this sentence: "Opera is an art form in which singers and musicians perform a dramatic work combining text (libretto) and musical score, usually in a theatrical setting." It is a very long sentence. It contains complex syntactic constructions, and it also contains some complex vocabulary, such as "libretto", and expressions such as "musical score" or "theatrical setting". That can pose difficulties for several people. So there are organizations that prepare what are called easy-to-read texts. These usually describe a topic or tell you a story with short sentences and more common vocabulary, and usually that logo there is used to indicate that this is an easy-to-read text. You can find them in printed material and also on the web. One may argue that articles that appear in Simple Wikipedia are easier to read and understand. For example, this is the same topic, opera, as it is defined in Simple Wikipedia: opera is a drama set to music, and an opera is like a play in which everything is sung instead of spoken. A little bit easier to understand. But there are many examples of simplified material. For example, the Spanish Constitution can be a document that is very difficult to read and understand. Article 7, for example, talks about the rights of the workers and also of the companies, and it is a very long sentence. And there are versions of the Spanish Constitution that are easy to read. In this one, for example, you have one sentence for the rights of the workers, in short sentences and clearer language, and also the rights of the companies. News articles are notoriously difficult to read and understand.
This is an example from a site called Time that publishes news in English. That long sentence with a lot of jargon is the first sentence of an article about the dismantling of the Jungle camp in Calais. And in the UK there is a site, from United Response, that also provides news, in this case the equivalent news, but in a very easy-to-read form. And believe it or not, even opera can be made accessible. A few days ago I went to the Liceu and bought the booklet that you can buy to read the libretto, and I was surprised: there was a logo that you can scan with your mobile to get the easy-to-read version of the libretto. So even opera, believe it or not, can be made easier. Okay, so adapting the content of a document can be very beneficial for several users, such as people with a language impairment, people who suffer from aphasia, or people with an intellectual disability. Even people who are learning a second language, for example, who probably do not master the whole lexicon and grammar of the language they want to learn, and for whom some adapted material may be necessary. Also, people who are non-specialists in a domain will obviously find it difficult to understand the terminology, even if they have knowledge of the grammar of the particular language, so some adaptation will be needed. Manual adaptation is, however, very costly to produce. It requires human experts and it is time-consuming, so it is not cost-effective when you have to simplify, for example, all the news that is being produced. You cannot imagine having editors constantly creating easy-to-read news for the public. So we have had a line of research in our laboratory since my arrival here, and we have been working on two different projects to try to automate the creation of these easy-to-read texts. One project was called Simplext, where we developed the first text simplification system for Spanish. The other project we are working on now is called Able to Include, to develop accessible technology, in our laboratory in particular text simplification technology, to include people with intellectual disabilities in the information society. And here Montserrat Marimón and Daniel Ferrés are collaborating with me. Okay, so, to give you an example: the question is whether we can produce this type of simplification automatically. To some extent, yes; we cannot achieve the performance that a human would achieve, obviously, and this happens in many activities in natural language processing. So this is a demonstration of our system called YATS; the interface was developed by Daniel. The name stands for Yet Another Text Simplifier. The interface is like a common translator, where you can type or copy and paste a text, press the simplify button, and get a simpler version. It is a prototype; it is not something that you can use all the time yet. So you have a sentence there: "Your automobile has been repaired by our mechanic. Please come collect your automobile from our office." The system will transform the first sentence, changing the vocabulary, so automobile into car, and the passive voice will be transformed; also the verb repair will be transformed into fix. And then the second sentence also has some vocabulary transformations. So this is the level of simplification that we can achieve: no complex or paraphrastic transformations for the moment. So this is an example of our other simplifier.
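Before looking at the Spanish simplifier, here is a minimal sketch of the kind of lexical substitution just shown with YATS (automobile into car, repair into fix), using NLTK's WordNet interface. This is only a rough illustration under my own assumptions: the real pipeline, described next, also runs complex word identification and word sense disambiguation, while here the first sense is simply assumed and the "simpler" synonym is chosen by WordNet's corpus counts as a crude frequency proxy.

```python
# Minimal lexical-substitution sketch (requires: nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def simpler_synonym(word, pos):
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return word                                       # nothing known about the word
    sense = synsets[0]                                    # assume the first listed sense
    best = max(sense.lemmas(), key=lambda l: l.count())   # most frequent lemma in that sense
    return best.name().replace("_", " ")

print(simpler_synonym("automobile", wn.NOUN))  # expected: car
print(simpler_synonym("repair", wn.VERB))      # expected: fix, mend, or repair itself, depending on counts
```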
If you understand a little bit of Spanish, this is Simplext; it has the same kind of interface. You can enter a sentence or a set of sentences and then ask the system to simplify them for you. In this case, the first sentence is split into two less complicated sentences, and for the second one there is some transformation of vocabulary. And this is also what we can do, so not very complex transformations. In order for you to understand how we do these things: we do them in two steps. One is called lexical simplification and the other is called syntactic simplification, and I will go little by little without entering into much detail on the technology that is used. Given this text, "Your automobile has been repaired...", we have a process of text analysis. This process will give us, for example, lemmas and part-of-speech tags for each word. Then we have a process of complex word identification that hopefully will tell us that "automobile" is probably complicated to read or understand, and that "repair" is also complex to read or understand. So we have to find replacements for these elements. Then we have a process of trying to find simpler synonyms for these words. What we do is query a lexical database such as WordNet, which is a curated database, and that is very important; we may get this list of possible synonyms. In this case we are lucky, because "automobile" is a monosemous word, so we do not need to worry about different senses. But in the case of "repair", if we query WordNet, it gives us up to five lists of synonyms, because "repair" is polysemous. So here we need to pick the right sense; we have a process of word sense disambiguation, and hopefully this process will pick this list for us. Now we need to identify a simpler synonym from the list, so we have a process that selects "car" for "automobile" and "fix" for "repair". And then we have a process of sentence generation, and hopefully this will be easier to read, although we do not achieve this in all cases. Then we have a process of syntactic simplification. Here, again, we need some text analysis, which will also give us lemmas and parts of speech, but here we need a deeper syntactic analysis of the sentence. So we use, again, a dependency parser, the one we use generally in our laboratory, for example to tell us that "Eva" in this sentence is the subject of "leaves". This is very useful information, together with all the other dependencies between the different words in the sentence. What we apply here is a linguistically motivated, linguistically informed approach. We have a set of simplification grammars. These grammars are in charge of identifying complex constructions in sentences and deciding how to transform the sentence into new sentences. So, with the simplification grammars, for example, we identify that "Eva" and "leaves in France", these two parts, form the main clause of the sentence, and that this other part is the relative clause. And then we have a process of sentence generation that uses these annotations plus additional information, for example the fact that the system is able to resolve that the pronoun is linked to "Eva", in order to generate this sentence and this sentence, which have some transformations as well. And hopefully this will be an easier-to-read text. Okay, so we obviously perform evaluation. In general, what we do here is in vitro evaluation.
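Before turning to evaluation, here is a minimal sketch of the relative-clause splitting just described. The real system uses hand-written simplification grammars over its own parser; this toy version uses spaCy's "relcl" dependency label as a stand-in, re-attaches the antecedent as the subject of the new sentence, and only handles the simplest subject relative clauses. The example sentence is invented.

```python
# Toy relative-clause splitter over a dependency parse (spaCy used as a stand-in).
import spacy

nlp = spacy.load("en_core_web_sm")

def split_relative_clause(sentence):
    doc = nlp(sentence)
    for token in doc:
        if token.dep_ != "relcl":          # a verb heading a relative clause
            continue
        antecedent = token.head            # the noun the clause modifies
        clause_ids = {t.i for t in token.subtree}
        # Main sentence: everything outside the relative clause, punctuation aside.
        main = " ".join(t.text for t in doc if t.i not in clause_ids and not t.is_punct)
        # Second sentence: the antecedent plus the clause minus the relative pronoun.
        rest = " ".join(t.text for t in token.subtree
                        if t.lower_ not in ("who", "which", "that"))
        return main + ".", antecedent.text + " " + rest + "."
    return sentence, None                  # nothing to split

print(split_relative_clause("Eva, who lives in France, speaks three languages."))
# expected roughly: ('Eva speaks three languages.', 'Eva lives in France.')
```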
So we evaluate against datasets, created datasets, or we evaluate with speakers of the language; we assess the performance of our system and try to improve it whenever possible. In vivo evaluation is generally done by our partners; in our projects we have, in general, no access to the final users. Okay, let me show you how this technology is integrated into actual applications for the information society. There are different applications where text simplification and other accessible technologies are being used in the Able to Include project: for example, an accessible Facebook, an accessible WhatsApp, and also an accessible emailer. When I use Gmail, I feel overwhelmed, because Gmail has too many functionalities and it is difficult to manipulate the menus. Maybe not all of you are like me, but I feel it is very complicated to manipulate, and I can imagine that for a person with disabilities it can be even worse. So a company that was collaborating with us in Able to Include developed this email client, which is very easy to use and also incorporates accessibility functionalities. It only has the basic functionalities that you need: reading, sending a message and managing your contacts. These are messages that I have received; I can go to any of them and, if I do not understand the message, ask the system to simplify it for me. This is the same sentence that I showed in the interface. So I can ask the system to simplify, and the text will be simplified. You can use some standard text-to-speech technology ("Our mechanic has fixed your car") to read out the message. And other partners in the project are providing pictograms, that is, a translation of the message into pictograms. The emailer is called Columba; it is still under test, but we think it will be a useful tool for society. Okay, so... I have argued that we have a problem of scientific information overload and, in general, of textual overload, and I will call this big unstructured data, which is very pertinent for this department. And we have tools that are able to transform this big unstructured data into some knowledge representation that humans can use for performing intelligence studies, for example, or for answering very challenging questions, or for producing better summaries for scientists. For machines, this knowledge can be used to provide useful recommendations, for example; it could be used to draw inferences, to find associations between unrelated papers, and probably also to formulate some hypotheses that later on we would be able, or would want, to test. There are several aspects where we lack expertise, and this brings many opportunities for collaboration. For example, we have produced some visualisations, but we are not experts in visualisation, so we need your help in this area to produce the best and most useful visualisations for scientists to access huge volumes of information. We need to be able to perform domain adaptation. We have been working on a particular domain, computer graphics, but there are many challenging domains in science that we would like to adapt to. And here in this department there are many domains being covered, so it would be useful: if you have a collection, we could try to adapt our tools to that collection. It would also be interesting to go beyond text and to incorporate other types of scientific data. This could be software, for example, that is used in scientific investigation.
It could also be the analysis of datasets, complementing textual information with data, and also the analysis of images, which usually complement the text; the text can be used for a better understanding of the images, and the images can contribute to the understanding of the text. With respect to the complexity of documents, we have technology to produce, in several languages, versions which are hopefully simpler to read and understand, and these can be incorporated into useful tools for society, for example to improve accessibility. And here we lack expertise in several fields. For example, one thing that I think would be promising is to adapt simplification to domains that are being dealt with in this department, such as education. It would be very useful to be able to provide simplified versions of educational material or of health material. Reading health information is extremely complicated: the vocabulary is very difficult, and sometimes the expressions are not well understood, nor are the consequences of what we read. We have no expertise, for example, in psycholinguistics or psychology, so in that sense we are unable to create language resources containing psycholinguistic information that could help us develop better simplification systems. And we also lack expertise in human studies: we do not know how to test, or we know very little about how to test, whether a simplification solution can be understood, or which aspects of our solutions are good or bad for specific audiences. Also, we do not have much experience in in vivo evaluation with the target users. So I think there are opportunities for collaboration here. If you are overwhelmed by information, we have some tools in summarization, simplification and information extraction that may help, and also solutions that need improvement, so your collaboration will be very welcome. These are the people who are working with me or have worked with me in the past: Francesco, Beatriz, Montserrat, Daniel, Simone, Luis, Francesco, Ahmed and Pablo. So thank you very much for your attention.

Who starts? Leo. Okay. Thanks, Horacio, for the nice talk. I have a question on the first part of your talk. We saw the origin of Chris Dyer; he was mentioned as affiliated with CMU and with the Language Technologies Institute, no? And the Language Technologies Institute is, well, an institute of CMU. So do you use some kind of ontological or, let's say, domain representation to recognize that the Language Technologies Institute is part of CMU, or something like that?

No, we take the information that is provided by the Microsoft Academic Graph, and this is the information we use to represent this in the visualizations. So there is no ontology supporting this tool. Okay. Thank you.

Sorry, so it was a very interesting, very impressive talk about the many capabilities, the many things that you can do. I was wondering more about the methodological aspect, the methods that you use. Maybe it goes a bit against the direction of the talk, but I would like to de-simplify a bit the description of your work. You mentioned something about machine learning, something about pattern matching; I gather also that you work on ontologies and things like that. I don't know if you can comment briefly, in a nutshell, on the main work that you do methodologically.

Yeah, so we use both, depending on the problem.
We can use linguistically motivated pattern-matching approaches, for example in the case of simplification, and we can also use machine learning in some aspects, as I said. For summarization, for example, it would be very difficult to come up with patterns that provide a good solution, so there we use machine learning; in general we need an annotated dataset or some kind of human judgment about the relevance of information. Those datasets can usually be obtained. For example, in the case of the scientific document, you may think: okay, I can train an algorithm because I have the abstract of the paper, so I can train with that information; considering that the abstract contains the main information, you can go to the document, find sentences similar to that particular piece of information, and then train an algorithm with the relevance judgments that you induce by comparison with the abstract. Or, in our case, we are lucky because we decided it was important to produce a corpus with relevance values, which can be used to train a classifier. And this classifier can be a linear regression algorithm, like the one we use, or, if you have a discrete set of categories, very relevant, irrelevant, and levels in between, you can also classify the sentences. Yeah, machine learning. And with respect to the approaches in general, although we are trying to move to other methodologies, in general we design features based on previous literature, and also features that we consider important, and then we assess them to see which ones contribute or do not contribute to the task. In general we perform cross-validation experiments, and if there are datasets available, there are many challenges in which we participate, and in those challenges you usually get some training data and some development data, and later comes the test data; we build our system, search for the best possible configuration, adjust parameters on the development data, and then participate with the test data. That is the way we go. For things such as simplification, where you require a precise output, I think that a linguistically motivated approach is better, because if something fails you can actually correct those errors and improve the system. So depending on the type of problem we apply one approach or the other.

Yes, what I wanted to ask was: you basically mostly use these simplification tools, as I understood, for these computer graphics articles. I was going to ask how easy it would be for anyone else to take this system and apply it in a different domain. Let's say someone wants to make a new tool; of course the domain has its own characteristics, its own features, for example, instead of challenges, maybe there would be another feature of what is going on in the news. How easy would it be for someone else to take the system, or maybe edit it? For example, is it open source, these things you develop?

Yeah. Okay, so it is not the simplification system but the whole processing of the scientific document that you are asking about. It is true that different domains have different ways of expressing information; we are already aware of that. For example, if you go to chemistry, there will be other categories that you have to tackle.
However, we are lucky because in some other domains there are also annotated datasets. For example, there are researchers who have produced annotated documents for chemistry, and there are annotated documents for computational linguistics papers, which probably follow a different argumentative approach. You would have to retrain the system on those datasets in order to be able to use it. The tool is available; it is not open source for the moment, and we are considering whether we should make it open source, but it is a library that you can install on your machine, in your organization, and you can process documents. But yes, you need to adapt the different tools. For example, the summarizer has been adapted to this particular corpus and to what the idea of relevance is for this particular type of text, what it means for a sentence to be relevant. So you have to retrain the algorithms; you cannot apply it directly to another domain. It may work okay, but you will probably need to tune the system to your new particular domain. The point is that the methodology is there: we have many features, and libraries that can compute features for you, and some of those features will adapt to the new domain.

Thank you, Horacio, for your great talk. Since you started from the original challenge of having so much information in one specific field, do you envision being able, in the future, to generate the perfect compendium summarization of one specific field of knowledge, so that we can go and read that and then we know everything about it?

Well, I am leaving that task to Ahmed, who is a new PhD student investigating the generation of state-of-the-art reports. Obviously that is very difficult, and there are already some works that have explored it; I think it is still immature research, and a lot of investigation will need to be put into it, but yes, it would be nice to have.

I have one question for you, which is about tech transfer. You mentioned that the Google interface is quite complex for you and for many of the users. Are you planning to talk to Google, to invite them to use these kinds of facilities in order to make the Google interface, the Gmail interface, more accessible?

If you have contacts, I will be happy to contact them, because if I send them a message, I do not think they will answer.

No, my question is: who would be the main user of these kinds of methodologies, regarding all the different phases that you have explored?

For these tools on accessibility, there is a huge population that needs accessible tools. Apart from people with intellectual disabilities or some cognitive disabilities, there is us, the older population, you know? I mean, we need better tools, better-designed tools, so there is a huge public that will appreciate this. When our partners who have access to the users evaluate this, they are fascinated by the possibility of using an emailer, you know, to just receive a message from someone they know, and then to write a message and be able to press the correct button, to send it and to find information. So I think these tools should obviously be put out there for the good of society. So yes, technology transfer is something that we would like to do.

Open source, or with licensing, or with APIs on the internet?

It depends. For the part about scientific text mining, we already have some ideas of what we would like to do; for example, software as a service would be a possibility.
For the emailer, I think it is too early to say, so I still do not have an idea; for the simplification technology, it could also be software as a service. We have to study that. OK, thank you.

Any other questions? OK, so, just before going to the snacks, let me make a couple of announcements. One is that the next seminar will be in charge of Rafael Ramírez from the MTG, the Music Technology Group. The other is that the last seminar of this year will be in charge of a professor who is not from our department, so we will experiment with inviting researchers from across UPF, and it will be Jordi Galí from the Department of Economics. He is an expert in international economics, and I think he will try to share his thoughts about the international economy and to find some synergies among us. So, thank you, Horacio, again. Thank you very much for coming here.