Now, all right, okay. So I'm speaking here from the Institute for Language and Speech Processing, Athena Research Center in Athens, Greece, and I'm really glad that Peter has already introduced a couple of times the term text and data mining. What I will mainly be discussing is legal interoperability in text and data mining, and the daily battles that people who develop and aspire to run research infrastructures dealing with text and data mining have to fight. So, text and data mining, through its first component, the term text, brings us back to language, because text essentially is language. One may wonder what the people who deal with language have been doing over the past couple of years, and I can tell you that, unlike what people thought in the past, and unlike what the early Chomskyan days back in the 1960s would somehow preclude, language, linguistics, language studies, language technology, text mining and all the rest of it have become a very data-intensive science, a very data-intensive technology. Realizing this, we started, seven years ago now, a pan-European network that is still running, and that we hope will keep running for a couple of years more, under the rubric META-SHARE. It focused mostly on two things: how you can foster the sharing of data in the language and language technology world, and how you can make data discoverable. In tandem with META-SHARE, there was, and still is, another research infrastructure called CLARIN, which runs all over Europe, and I have been personally involved in the development and running of the Greek node of the European CLARIN research infrastructure. Not surprisingly, this is called CLARIN:EL. And there we moved one step further.
So we added the question of processability of data on top of the issues of sharing and discoverability of data. And the latest development in this evolution is an infrastructure project called OpenMinTeD, which adds one very critical layer for running an infrastructure, namely the layer of interoperability. So in OpenMinTeD we deal with sharing, discoverability, processability and interoperability, and I will say a few more words about OpenMinTeD in a few minutes. Now, here you see a very nice example of what it means not to be interoperable. What you see in the red rectangle, the red shape there, is the operationalization of infrastructure; this is our main goal. Just to give you an example of how things are not interoperable: if I had been using this machine, then the word "infrastructures" on the slide would have been correctly hyphenated, and not the way it is right now. This is one of the simplest interoperability issues that we hope technology will resolve in a few years. Going back to OpenMinTeD. In OpenMinTeD we try to do the following. The researcher today tries to gather data from all the different sources one may imagine. These resources, these data sets, let's call them content, or corpora. One can query various repositories for corpora, but one can also query OpenAIRE, one of the main, if not the main, content hubs for scientific publications in their form as scientific data. The researcher can retrieve the scientific publications and form a corpus, the actual data set that he or she will be using for text and data mining purposes later on. Now, what does it mean to use it for text and data mining services later on? It means that he or she will try to find some services that will perform a very specific task on that particular data set. And of course the result of applying these text and data mining services to the created data set will be annotated corpora, will be new derived knowledge.
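The aggregation step just described, pulling publication records from several sources and merging them into one working corpus, can be sketched in a few lines. The record fields, source names and de-duplication key below are illustrative assumptions, not the actual OpenMinTeD data model:

```python
# Toy sketch: merge publication records from several (hypothetical) sources
# into a single corpus, de-duplicating by DOI where one is present.

def build_corpus(*result_sets):
    """Aggregate publication records from several sources into one corpus."""
    corpus, seen = [], set()
    for records in result_sets:
        for rec in records:
            doi = rec.get("doi")
            if doi and doi in seen:
                continue  # same publication retrieved from two sources
            if doi:
                seen.add(doi)
            corpus.append(rec)
    return corpus

# Illustrative records, as if returned by a repository and a content hub.
repo_hits = [{"doi": "10.1/a", "title": "Paper A"},
             {"doi": "10.1/b", "title": "Paper B"}]
hub_hits  = [{"doi": "10.1/b", "title": "Paper B"},   # duplicate of the above
             {"doi": "10.1/c", "title": "Paper C"}]

corpus = build_corpus(repo_hits, hub_hits)
print([r["title"] for r in corpus])  # → ['Paper A', 'Paper B', 'Paper C']
```

In practice the sources would be queried over their own APIs and the records would carry licensing metadata as well, which is exactly where the legal questions discussed below come in.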
This is the goal of OpenMinTeD, this is what we are trying to build, and we are almost halfway there, or even further, towards the goal. Now, why are we concerned with legal matters? We are concerned with legal questions simply because they pop up almost everywhere, whatever you do. When you attempt text and data mining activities on data sets, on resources, you have issues that have to do with the accessibility of the text corpora, no matter how these may have been built. You have the issue of accessibility of the knowledge resources or reference resources that one may use to annotate these corpora. And last but not least, you have issues with the language processing tools or text and data mining tools, be they in downloadable form or in the easier form of web services, that will operate on the data set to generate new knowledge. All sorts of legal questions pop up here: whether you are within the scope of copyright, whether you are infringing copyright, whether sui generis database protection applies or not, or whether you are lucky enough to be in the one country that is about to leave the EU, where a TDM exception really holds. Then, supposing that you are within copyright, you start dealing with legal contracts, what we call licenses. And then you realize that when you talk about licenses, you enter a real jungle, a legal jungle. Personally, if I do not have a legal professional with me, I don't think I can sign anything any longer. [Comment from the floor:] You have been quite polite, you called it proliferation; normally we call it pollution. [Speaker:] Yes, yes, I wanted to be politically correct. So: license proliferation and the interoperability of licenses. And last but not least, how can one formalize a license, how can one tell a system, a machine, what a real license says and what the actual rights are, so that the machine can understand? So let's see an example. In what we call the infrastructure era, a researcher normally does the following.
He pulls content from publishers, from a publisher's repository, or he may use repositories like CORE, or content hubs, as I said before, like OpenAIRE. And then he or she aggregates all these publications into a data set that, for our purposes here, we can call research data. This is not the end of the story, this is just the beginning. What does he want to do with the research data? Let's take an example. I don't know whether these are really readable, are they readable to you? Okay, right. I cannot read them from this far, but I think I remember what I wrote yesterday. So suppose that what you want to do is to extract entities. Entities are very important in research data because they are the main actors in the research discourse, entities or concepts. Then you may want to extract relations between entities. And to enable entity extraction, and to enable the extraction of relations between entities, you need some sort of linguistic annotation. Now, all these building blocks may appear in a software repository, GitHub, Maven, whatever, in an unordered fashion; what the researcher has to do is put them in order, to create what we call a workflow that will be applied to the data set. So he aggregates the different pieces together and creates a workflow, a data processing workflow or a text and data mining workflow. In doing so, many of the mining services will be using ontologies, many ontologies in the life sciences world, and some ontologies in the social sciences are popping up now. But in addition to ontologies, one may decide to use lexica or models: models of the language, or particular models of the research discourse of the particular research area. Right. Now let's see what happens from the legal point of view when you try to get the scientific publications from publishers, or from CORE, or even from OpenAIRE.
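The workflow idea just described, linguistic annotation feeding entity extraction feeding relation extraction, can be sketched as an ordered chain of components. The component functions here are crude stand-ins invented for illustration, not real OpenMinTeD services:

```python
# Toy sketch of a text and data mining workflow: components are composed in
# order and applied to every document in the data set.

def annotate(text):
    # Stand-in linguistic annotation: whitespace tokenisation only.
    return {"text": text, "tokens": text.split()}

def extract_entities(doc):
    # Stand-in entity extraction: capitalised tokens count as entities.
    doc["entities"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc

def extract_relations(doc):
    # Stand-in relation extraction: every co-occurring entity pair.
    ents = doc["entities"]
    doc["relations"] = [(a, b) for i, a in enumerate(ents) for b in ents[i + 1:]]
    return doc

def run_workflow(dataset, steps):
    """Apply an ordered list of components to each document in a data set."""
    results = []
    for text in dataset:
        doc = text
        for step in steps:
            doc = step(doc)
        results.append(doc)
    return results

docs = run_workflow(["Aspirin inhibits COX"],
                    [annotate, extract_entities, extract_relations])
print(docs[0]["relations"])  # → [('Aspirin', 'COX')]
```

The point is only the shape of the thing: each block is interchangeable, and putting the blocks in a working order is exactly the researcher's job that the infrastructure tries to support.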
You get a license, one or more; you may get terms of use for data or for services; or you may get not licenses but just formal statements, or not even formal statements, just data tags, tags that tell you that, okay, this is public, or this can be used for non-commercial purposes, et cetera. Note that in most cases these tags are not standardized. You may have free-text statements, just free text, which makes it very difficult for the machine to work out what the intended meaning of the statement is, so there is no clue whether the actual data set can be used. And in many cases, I dare say, you may have no license at all: pure agnosticism, pure indifference to whether the particular object can be used, resulting in the researcher being held hostage by the uncertainty of there being no legal tag. It is sometimes even better to say, no, no, no, this is closed, don't touch, than to leave the researcher in an ocean of agnosticism. Or you may also have contractual agreements. What does this mean? It means that the researcher has to read and accept the terms of use of the various components of the workflow, creating, most probably, a jungle in his head first of all, and later on most probably deterring him from going on with his research task. And I remind you that this may be avoided today, only partially, in the UK. We do hope that things will change: France is coming up, and other countries are also considering adopting a copyright exception for text and data mining. Now, in OpenMinTeD, what we try to do is the following. We have this virtuous circle where, from the scientific publications and research data that you see on your left here, we push these data sets into processing workflows and we generate annotated data sets, we generate new knowledge. This is the goal.
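The heterogeneity just listed, proper licenses, ad-hoc tags, free text, or nothing at all, can be made concrete with a small sketch. The tag-to-license mapping below is an invented illustration, not a real registry; its point is that free text and absent statements leave the machine with "unknown", which is exactly the agnosticism that holds the researcher hostage:

```python
# Toy sketch: normalise the heterogeneous licensing signals attached to
# corpus items into either a known licence identifier or "unknown".

KNOWN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-NC-4.0"}
TAG_MAP = {"public": "CC0-1.0",              # assumed ad-hoc tags, for
           "non-commercial": "CC-BY-NC-4.0"}  # illustration only

def normalise(item):
    lic = item.get("license")
    if lic in KNOWN_LICENSES:
        return lic                 # a standard licence identifier
    tag = item.get("tag")
    if tag in TAG_MAP:
        return TAG_MAP[tag]        # a recognised, if non-standard, tag
    return "unknown"               # free text or nothing: no machine clue

items = [{"license": "CC-BY-4.0"},
         {"tag": "public"},
         {"statement": "may be used for research, we believe"},
         {}]
print([normalise(i) for i in items])
# → ['CC-BY-4.0', 'CC0-1.0', 'unknown', 'unknown']
```

Everything that ends up as "unknown" blocks any automated reuse decision, which is why "closed, don't touch" is sometimes more useful than silence.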
Now, in terms of the legal complexity, what we try to do is to compute summaries of licenses and terms of service automatically, so that we enable the researcher to accept the terms of service and licenses in one single step, if possible, but also to guide him as to what the allowable licenses of the annotated corpora, of the derived knowledge, might be, based on the legal calculus that is more or less inherent in the legal framework. How do we do it? We have adopted a layered approach, a three-layer approach. First of all, we deal with interoperability between data sets, so we constrain ourselves to the world of content; then we go on with interoperability between processing tools and services, and this constitutes the first layer. In the second layer, we try to see how interoperable the legal framework that governs the content, the data that is mined, is with the legal framework of the processing tools and services. And last but not least, and this last one is a herculean task, we try somehow to standardize the different legal notions, by adopting some sort of rigidity in the semantics behind the legal terms, and we try to do this at the level of the licensing conditions. Last slide: implementation, how we implement the whole thing. We started with a compatibility matrix, so we compare licenses with one another; if you like, we decompose the legal documents into a set of legal primitives, and we try to see how these primitives are used in the different legal documents. Then we follow the Creative Commons good practice, where we have human-readable summaries, we have machine-readable code, but we also have the full legal code of the document.
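The compatibility-matrix idea can be sketched under a deliberately toy assumption: each licence is decomposed into just two legal primitives, and a restriction on any input "sticks" to the derived corpus. The real analysis covers many more conditions and is not legal advice:

```python
# Toy sketch of a licence compatibility calculus: decompose licences into
# primitives, combine the restrictions of all inputs, and list the licences
# under which the derived corpus could be released.

PRIMITIVES = {  # illustrative decomposition into two primitives only
    "CC0-1.0":      {"non_commercial": False, "share_alike": False},
    "CC-BY-4.0":    {"non_commercial": False, "share_alike": False},
    "CC-BY-SA-4.0": {"non_commercial": False, "share_alike": True},
    "CC-BY-NC-4.0": {"non_commercial": True,  "share_alike": False},
}

def combine(licences):
    """Union of restrictions: a restriction on any input binds the output."""
    combined = {"non_commercial": False, "share_alike": False}
    for lic in licences:
        for prim, on in PRIMITIVES[lic].items():
            combined[prim] = combined[prim] or on
    return combined

def allowable(licences):
    """Licences carrying at least the combined restrictions of the inputs."""
    active = [k for k, on in combine(licences).items() if on]
    return [lic for lic, p in PRIMITIVES.items() if all(p[k] for k in active)]

print(allowable(["CC-BY-4.0", "CC-BY-NC-4.0"]))  # → ['CC-BY-NC-4.0']
print(allowable(["CC-BY-NC-4.0", "CC-BY-SA-4.0"]))  # → [] (incompatible)
```

Mixing a share-alike input with a non-commercial input yields an empty list, which is the machine-checkable face of the licence incompatibilities the researcher otherwise discovers the hard way.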
We try to harmonize the vocabulary as much as possible. Just to tell you: even in organizations and frameworks like Creative Commons and the Open Knowledge Foundation, the basic legal primitives are not defined in the same way, and this already creates interoperability gaps. And we try to generate legal metadata that is as machine-readable as possible, so that we enable machine actions in the future. To do that, we do not operate in a vacuum: we use ODRL, a W3C recommendation, and what we essentially do there is again decompose legal documents in a formal manner, using RDF as the representation language. If all this is done, then we hope that one day we will be able to draw the conclusion that open access, which today is a rather ill-defined term, really means free to mine. Thank you very much.

[Moderator:] Thank you very much, Stelios. I think these two presentations are quite complementary, in the sense that Peter basically told us what the necessary work is that you need to do beforehand so that you can share your data meaningfully with others. And I think what is interesting is that Stelios talked a lot about copyright, whereas Peter gave us the personal data dimension, which is something we very frequently neglect but which is increasingly coming through very strongly.