This conference will now be recorded. Hello, can you hear me? Yes, this is Irene. I can see that we are already eight, so I suggest that we wait five more minutes for the challenge champions to connect, and then I will start the meeting with the introduction. Okay, so we're waiting five more minutes and then we will begin. Thank you.

This conference will now be recorded. Okay, hello, let's start the meeting. On behalf of the OpenAir Advanced team, I would like to welcome you to this session. The aim of the session is to give you an overview of the three OpenAir challenges, to explain certain technical characteristics, and to answer your questions related to possible technical or strategic issues. My name is Irene Karacharistou, and today I will manage the discussion so as to keep us in line with our timeline. So, having presented the scope of our meeting, we will start with a presentation. I believe you can see the second slide, the agenda, and how we're going to proceed: after the scope of the meeting, which we discussed already, and the OpenAir Advanced long-term strategy, at the second stage we will proceed with the presentation of the challenges. Our challenge champions are here and they will give the presentations. Last but not least, after the challenges we will listen to additional comments by the challenge champions. Then, in order to conclude the session, I will give you an overall framework of the OpenAir PCP call for tender and the annexes related to the call. The meeting will be completed with a roundtable discussion where you, as participants, can address any question related to technical and strategic issues. So, if there aren't already some questions as far as the overall administration of the meeting is concerned, I can leave the floor to the first presenter, Mrs. Natalia Manola. She is a research associate at the Athena Research and Innovation Center.
She will present to us the OpenAir Advanced long-term strategy. So, that's all from my side, and Natalia, we are ready to hear you.

Hello, hello everyone, and thanks for joining this call. Can you make me presenter, Irene? Yes, of course, one minute. Okay, so now you can see it, okay? Yeah. Okay, I will go very briefly because I assume that quite a few of you have known OpenAir all these years; I will just say what we aim to do with this open call and why we're doing it. So, this is OpenAir. It is a key initiative on open science in Europe, and OpenAir Advanced is the fourth project in a row funded by the European Commission. In a nutshell, what we're trying to do in OpenAir is to build and operate an infrastructure for open and reproducible science, centered on scholarly communication. We have two branches. One is an IT branch, through which we provide services — Paolo will probably explain a lot about these — and the other is to engage with key experts around Europe, as we believe that open science cannot happen without people. So, this is about data infrastructure, about services, about content — whatever research outcomes there are — and it's about interoperability and linking things together. Many of you may have heard about the European Open Science Cloud, which is a key initiative in Europe on how we can provide an environment that is as seamless as possible for researchers to do their research. I'm not going to go into OpenAir at length, but we are a consortium of around 50 partners: open science experts, legal experts, infrastructure providers, IT people, our open innovation colleagues here, and we also have some participants who are doing citizen science. So again, it's about open science experts, IT providers, the data community, and legal experts.
Our key characteristic is that OpenAir is a distributed and participatory infrastructure, meaning that what we are aiming to do is to connect all the different actors across Europe. We started from open access to publications, and this still remains one of our core activities, but we're moving slowly to a holistic view of open science. As an infrastructure, OpenAir has different layers, and it is based on existing initiatives, national or institutional. What I would like to say here, and stress, is that repositories are of key importance to OpenAir, because these are publicly owned services. And as we are building an open infrastructure, of course we want to allow SMEs and commercial entities to be involved — this is why we have the open innovation call — but it has to be on the principles of open science. The third key characteristic is that we support all types of research outcomes, not only publications; we want to have what we call the linked open science idea. Three pillars of action: we work on policies, on training, and on services. This is what OpenAir does; we have three main pillars, and the 50 consortium members are working in these areas. The policies can be at the national level, trickling down to the institutions or to the research infrastructures — the same for training, the same for services. As for how we operate, we work on three different levels. The national approach is our key approach. Then, in Europe, we are approaching thematic research infrastructures, meaning that if there is a research infrastructure in a specific domain, let's say social sciences or humanities, we work with them, because they are very well organized across countries in Europe. And then, of course, we have the global alignment.
We have services for all stakeholders, and I have here the B2C — business to client — and B2B — business to business — views. So, if you think about researchers, funders, or providers as clients, we have support at the first level, where we're trying to reach researchers and scientists. We have four services here. One is Zenodo, which is a repository hosted by CERN and co-funded by CERN and OpenAir. Then Amnesia, which is a service for data anonymization. And we're now in the process of developing a data management plan service. Those are the three IT services that we have for our researchers, but we also have a pan-European help desk and a distributed training service. The next level, the B2B level, is where we are targeting the content providers, and there we have four services. ScholExplorer is one service, which provides data-to-literature and data-to-data links. Then the interoperability guidelines: content providers register with OpenAir, and we validate and register them against the interoperability guidelines. We also have the usage analytics and the open access broker — I think Paolo will go into more detail. And then we have the value-added services built on top of what we do, which are about research analytics, monitoring dashboards, and APIs. In a nutshell, this is what we do in OpenAir. My final slide is on the technical side, because we're not going to go into the training and the support and help desk. We have a few strategic priorities. The overall strategic priority is to change scholarly communication towards openness. What we believe is that there should be services for all stakeholders: scholarly communication involves researchers, institutions, publishers, data sources, content providers, funders, and policymakers, and an open scholarly communication infrastructure like OpenAir should have services targeting any of these stakeholders.
Then, what we have found is that even though we are talking about open science, researchers really have a problem in taking it up, and we believe that one of the reasons is that there are no services that meet their needs. So we're looking for innovative ways of engaging researchers in open science through services tailored to their needs. Another strategic priority is interoperability. Because content providers are a key stakeholder for us, and we are talking about seamless access and seamless integration — or any other buzzword we can think of — interoperability is of key importance. We do not want silos. So, how do we build services that allow us to move towards a seamless space? And then, once we have open science — open access publications, data, software, whatever it is — how can you enable intelligent and contextual discovery? This is one of the things that is just around the corner, especially with artificial intelligence and the buzz around it. The two other strategic priorities we have are monitoring and interactive visualization. Monitoring targets funders and policymakers: they are giving a lot of money to open science, but how open science develops relies a lot on how their policies go along and on the commitments they are able to make, so monitoring is key to them. And of course, as we are creating this huge graph of millions and millions of records all linked together, we need to have services that embed quality in this OpenAir graph and in the processes behind it.
And this was the last slide. What I wanted to say is that we have this open innovation call because, until now, OpenAir gets its funding through the European Commission, through project funds, and it's very hard for us every time to include new partners or new services, because we are paid for operation. So we're trying to find a way, through these open calls, to include innovative players who can participate in shaping up this infrastructure — in a way that they can commit, but where they can also start thinking of business models and ways of having sustainable services which go beyond project money. And I think this is all I wanted to say for the moment.

Thank you very much, Natalia, for your presentation. Okay, in order to stay in line with the agenda, I believe we can proceed to the presentation of the challenges, and if the participants have already taken some notes, they can address their questions at the end of the overall presentation. If I'm correct, we will start with challenges two and three, in order to give the floor to Mr. Paolo Manghi, because he has to leave earlier. If this is the case, then I can make you presenter, and we can start with the presentation of challenge two. We cannot hear Paolo. Now, can you see the slides? Yes, personally I can see the slides and I can hear you perfectly.

So, thank you for coming, and thanks, Natalia, for the introduction. I will just follow up with challenge two and challenge three. The idea is to present the technical layers that we have been building in OpenAir over the last ten years. I'll go through them quickly — it's like a view from the moon — but you can stop me anytime. You can also write questions, just in case I'm not going into enough detail to answer some of them. Okay. First of all — this is not working — first of all, yes, okay.
Let me first give you an idea of what we've been doing. The focus of OpenAir has been for a long time that of monitoring how scholarly communication is performing. It's about aggregating metadata about scientific products collected worldwide, from tens of thousands of sources, in order to be able to monitor how the open access and non-open-access trends across science are performing. The original question came from the funders, especially the Commission, who wanted to monitor and supervise the effects of the open access mandates on the scientific community. Then things changed: open science came over and, in fact, took over, so the space of interest for us and for the scientists was no longer focused only on scientific publications but extended to other kinds of scientific products, like data, software, and so on. OpenAir followed up along the same line: we developed services that are now collecting metadata about all these kinds of products. On the one hand, we are trying to collect everything that is trustworthy enough — endorsed by the communities and by the scientists — in terms of metadata, and to build the connections between these objects: we try to connect publications to data, publications to software, or to collect links among these products when they exist and are provided by the original sources and by the scientists. On the other hand, we also look at the other side of the moon: our main customers in the beginning were the scientists, together with the organizations and the funders. We are also collecting information about funders at the national and European level, but also across the oceans, so we have funders from Australia, from the US, and from several nations in Europe.
So we also do our best to collect information about projects, at the granularity level of the project, so that we can glue the funding efforts — the projects themselves and the grants — with the results of such efforts. So, we're basically building a graph, the graph that Natalia mentioned before, which can serve scientists on the one hand, to publish science in an open science fashion — all digital products interconnected — but which can also serve funders and institutions in monitoring how science is doing and, for example, the research impact of their efforts. If you look at the picture at the bottom, you get an idea of the kinds of sources we are collecting from, which range from publication repositories and data repositories to registries of information like ORCID for the authors or grid.ac for the institutions — these are databases that contain identifiers and information regarding key entities in the scholarly communication domain, like, again, authors and institutions — but we also collect from software repositories, publishers, and several others. And we try to impose as much as possible what we call guidelines — this will probably be treated in challenge one. The guidelines are basically instructions on how to export the metadata about the scientific products, to be adopted and implemented by the content providers in order to align the description of scientific production as much as possible. These have actually taken off quite well in Europe and beyond, especially on the side of publication repositories: several platforms today implement the OpenAir guidelines, so we can count at least on a first level of homogeneity.
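To make the guidelines idea concrete, here is a minimal sketch of what an aggregator does with a Dublin Core record exported under the `info:eu-repo` vocabulary that the guidelines build on. The sample record and the helper function are invented for illustration; they are not OpenAir code.

```python
import xml.etree.ElementTree as ET

# A minimal OAI-PMH/Dublin Core record of the kind a repository
# implementing the guidelines would expose (invented sample).
SAMPLE = """<record xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <oai_dc:dc>
    <dc:title>A study of marine plankton</dc:title>
    <dc:type>info:eu-repo/semantics/article</dc:type>
    <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  </oai_dc:dc>
</record>"""

NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def parse_record(xml_text):
    """Extract the few fields an aggregator needs from a DC record."""
    root = ET.fromstring(xml_text)
    get = lambda tag: (root.findtext(f".//dc:{tag}", namespaces=NS) or "").strip()
    return {
        "title": get("title"),
        "type": get("type").rsplit("/", 1)[-1],          # e.g. "article"
        "open_access": get("rights").endswith("openAccess"),
    }

rec = parse_record(SAMPLE)
```

Because every compliant repository exports the same vocabulary, one parser like this yields the "first level of homogeneity" Paolo mentions.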
But we're also collecting from what we call research infrastructures. These are huge investments through which the Commission is today investing in the communities, so that they can develop their own thematic digital services to perform science in a specific domain. Even there we collect metadata about objects and include them in our graph — for example, virtual appliances, or virtual machines in general, which are described as scientific products representing, for example, an experiment: software and data together, and so on. Now, when we collect those objects, we build the graph that you see drawn here. Of course, this is a very high-level view; for the data model you have to rely on the documents we've published in Zenodo — you can find them there. We perform a number of actions, from harvesting to cleaning — basically harmonization of the metadata to converge to a common data model, the OpenAir data model — we deduplicate those objects, and so on. And we are also brokering information around. This means that, since we know where each piece of information comes from — for example, a record from a given repository — if we can perform any action that enriches this record, we can provide the information back to the original source, so that they can enrich the record locally. On top of this information space that we build, we have several kinds of standard APIs that people can use, but which we also use to provide the services that you see on the left — Connect, Explore, Provide, Develop, and Monitor — which are devised to serve the specific kinds of customers that you can see written there. I'll go into the details later on. Now I want to go just one layer down — not enough to get your hands into the code, but it gives you an idea. We have five subsystems in OpenAir. A key one, in the bottom left corner, is the aggregation subsystem.
Here we have built a number of services that allow a team of people to include new data sources in the aggregation process, collect the metadata, and transform and clean the metadata — so we have this separation between native records and transformed records. We can also collect the files, the PDFs: we are today reaching around 10 million PDFs, and we extract the full text from these PDFs. On the right, you see the information inference subsystem. This is where we run mining algorithms — text mining — over the files, in order to find information that is not available in the original metadata. This comes down especially to links: links between publications and datasets, publications and projects, publications and software, and so on. This is actually one of the real added values of OpenAir today, especially in terms of monitoring: links are often not there — they are within the text but not in the metadata — and we factor them out and put them in the graph. Then we have the data provision subsystem, which is the place where we materialize the graph for the first time. The graph is stored in HDFS, actually in an HBase installation; it's a cluster with 12 machines, and that's where we process the graph. We process the graph to add the information that we infer in the information inference subsystem — so we add the links that were not in the original metadata — but it's also the place where we integrate the results of the deduplication subsystem. The deduplication subsystem splits the graph into, let's say, flat collections — for example, the publications, the datasets, the software alone — and performs deduplication within those collections, trying to identify the groups of objects that are similar to each other — equivalent, in fact.
This information is fed back to the data provision subsystem, which creates, out of the similar objects, one object, which is in fact the representation of all the enrichments that we can collect from the equivalent objects we had — and that also includes all the links that were outgoing from the objects we had to merge. In this way we respect the topology of the graph. We disambiguate it, because we are providing statistics, after all, and duplication is a bad thing. And we go on and produce the deduplicated and enriched graph. This graph still resides in HDFS, in the cluster, so it's not really queryable. From there on, we produce yet other representations of the graph in different backends, depending on the kind of services we want to provide. For example, we produce a Linked Open Data representation, and a full-text-index representation, which is the one you can access from the portal. An OAI-PMH backend is also built, and statistics are built in a dedicated database, so that the front-end services — Connect, Explore, Provide, Develop, and Monitor — can take advantage of all this. Even this process is, of course, quite a delicate one: since we are producing the graph in different backends, and this is really big data, we need to make sure the results in the different backends are aligned. We also need to make sure that this works like a distributed transaction: we need to wait for all these processes, which are run in parallel, to complete, and check that the content inside the different backends is aligned in terms of quantities. Once we are fine, we switch: basically, everything that is in production goes shadow, and everything that is shadow goes into production. We use open source technologies all over the place, and we are mainly based on Java.
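The merge step just described — folding a group of equivalent records into one representative that keeps every enrichment and every outgoing link — can be sketched as follows. This is a toy model with invented field names, not the actual OpenAir data model.

```python
def merge_equivalents(group):
    """Fold records judged equivalent into one representative, keeping
    the union of enrichments and of outgoing links (toy model)."""
    rep = {"merged_ids": [], "pids": set(), "links": set()}
    for rec in group:
        rep["merged_ids"].append(rec["id"])
        rep["pids"] |= set(rec.get("pids", []))      # union of enrichments
        rep["links"] |= set(rec.get("links", []))    # redirect outgoing links
    # one possible policy: keep the longest title as the representative one
    rep["title"] = max((r["title"] for r in group), key=len)
    return rep

# Two copies of the same article: one from the publisher, one from a repository.
a = {"id": "pub::1", "title": "Ocean acidification",
     "pids": {"doi:10.1/x"}, "links": {("cites", "pub::9")}}
b = {"id": "repo::7", "title": "Ocean acidification (preprint)",
     "pids": {"handle:123/45"}, "links": {("isSupplementedBy", "data::3")}}
rep = merge_equivalents([a, b])
```

Because the representative keeps the union of both records' outgoing links, the topology of the graph is preserved, as Paolo notes.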
We use Java and a little bit of Python — and a little bit of Perl here and there, as a reminiscence — but the majority of it is Java: about three million lines of code. I won't stop here very long, but we have quite a large infrastructure, and what we hope is that the results of your projects will be integrated here. So you will probably be asked, in a further, final integration process, to help us integrate your results into our production system, if these are consistent and really provide added value to what we do. We are about 40 designers, engineers, and developers — just to give you an idea that what we are providing is not a double-click installation system — and this includes also people from the data center we rely on in Poland. Now, quickly, on the open science graph: we are building this big graph thanks to text mining, thanks to harvesting metadata, and thanks to the deposition of objects in Zenodo, which Natalia mentioned before. Zenodo is what we call a catch-all repository: you can store in there software, datasets, publications, and links between them, associated to projects, associated to communities — you can do several very nice things in there. It's open, and it ensures preservation for the future, which is key for those who don't have a repository of their own. And it's free up to 50 gigabytes, which is very good; if you want to go beyond that, then you have to contact Zenodo. 50 gigabytes is enough in the majority of the cases. So we build this graph.
In the case of harvesting, we are trying to do a fine-grained collection of content from the original repositories. For those who are closer to this kind of activity it's simpler to understand, but the point is that, in general, a publication repository is regarded by all these aggregation systems as a container of publications. This is not really the case: its content can be split into different kinds, like datasets, software, and so on. And the same goes for data repositories — for example, in Figshare the majority are datasets, but it also contains articles. So, in our aggregation system, we try to make this distinction at the fine-grained level of the records and their typologies, and this is key to know. Okay. Transformation workflows are basically made of two steps, collection and transformation: we keep a copy of the native XML, and we have the cleaned version of the XML. The cleaned version of the XML is the one that is then thrown into the graph. Something key to know is that each of these data collection workflows is associated with one specific data source, so we have the history of all the executions of the workflow relative to one given data source. This is very important, because we have the numbers and we can perform quality checks here — so any solution of this kind that may come to your mind is welcome. We are also in the process of moving from XML to JSON, and this is something that you have to take into account. XML is the lingua franca in the context of repositories, unfortunately, and internally we use XML for the transformation workflows, but we'd like to move to JSON. So this is, again, another input you may find interesting.
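The two-step workflow — keep the native record verbatim, derive a cleaned record mapped to the common model, and log every execution per data source — can be sketched like this. The type vocabulary, field names, and history shape are invented for illustration.

```python
# Invented mapping from source vocabularies to common product types.
TYPE_MAP = {"journalArticle": "publication", "Dataset": "dataset",
            "Software": "software"}

def transform(native, source_id):
    """Derive a cleaned record; the native record is kept untouched."""
    cleaned = {
        "title": (native.get("title") or "").strip(),
        "type": TYPE_MAP.get(native.get("type"), "other"),
        "provenance": source_id,          # every record remembers its source
    }
    return {"native": native, "cleaned": cleaned}

def run_workflow(source_id, native_records, history):
    """One execution for one data source; runs are logged so per-source
    numbers and quality checks remain possible afterwards."""
    out = [transform(r, source_id) for r in native_records]
    history.setdefault(source_id, []).append({"records": len(out)})
    return out

history = {}
recs = run_workflow("repo::42", [{"title": " A title ", "type": "Dataset"}], history)
```

Keeping the native copy means a broken transformation can be rerun without re-harvesting, and the per-source history gives exactly the execution numbers Paolo describes.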
And monitoring is key, as I mentioned before. We track provenance, so we know where each record comes from; we track the IDs of the workflows; we track the types of products we have; and so on. If we lose control of all these things, it's really hard, for example, to detect where an error comes from at the granularity of the record, and we have to go back and do these things manually. This happens, and it takes a long time, and it's not really what we would like to have — we have these things in the roadmap, but, as Natalia said clearly, we cannot implement everything, so this is another suggestion for you. Content comes from all possible data sources out there that are trustworthy enough. What I wanted to highlight here is that we also have this notion of country and community that we rely on, and that we believe is actually key for the kind of services we are providing. We want to tag the entities in our information space as being part of one or more communities, or as being, let's say, attributable — not sure this is an English word — to a country, meaning that a scientist of that country has provided input to build that product. This is actually key to building aggregation and research impact monitoring services at the level of the nation. So these are two schemes you have to take into account, which are realized, implemented — they already exist in our graph. A country is linked to a funder, to a source, to an organization, to a product; and a community is linked to a product, to a project, to a source. And we apply all sorts of propagations of this content. So, if we know that a source belongs to a country — if, for example, it is an institutional repository, so it contains objects produced by scientists of that specific institution — then we can reasonably claim that all products in that source belong to the same country. So you can also exploit this kind of activity.
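The source-to-product country propagation just described can be sketched as a small function. The identifiers and the data shape are invented; this is an illustration of the idea, not OpenAir code.

```python
def propagate_country(source_country, contains):
    """Given source -> country and source -> [products], tag every product
    with the country of the source that hosts it. A product harvested from
    several sources accumulates several candidate countries."""
    product_countries = {}
    for source, products in contains.items():
        country = source_country.get(source)
        if country is None:          # source with no known country: skip
            continue
        for product in products:
            product_countries.setdefault(product, set()).add(country)
    return product_countries

# An institutional repository known to be Greek (invented identifiers).
source_country = {"inst_repo_uoa": "GR"}
contains = {"inst_repo_uoa": ["pub::1", "data::2"], "unknown_src": ["pub::3"]}
tags = propagate_country(source_country, contains)
```

The same shape of rule works for communities: replace the country label with a community label and the propagation logic is unchanged.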
Deduplication: we are running deduplication, and these are the kinds of numbers we have. We start from hundreds of millions and go down to much less than that, both because we have specific criteria for the kind of deduplication we want to apply and because we have a very high ratio of duplication. Consider that the same author deposits in at least one or two places, if you include the publisher and the institutional repository, and in case of co-authoring this is multiplied by several co-authors. This is why we reduce the numbers to that level. Deduplication is an issue at the level of data, software, and organizations, so any solution in this context is more than welcome. Mining: as I mentioned, we have an HDFS where we store these 10 million full texts, and we have a rigorous way to accept algorithms that can run on that cluster. So what we are asking you, if you're going to work in this scenario, is to produce an algorithm that is Java-based, that performs text mining operations, and that provides output that is of interest. You can do it on your own test workbenches, but at some point it will have to be integrated into our system, and this requires at least that the algorithm can be easily parallelized, for example — so take that into account when you do your work. There are several things you can do; what we do already serves as a suggestion, but much else is possible. For example, we can find the semantics of a link from one object to another, or we can identify links from a publication to another object. Any such action that brings more content — content that can be used to provide better services — is welcome. Context propagation is what I mentioned before. These are just two examples of how a notion can move from one object to another, thanks to the fact that a relationship between the two allows you to do so.
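The "easily parallelized" requirement essentially means a pure per-document function with no shared state. Here is a minimal sketch, in Python for brevity (the integration requirement stated above is Java); the grant-ID pattern is invented, not OpenAir's actual miner.

```python
import re

# Invented pattern: a six-digit grant number preceded by "grant" or "project".
GRANT_RE = re.compile(r"\b(?:grant|project)\s+(?:no\.?\s*)?(\d{6})\b", re.IGNORECASE)

def mine_links(doc):
    """Pure per-document function: no shared state, so it can be mapped
    over the full-text store by any parallel framework (Hadoop, Spark...)."""
    doc_id, text = doc
    return [(doc_id, "isProducedBy", grant_id)
            for grant_id in GRANT_RE.findall(text)]

docs = [("pub::1", "This work was supported by grant no. 654321."),
        ("pub::2", "No funding statement here.")]
links = [link for doc in docs for link in mine_links(doc)]
```

Because `mine_links` touches only its own input, the outer loop can be replaced by a distributed map over the 10 million full texts without changing the function.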
And we have the semantics, which is the one we adopted from DataCite, and we can take this into account in applying context propagation. This is how you can access the graph. If you go to develop.openaire.eu, you will find several ways to access it: via APIs, via bulk access, via Linked Open Data. We also have two dumps, DOIBoost and ScholExplorer, so you can collect a lot of content. DOIBoost spins around the publications in Crossref: it's an integration of Microsoft Academic, Unpaywall, Crossref, and ORCID. It's a big effort. You can also find the software that produces it, together with the dump, so if you want to take advantage of the software, you're more than welcome — everything is published in Zenodo again. I'll share these slides with you, so this can only help you understand better. There's also a paper describing the content, the software, and the data, which you can read for your benefit. ScholExplorer is a service that we have built to serve mainly publishers and all those stakeholders around the publishers. We collect links between articles and datasets from publishers, from data centers, and so on. We have around 120 million links, bilateral, so in both directions, and we have APIs that allow any kind of service to quickly resolve the persistent identifiers of these objects. This means that if I have a persistent identifier, like a DOI of a publication, I can go to the ScholExplorer APIs and ask for the objects linked to it, together with the semantics of the links. It's based on a standard called Scholix, which we have defined together with other people in the domain. It's pretty easy to use and can give you a lot of advantages. And also for ScholExplorer we have a dump — we have just produced a new one, as I can see from the numbers — so you can download it and play with it to build your solutions.
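A DOI-to-links lookup against the ScholExplorer API might be sketched as below. The endpoint and parameter name reflect the v2 API as I recall it, so treat them as assumptions and verify against the current documentation; the link-record shape in the example is a simplified, invented miniature of a Scholix-style record, not the exact schema.

```python
from urllib.parse import urlencode

# Endpoint and parameter name are assumptions; check the current docs.
BASE = "http://api.scholexplorer.openaire.eu/v2/Links"

def links_query(doi):
    """Build the query URL for all links whose source has the given DOI."""
    return BASE + "?" + urlencode({"sourcePid": doi})

def link_summary(link):
    """Reduce a Scholix-style link record (shape simplified here) to a triple."""
    return (link["source"]["pid"], link["relationship"], link["target"]["pid"])

url = links_query("10.5281/zenodo.12345")
sample = {"source": {"pid": "10.5281/zenodo.12345"},
          "relationship": "IsReferencedBy",
          "target": {"pid": "10.1000/example.article"}}
triple = link_summary(sample)
```

The real response carries richer fields (identifier schemes, provenance, publication dates); the point is that one HTTP call resolves a PID to its linked objects together with the link semantics, as described above.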
So, going back now to the services: I gave an overview of the graph, how we build it, some of the open problems, and some of the things you can do with it. Of course — yes, I will share the slides — you're more than welcome to ask me more questions by mail; I cannot say everything, and, as I said before, I may have been more specific on one aspect at the expense of another, so please let me know what I can do. I want to go quickly through the left side of this picture — Connect, Explore, Provide, Develop, and Monitor; actually, Explore won't be covered. Before that, I will answer a question from Charles. Charles, you said that full texts were extracted from the PDFs — are they available? No, we cannot release the full texts; that's part of the agreement we have with the sources that are providing us with the PDFs. What we can try to do — but this will probably take long — is to ask whether the full texts that we extract can be shared, because they really care about the PDFs: they want to be the only ones distributing the PDFs around, and this is very reasonable, because they need to count the hits and the downloads. Maybe for the full texts they can make an exception, but it's a very long process, and we cannot really promise anything in this respect, because we would need to go back to them, and they're hundreds — so it's not so obvious. Maybe we can. And Natalia, a question for you: we could go through OpenMinTeD and see if there's something that we can do there, taking advantage of agreements you already have in the context of that project. But we can discuss that afterwards. You're welcome. Another question: do you take into account references in the publications? Yes, we also extract references from the publications. For those for which we have the full text, we extract the references, and we try to link them to our own objects when this is possible.
So we also find the link from the reference to an object in our domain, and we make them available. We are now in the process of sharing this with OpenCitations, the open citations graph. Okay, Connect. All the services that we build on the left are basically views of the graph — views that are driven by the kind of customers we're serving. Connect targets research communities. I won't go into the detail of what a research community is, but consider it as a group of scientists with common scientific interests. This can be easily identified when we talk about research infrastructures — at least they give us a straightforward definition. It's less obvious when communities live beyond the research infrastructures and are just out there in the loose, so this is not so obvious; but for those that we are serving, often there is a research infrastructure behind them, or some kind of strong initiative. For them, we build views of the graph, where the basic idea is that we try to tag objects in the graph based on whether they pertain to the community, and we do this automatically — or, of course, we can do it by having users come and say: these objects belong to my community. You can see screenshots here. For example, this is the European marine science research community, and you can see we have thousands of publications, research data, and software which have been identified as being part of, or of interest to, this community. So this is really about sharing and discovery. Okay. The criteria for inclusion can be configured by the administrators of these research community dashboards, who can fine-tune our tools to include objects in the view. This can be done by identifying subjects — subjects from known vocabularies, for example MeSH or the ACM classification, and others.
This can also be done by provenance: you can list the data sources that contain objects of that community, plus criteria to filter out some of the objects that are in the data source but do not belong to the community. We can also work with a list of communities in Zenodo. Zenodo organizes its own information space by community: a community can be created in Zenodo, and whenever you deposit an object you can make it part of that community. Zenodo communities are different in this respect from the ones in OpenAir, even though they have the same name, which doesn't simplify things. What you can do in an OpenAir research community dashboard is to specify which Zenodo communities' objects pertain to the research community in OpenAir. Then, of course, projects: you may have projects that are strongly bound to a research community, and the administrator can list those. Roughly speaking, every object that belongs to, is associated with, or is funded by one of those projects belongs to the community. And then we can also propagate, again, as I mentioned before, via relationships. So a publication that is supplemented by data and software means that, if the publication belongs to the community, then the data and the software belong to the very same community. With a supplemented-by relationship you can propagate a lot of semantics, of course. For research infrastructures, the aim of the communities is to measure impact, when we focus only on the objects that have been produced thanks to the existence of the research infrastructure: how useful is the RI for the world. But they can also focus on discovery, when this constraint is relaxed and we include basically all objects that pertain to certain scientific aspects. The funder dashboard has to do with Monitor and, of course, serves the needs of funders, and funders basically want to know how much has been produced.
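The propagation along supplemented-by relationships described above can be sketched as a small fixed-point computation over the graph. This is an illustrative sketch only, not OpenAir's actual implementation; the record structure and the relation name `isSupplementedBy` are assumptions made for the example.

```python
# Sketch: propagate community tags along "supplemented by" relationships.
# If a tagged publication is supplemented by a dataset or software, the
# target inherits the tag (hypothetical, simplified graph model).

def propagate_tags(objects, relations):
    """objects: {id: set of community tags};
    relations: [(source_id, relation_name, target_id), ...]."""
    changed = True
    while changed:  # repeat until no tag moves (fixed point)
        changed = False
        for src, rel, dst in relations:
            if rel == "isSupplementedBy":
                new_tags = objects[src] - objects[dst]
                if new_tags:
                    objects[dst] |= new_tags
                    changed = True
    return objects

graph = {
    "pub1": {"marine-science"},   # tagged publication
    "data1": set(),               # its supplementary dataset
    "sw1": set(),                 # its supplementary software
}
links = [("pub1", "isSupplementedBy", "data1"),
         ("pub1", "isSupplementedBy", "sw1")]
propagate_tags(graph, links)
print(graph["data1"])  # {'marine-science'}
```

The loop runs until no tag changes hands, so tags also flow through chains of supplements, which mirrors the "propagate a lot of semantics" point made above.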
How much has been produced, by which project, by which funding stream when that applies, and, especially, what was the open access and open science profile of what has been produced? For example, how many research data have been produced? How many of these are open? What's the ratio of linking with publications, which is key to better interpret the data? And the same for software, and so on. These are all aspects that we're trying to monitor. We are serving tens of funders; you have the list on the right, from international funders to European funders, and they're queuing up, in fact, because the services are very useful. Again, funding impact regards publications, research data and so on, published thanks to grants awarded, plus open science and open access impact, as I mentioned before. There are several things you can do, like trends in research fields, open access versus open science behavior, the ability to attract cross-funder grants and projects, and so on. You can take a look at these slides afterwards; I won't go into the details, but these are all open questions that we have in mind and that may suggest something to you. Provide. Provide is a toolkit, a dashboard for content providers. If content providers want to give content to OpenAir, they need a place where they can find everything regarding their registration and their activities as part of the project. As Natalia said, we are participatory, so we live thanks to repositories: repositories give us content, and we want to give something back in terms of the quality of what they have and, in general, in terms of service improvements. There are four things that we do. Of course, they can register, and they can validate and monitor the quality of their information.
When we harvest, we can rank the quality of the metadata we have collected against the OpenAir guidelines that providers are supposed to comply with. We can enrich their content: as I mentioned at the very beginning, we can find information that is important or relevant for the repositories and send it back to them. And we are also building a framework for usage statistics, which is already in place. Repositories can send us statistics, for example downloads and views of their content, which we aggregate at the central level in OpenAir. So for an article we are able to know, for example, how many times it has been downloaded across different sources, and this information can again be sent back to the original repository. This is how the service looks: data source registration, validation, content enrichment, content notification, and data source usage metrics. No need to go through it here; you can take a look at it. You can validate a literature repository, a data repository, or a CRIS system today; these are the three kinds of data sources that we are serving. So if you're the owner of one of those, you can come to OpenAir. If your repository is registered in OpenDOAR, for literature repositories, or re3data.org, for data repositories, you can claim you are the repository manager, and you can start testing how it matches the OpenAir guidelines. Once it's in the process of aggregation, you can check how many times it has been harvested, which errors were produced, and so on, and all this kind of services. Most importantly, you can access the broker, and the broker is again a way to provide information back to the original repositories.
In simple words, what we do is this: since we know that a record comes, for example, from repository A, and we collect the same record from repository B, when we deduplicate these two records and put them together, or when we attach to the record extra information that we have mined, for example a link to a dataset, we can easily calculate the diff between the original record in repository A and the rest. Everything that is extra, we can send back. We create an enrichment event; that's what we call them. We have two kinds of events: MORE and MISSING. MORE means that you already have values for a field, but not the one that we found; MISSING means that you didn't have that value at all and we found one. And these are effectively subscriptions. So you can search the space and see what kind of events we generate for your repository, and then you can subscribe to specific ones. When you do that, we send you emails whenever we generate new content, saying: look, this is new for you; if you want, you can go and download it. For example, links to projects, or a DOI you don't have, or abstracts that you don't have; these are quite common types of information that we return to the repositories. For the metrics, there's a whole framework for that. You can find information online on the OpenAir website; if you go to Provide, you will find it. It is based on standard software that has been devised exactly for that, and you will find an explanation of how to install it and how to participate in what originally was a pilot.
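The MORE/MISSING diff described above can be sketched in a few lines. This is only an illustration of the idea, under assumptions: the record layout (a dict of field-to-values) and the event shape are invented for the example, not the broker's real data model.

```python
# Sketch of the broker's enrichment-event logic: diff an enriched record
# against the original harvested record and emit MORE/MISSING events.
# Field names and the event tuple shape are hypothetical.

def enrichment_events(original, enriched):
    events = []
    for field, values in enriched.items():
        have = set(original.get(field, []))
        for value in values:
            if value in have:
                continue  # nothing new for this value
            # MORE: the field already had values; MISSING: it had none
            kind = "MORE" if have else "MISSING"
            events.append((kind, field, value))
    return events

original = {"title": ["A study"], "subject": ["ocean"]}
enriched = {"title": ["A study"],
            "subject": ["ocean", "marine biology"],  # extra value -> MORE
            "doi": ["10.1234/abcd"]}                 # absent field -> MISSING
print(enrichment_events(original, enriched))
# [('MORE', 'subject', 'marine biology'), ('MISSING', 'doi', '10.1234/abcd')]
```

A repository would then subscribe to, say, MISSING `doi` events and receive only those notifications by mail, as described above.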
Now it's a production service, so repositories, or the repository platforms that you develop or maintain, can be equipped with what is necessary to send the statistics to OpenAir. This is an example of the statistics, the way we collect them: for this specific repository, how many page views, how many articles. This is an overall view for one single repository, but you can also have statistics, of course, at the level of individual records. Just forget about this slide; I think it's just more information about the kinds of events we have, subscription and notification. Okay, I think I'm done. I have two questions here. Is ResearchGate included? No, ResearchGate is not included and will not be, most likely, for policy reasons in general. They tend to be borderline in terms of copyright and rules, so there's a lot of debate about it, which we want to avoid entirely and stay outside of. Are annotations data or literature? Can you be more specific here? What are the annotations you are referring to, Neil? Hello, by the way. Neil, do you mean as a type of object? We cannot hear you if you're saying something. Yes, Neil, we cannot hear you. But if you are referring to typology: we are not collecting annotations directly. Annotations, in a context like OpenAir, would be attached to the data or the literature as external entities, not considered as data or as literature themselves. Then, of course, if you have a collection of annotations that you want to use for your own experiments, you can publish it as a dataset; that is always possible. Also a group of articles, like the full texts that we collect: if it were possible to publish that, it would be a dataset, of course. Yes, yes, of course.
If you publish annotations as research output, intended as an object that could be processed by experiments, in this case you would treat the collection of annotations as a dataset: you would give it a name, you would publish it wherever you want so that others can share it and link to it from their articles, and so on. Well, no, it really depends on the final usage you have in mind. Any object can be considered as belonging to different categories; what differs is the way you publish it. If you publish it for reading, because it's narrative, it's literature; there's nothing bad in doing this. If you publish it for other reasons, or you refer to it for other reasons, then you can give it another hat, if you want to make sure that your science is repeatable, for example. So in general, it's much better to make a distinction in terms of typology, so that your relationship to the object becomes more meaningful. Or you capture the meaning in the relationship itself, but then you're building a much more complex space. In this case, for example, I suggest people duplicate the information: create an object that is literature and an object that is data, even though it's the same, to simplify the way people will access it. Of course, this is tricky. It is tricky in general, but this is not the kind of problem we are trying to solve. It could be, yes, it could be both. Paolo, just for the record, can you repeat the question, for the people who are going to listen to our recorded discussion later? Yes: the question from Neil was whether annotations should be considered research data or literature. And this is a tricky question, of course, and it's a problem that comes up over and over again, because, for example, when annotations are part of the text itself, in a critical edition, for example, then it's really hard to say whether this is literature or data.
I would say in general, and this was my reply to Neil, that when we publish this kind of information, we should try to make a clear distinction for those who will reuse it afterwards. If your annotations are there to be used as input or data to a process, then it's much better if you give them this type; if they are intended for reading, for literature, then it's much better to give them that kind of type, to the point that you may have copies of the same object which differ in type but not in content. In general, I would tend to simplify and privilege reuse. But this is not a general rule; it's just my own, not so relevant, opinion. Then another question: how do you treat different kinds of open access, gold, green? That's a great question. We don't deal with them explicitly, because we have access rights in general. Access rights, according to our vocabulary, are quite simple: open, closed, restricted, embargoed. There are combinations of things that allow us to draw conclusions about whether something is gold or green, accessing Sherpa Romeo, for example, which is a service that provides, for different journals, the kind of license that lies behind the journal. But this is not always obvious. For our own purposes, we produce statistics of this kind, and it's not obvious, because types are misused most of the time. Consider, for example, arXiv. arXiv has preprints and postprints, which would help a lot to make this distinction, but when they export their data, everything is just preprints; they don't make any distinction. And we're talking about millions of objects, right? This would really simplify our life. Thank you very much, Paolo, for your presentation. If there are other questions concerning challenge two and challenge three, I believe we should answer them now, since you're still here with us. If there aren't any questions, then I will make Mr. Johan Schirvagen the presenter for challenge one. Okay, I believe that we are okay.
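The combination of evidence described above, simple access rights plus journal information of the kind Sherpa Romeo provides, can be illustrated with a toy classifier. The decision rules and record fields below are invented for the example; real gold/green classification is, as noted above, much less clear-cut.

```python
# Toy open-access "colour" classifier, in the spirit of combining
# access rights with journal evidence. Rules and fields are invented
# for illustration only; real-world classification is far messier.

def oa_colour(record):
    if record["access_right"] != "open":
        return "closed"       # embargoed/restricted/closed: not OA (yet)
    if record.get("journal_is_oa"):
        return "gold"         # published openly in an OA journal
    if record.get("hosted_by") == "repository":
        return "green"        # open copy lives in a repository
    return "unknown"          # open, but evidence is inconclusive

paper = {"access_right": "open", "journal_is_oa": False,
         "hosted_by": "repository"}
print(oa_colour(paper))  # green
```

The `unknown` branch is the honest part of the sketch: as the speaker says, when types and licenses are misreported upstream, no rule set resolves every record.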
Thank you very much one more time, and I will change the presenters now. Okay. Hello, everyone. I will give an oral presentation and I will share a summary with you after the call, so I have nothing to share; I don't need to share my screen. The challenge on next generation repositories is actually related to an initiative of the Confederation of Open Access Repositories (COAR), which established a working group almost three years ago with the aim of overcoming several issues of repositories lacking web-based integration with other innovative scholarly services. The goal was to identify new or missing functionalities and technologies in repositories, and also to change the focus from repositories as hosts of scholarly items to the items themselves, and to make them more web-centric. The idea was also to consider the potential of repositories as a globally distributed knowledge network, which can proactively help to promote a transformation of the scholarly communication ecosystem. This means that the focus of this challenge is really on the repository level rather than on the, let's say, traditional journal level. In the context of OpenAir Advanced, we picked a few functionalities or behaviors based on the recommendations of this COAR working group, focusing mainly on aspects of resource discovery, navigation, and content transfer; secondly, on open metrics, trying to follow approaches that include but also go beyond usage statistics in order to assess the impact of scholarly works. The third aspect was the annotation of content. While this was already discussed when the issue was raised by Neil, what we mean by annotation here actually has a broader meaning. Originally, the annotation service in OpenAir will be based on the results of open peer review, on comments, and on annotations, but it could also include events or activity streams, for instance from social networks dedicated to researchers.
Coming to the first aspect, resource discovery, navigation, and transfer, we are still at an early stage here. We focus on two technologies, ResourceSync and Signposting. With regard to ResourceSync, OpenAir implemented a client that can connect to and collect resources from the CORE aggregator in the UK, which has established a ResourceSync endpoint exposing lists of resources, with metadata and full-text links, from a few publishers of open access journals and articles. In the previous phase of OpenAir, we also had an open tender call for services. In this context, some platforms were supported with an implementation of Signposting patterns. This was done by 4Science, supporting, for instance, OJS versions 2.4 and 3.1, with Signposting patterns to expose information on the authors of a publication, its references, bibliographic metadata, the publication boundaries, and some other patterns. There was also an implementation of the ResourceSync framework, version 1.1, for DSpace versions 5 and 6, and an implementation of the ResourceSync framework for the Samvera repository platform. This is also the reason why this challenge suggests, in particular, an implementation of ResourceSync for EPrints, because this platform currently lacks support for the framework, and also an implementation of Signposting for DSpace versions 5 and 6. With regard to open metrics and usage statistics, Paolo already introduced the OpenAir metrics service. However, there is still potential to improve the service and the way the statistics are visualized. In this activity we collaborate with IRUS-UK, who use their own tracking service and tracking protocol, while in OpenAir we use the Matomo analytics service. The effect is that we are following different tracking protocols, and it also requires dedicated plugins for different software platforms.
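Signposting works by exposing typed links, typically in HTTP `Link` headers, with relation types such as `author`, `cite-as`, and `item`. As a sketch of what a client consuming a Signposting-enabled repository might do, here is a simplified parser for such a header; the header value below is a made-up example, and the regex handles only the common `<uri>; rel="..."` shape, not the full Web Linking grammar.

```python
# Sketch: parsing an HTTP Link header of the kind a Signposting-enabled
# repository exposes. Simplified: assumes each link is <uri>; rel="...".
import re

def parse_link_header(header):
    """Return {rel: [uri, ...]} from a Link header value."""
    links = {}
    for uri, rel in re.findall(r'<([^>]+)>\s*;\s*rel="([^"]+)"', header):
        links.setdefault(rel, []).append(uri)
    return links

# Hypothetical header for one article landing page.
header = ('<https://orcid.org/0000-0001-2345-6789>; rel="author", '
          '<https://doi.org/10.1234/abcd>; rel="cite-as", '
          '<https://example.org/article.pdf>; rel="item"')
links = parse_link_header(header)
print(links["cite-as"])  # ['https://doi.org/10.1234/abcd']
```

This is the kind of machine-actionable navigation the challenge asks implementers to add to EPrints and DSpace: a crawler can follow `author` to an ORCID, `cite-as` to the persistent identifier, and `item` to the content file without scraping HTML.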
We want to overcome this issue by aligning these tracking protocols and by providing a more generic tracking plugin, to reduce the effort of implementing and maintaining the plugin for many different software platforms. The OpenAir usage statistics service is able to track usage events from individual repositories, and also from open access journals running on OJS, or alternatively to collect COUNTER reports. Regarding the cleaning of usage events, we currently support release 4 of the COUNTER Code of Practice, and we are planning to support release 5, and the COUNTER Code of Practice for Research Data, in the next few months. The statistics are not only presented in the portal, at the level of data sources or at the article or item level, but also exposed via a SUSHI-Lite API endpoint. At the moment we count about 70 data sources that we track, and around 80 repositories that we gather COUNTER reports from via IRUS-UK. Still, there is a need to scale up the service, to promote it more, and to include more data sources participating in usage statistics. One approach which, on our side, would make life a little bit easier is to establish statistics nodes on a national or regional level. Examples are the IRUS nodes in the UK, Australia, the USA, and New Zealand, and we also collaborate with La Referencia on establishing a usage statistics node in Latin America. The idea is then to connect and share usage statistics via those regional nodes. Another quite important aspect is to support usage statistics in other infrastructures in the European Open Science Cloud as well. Then, as I mentioned, one issue is to identify and support different techniques and methods for visualizing usage data. This can be done in different contexts, for instance visualizing statistics by item or work type, relating them to author networks, or relating statistics on different topics at the level of data sources or countries.
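The central aggregation behind the regional-node idea, one article downloaded from several sources, with totals computed centrally, can be sketched as merging per-node counts by a shared identifier. The report shape below is hypothetical; real reports follow the COUNTER Code of Practice and are exchanged via SUSHI endpoints.

```python
# Sketch: merging COUNTER-style usage counts from several statistics
# nodes into per-item totals, keyed by a shared identifier (here, DOI).
# Report format is invented for illustration; real exchange uses SUSHI.
from collections import Counter

def merge_reports(reports):
    """reports: iterable of [(doi, downloads), ...] lists, one per node."""
    totals = Counter()
    for report in reports:
        for doi, downloads in report:
            totals[doi] += downloads
    return totals

# Hypothetical per-node reports for the same article.
irus_uk_node = [("10.1234/abcd", 40), ("10.1234/efgh", 5)]
latin_america_node = [("10.1234/abcd", 12)]
totals = merge_reports([irus_uk_node, latin_america_node])
print(totals["10.1234/abcd"])  # 52
```

Keying on a persistent identifier is what makes cross-node aggregation possible at all, which is one reason the deduplication and identifier work described earlier in the call matters for the statistics service too.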
Another idea, in order to go beyond usage statistics, is to integrate with other kinds of metrics and alternative indicators. One example of what could be of benefit is integration with the OpenCitations data, but also with alternative indicators, as long as they relate to a scholarly context. The problem here is that this requires that we not only show, let's say, the number of tweets or mentions, but also provide a certain context in which to interpret this kind of event. Are there any questions so far? If there aren't any questions from the participants, and if the presentation of challenge one is complete, then we can proceed to additional comments from the other challenge champions. I believe this is Neil, judging by the sound when the microphone is open; you can always write to us, Neil, because we can't hear you. So I can see that so far there are no questions. Are there any additional comments from the other challenge champions? Okay, I will take this as a no. Do you have something else to add on your part? Okay, so thank you very much for your presentations. To complete this session, I will just state some general things concerning the framework of the OpenAir PCP call for tender, so that the participants have a general picture of the procedure. PCP means that public procurers challenge innovative players in the market, via an open, transparent and competitive process, to develop new solutions for a technologically demanding, mid to long term challenge that is in the public interest and requires new R&D services. We chose this form of call for tender because it gives more space for R&D services to be developed. The PCP incorporates solution design, prototyping, original development, and testing.
You can see in detail, in our tender text, the phases that you will follow, and if you have already visited our site, you will have seen that there are also some annexes that you have to complete together with your proposal. Here you can see the deadline for the submission. I want you to know that you can send your proposals and the annexes to openair at corallia.org by email. I have received some questions so far concerning the form of these annexes, so, for your information, it is not necessary for you to complete all of them. You just have to see which annexes are suitable for the form of your team, and then complete the ones that concern your profile. You have the deadline, and it is important for you to clarify the form of your team. That means, for example, that if you are not a company yet and you are a natural person, but you are in the process of registering your company following the administrative procedure of your country, it is important that the registration of your company is completed before the deadline. So if you are wondering whether to submit your proposal as a natural person or as a legal person, it is important for you to know that the form of your team has to be compatible with your status as an economic operator on the day of the submission of the proposal. If you have other questions concerning technical or strategic points, you can address them to us now. But before your questions, one more tip concerning the registration on the Moodle platform; Panos, do you want to tell us? Yes, hello. This session is recorded, as you know, and it will be available in the OpenAir course on the Corallia Moodle, so you will be able to find it there, along with other useful material that has been collected and that will be useful for you to come back to, in order to understand more about OpenAir in general.
Also, you will be able to contact us through there as well for any questions you may have. Thank you very much, Panos, for your comment. I can see here that we have a question; I will just read it for the minutes of this meeting: the deadline is really tight and unfortunately placed on a Sunday, which is a public holiday throughout the European Union; may we suggest moving the deadline forward a few days to facilitate wider participation in the call? This is actually the second time that we're receiving this question, so I suggest that we hold an internal meeting tomorrow in order to discuss this possibility. I won't give you a formal answer now, but all the participants of this call will receive a formal response to this question tomorrow, because we really need to be fair and to meet your needs concerning the PCP call for tender. We will discuss it, you will receive a formal reply tomorrow, and if we are going to change the date, we will also publish it online on the OpenAir website. From my side there are no more comments. I would like to ask one more time if there are comments, suggestions, or remarks from your side, either from the side of the challenge champions or from the side of the participants; and if there aren't, then we can close the meeting. Okay, I believe there are no more questions or comments, so I would like to thank you one more time for your participation. I would like to thank the challenge champions for their comments and for their detailed presentations, and the participants for your questions. We will talk again tomorrow in order to inform you about what we're going to do concerning the deadline of the OpenAir Advanced PCP call for tender. Thank you very much, and have a nice evening.