Welcome, everybody. Let me start. What we are trying to do in OpenAIRE, among several other activities, on the technical side, is to build this OpenAIRE Research Graph. Just to contextualize a little, the idea is to materialize what we call the open science graph. The open science graph is intended here as the metadata graph that researchers build every day when they deposit their digital scientific products into repositories worldwide or on the publishers' websites, in the publishers' repositories, of course, when they have a paper accepted, or when they deposit research data somewhere, or software, and so on. When they do this, they describe these objects with metadata and, most of the time, unfortunately not always as we would like, they also specify links to other objects, which are not necessarily stored in the same data source. Whenever we act like this as researchers, we are implicitly building this graph, what we call the open science graph. We are interconnecting these objects. We are also interconnecting them with real-world entities: when we use an ORCID iD, for example, we are specifying that an object is authored by a given author, or we can link the object to the research funding behind it, or the funder behind it. So even if we deposit a single product somewhere in our local repository, we are implicitly taking part in building this huge graph. What OpenAIRE is trying to do is to collect all this metadata and these links worldwide and materialize this graph. The magic is that once the graph is materialized, of course, you can build several applications on top. On the left side of this picture you see the high-level view, the view from the moon, of the OpenAIRE data model. This is just to let you know that we have this notion of product, which is a scientific product.
That is, the result of a scientific process such as research activities. We classify products into publications, namely literature: objects intended for humans to read. Research data are intended as the kind of objects that can be processed by an algorithm, by a service, by a process in general. They are intended for machines, although sometimes they can also be read, for example in XML, but that's the idea. Software is effectively code: it is what you can compile or interpret to run your program. So it's different from research data and it's different from literature. And other research products are those that do not fall into these three categories: the kind of objects that you can find out there like workflows, protocols, and many others, which are mainly community specific. They have a naming and an understanding that is typical of a community. What we try to do here is to come up with a model that at least identifies the entities that have a common understanding across different disciplines. In other research products we tend to place everything that gives us at least a doubt. Of course, at any time we can dig into the other research products and come up with another entity type that we believe to be cross-discipline enough. We have experimented, for example, with virtual machines as another kind of class, but we were not successful, in the sense that there are different namings and not so many stored out there to justify an independent entity type. For each product, as you notice, we also keep track of the source, which is the source where the product has been stored or the source from which we collected the metadata about the product.
And then we have links to projects. Projects are linked to funding, funding streams in general, and to the funder. To give an example: the funder being the Commission, the funding being H2020, and the project being OpenAIRE-Advance. Organizations, as you can see, are linked to projects because they are beneficiaries of projects and grants, but they are also linked directly to products, so there is no author in between. We associate a product with an organization if an author of this product is affiliated with the organization. This is very useful for us for monitoring reasons. Communities are for us a very abstract concept, which in general we use to put together a number of objects that are somehow related to each other in terms of this notion of community. The community can be something very discipline oriented, and in this case we try to identify the chunk of the graph that is related to this discipline. For example, the projects related to, I don't know, European marine science, which we have, together with all the publications, data, software, etc. related to it, together with all the sources that work in this context, and so on. But it can be something else. For example, it can be an initiative of which you would like to measure the research impact. We have such examples, like EGI, the European Grid Infrastructure. This does not identify any discipline; in fact, it is a horizontal infrastructure. There we identify the part of the graph that is related to this initiative, and that is for research impact measurement.
How do we build this graph? First of all, we harvest. It's really collecting metadata from the sources. We have around 10,000 sources from which we collect today. Secondly, we are also following another, let's say, path compared to the past.
We are looking into the research infrastructures, which in Europe are, let's say, the electronic and networking systems, made of people and services, that are used by a specific discipline to perform everyday science. Here we have a few examples. We try to dig in there because in several cases these research infrastructures have services used by scientists to perform experiments. They actually have digital objects, but they don't store them, so they don't publish them. What we try to do is to catch these kinds of services and bridge them to the OpenAIRE Research Graph, so that these objects are revealed to the world and interlinked with the rest. That is to facilitate reproducibility, for example. So we may have the digital representation of an experiment, linked to the input data and linked to the output data, to which we actually assign a DOI thanks to the deposition in Zenodo, and publish it in the graph. This was never done before, so that's a step ahead with respect to the past.
When we collect the metadata, it is in its very original form, and of course this is not enough to build a graph. The metadata has to be harmonized, towards a common understanding of format, vocabulary, and so on. It needs to be deduplicated, because in several cases we collect metadata records describing the same object from different sources, and these metadata records are different: one may be richer than the other, for example, or may specify information in different languages, and so on. Then we perform a lot of mining. The idea is that, when we can, we also collect the PDFs, the full text behind the metadata record. We are collecting today around 12 million open access full texts, sometimes straight from the metadata records because the data source allows us to do so.
Sometimes it is thanks to face-to-face, bilateral agreements with the publishers, which provide us with their data, and we can return information to them. Via mining, we actually enrich this graph a lot. We obtain metadata information and links between the records that are not provided anywhere out there: namely, links between projects and products; links between products and products, like publications linked to software or publications linked to research data; or links between products and communities, organizations, and so on. So we tend to enrich. On top of this graph we then offer a number of applications, but others can build their own added-value applications. We build tools for monitoring, for example, research impact with respect to a funder, a community, or an organization. But we also offer services to share and interconnect for the research communities, so the community point of view, in order to identify the products and the research entities in general related to the community. We also provide APIs through which you can download the graph, play with it, and possibly give us feedback. The graph will be used to offer the scientific product catalogue for the EOSC. The idea is that the EOSC will of course provide several scientific product catalogues, but this one will probably be the broadest and largest, being cross-discipline.
So, what is the OpenAIRE Research Graph? It's about metadata: we are collecting metadata and links between the metadata, namely the links between the objects described by such metadata. It's about scientific products. It contains information about open access: whether an object is open access or not, or at least to what degree. It is linked to funding information and research communities. And you can see the properties here: it's open, complete, deduplicated, transparent, participatory, decentralized, and trusted.
Let me go through these one at a time. First of all, it's open. By this we mean that we want to export it as CC0 as much as possible. In some cases this is not possible and we have to expose it as CC BY, because some of the sources impose that: the metadata we collect from them obeys stricter copyrights. Complete: because we try to include in the graph all the sources (these are just samples) that we believe are trusted by scientists. Scientists use them to find information and to store information. If there is one such source, we want to have it in the graph. Examples here range from OpenCitations to Microsoft Academic, Unpaywall, DataCite, Crossref, and ORCID, for example, for research entities. But we have many more: the thematic repositories, data sources from the research infrastructures, all publishers in general, especially the open access ones, aggregators, GRID, for example, for the organizations, re3data as a directory of sources, and so on. When we collect it, we want it to be deduplicated. The logic behind it, in simple terms, is that when we find a group of records that we believe are the same, and we do this by matching of course, we want to merge them. We obtain only one object out of them, and we preserve, of course, the provenance of all the objects that have contributed to it, and include all the information that we can obtain from these records. So we build a potentially richer record, thanks to the union of the information coming from the rest. Participatory: we want everybody who is willing to participate in the graph to be able to offer metadata. Content providers can come and offer their metadata, and we classify them according to a taxonomy, which I'm not going to describe here, but you can access the graph taking it into account, if you want, for example, only institutional repositories, or data repositories, or aggregators of institutional repositories, or whatever.
Transparent: we keep provenance, as I just mentioned, and we also keep a level of, we call it trust, so reliability indicators. Each record has information about where it comes from; if it is obtained from different sources, we have all of them. And at the level of the fields, we know if a field has been inferred, for example, by a mining algorithm, and when, and we have a level of trust. For each field we also state how confident we are about its reliability, based of course on double checking and on a validation process of the algorithms. That is provided as a number between zero and one. So you can also view the graph based on the minimum level of trust that you want to assume.
Decentralized: that's another important thing. This graph is intended to be a public good. We are building it today, but at some point, hopefully not, OpenAIRE may cease its activities. So the idea is, from now until that moment, to redistribute the content of the graph to the original sources. We do this by exchanging information with other graphs similar to ours, but we also do it using brokering services. The broker service that we have is capable, given a data source that provides metadata to OpenAIRE, of returning to this data source what we found in OpenAIRE that wasn't available in the original record. The original data source can subscribe to specific kinds of enrichments, for example: give me the DOIs that I don't have, or give me the open access versions, the URLs, that I don't have, the links to the datasets that I don't have, and so on, give me the links to the projects.
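The broker idea just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpenAIRE broker itself: the record layout (a dictionary from field name to values), the field names such as `doi` and `project`, and the functions `find_enrichments` and `select_for_subscription` are all hypothetical. The sketch compares what a source already holds with what the graph holds for the same record, and returns only the missing values that match the subscription and reach a minimum trust between zero and one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enrichment:
    record_id: str
    field: str    # kind of enrichment, e.g. "doi" or "project" (hypothetical names)
    value: str
    trust: float  # confidence between 0 and 1


def find_enrichments(record_id, source_record, graph_record):
    """Return what the graph knows about a record that its source does not."""
    found = []
    for field, candidates in graph_record.items():
        known = set(source_record.get(field, []))
        for value, trust in candidates:
            if value not in known:
                found.append(Enrichment(record_id, field, value, trust))
    return found


def select_for_subscription(enrichments, subscribed_fields, min_trust):
    """Keep only the enrichments a source subscribed to, above a trust threshold."""
    return [e for e in enrichments
            if e.field in subscribed_fields and e.trust >= min_trust]


# Made-up example data: the record as held by the source ...
source_record = {"doi": [], "project": ["project-A"], "open_access_url": []}
# ... and the same record in the graph, each value with its trust.
graph_record = {
    "doi": [("10.1234/example", 1.0)],
    "project": [("project-A", 1.0), ("project-B", 0.6)],
    "open_access_url": [("https://example.org/paper.pdf", 0.9)],
}

all_enrichments = find_enrichments("rec-1", source_record, graph_record)
selected = select_for_subscription(all_enrichments, {"doi", "project"}, 0.8)
for e in selected:
    print(e.record_id, e.field, e.value, e.trust)
```

In this toy run the source subscribed to missing DOIs and projects with a minimum trust of 0.8, so it is notified of the DOI only: the second project link is below the threshold and the open access URL was not part of the subscription.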
These kinds of subscriptions can be created in OpenAIRE, and in return OpenAIRE will notify the content provider administrator with the list of enrichments, at the level of the records, just to be clear. So for each record we can tell you: we found this and this and this for you to enrich it.
This next thing is going on at this very moment: we are becoming an ORCID member. This means that next year, soon next year hopefully, users will be able to... well, today they can log in using their ORCID account but they cannot visualize their profile. We'll include the ORCID profiles of users in OpenAIRE. This means that the model that you've seen before will include authors, and especially ORCID authors. This will allow us to do a number of things. First of all, as an author with an ORCID iD, you'll be able to send the scientific products that you claim to be yours to your ORCID profile, and here in the OpenAIRE graph you have the view of everything out there. All the data, all the scientific products potentially assigned to you are there, so you have a global overview. The other important thing is that, thanks to what we call the merge and the propagation techniques that I'll show you later, we can actually recommend to the authors a lot of scientific results that are not in ORCID today. The authors can take advantage of this and enrich their ORCID profile. And in doing this, they are actually double checking our work, so they can also tell us: no, guys, you're wrong; you assume this article is mine, that the ORCID iD specified here is mine, but it is wrong. This will help us build more trust around the graph and more precision.
So, populating the graph. The idea here is that we harvest, okay, but we are not harvesting as we used to do and as many others are doing today.
The basic assumption behind harvesting is typically that all objects coming from a data source, especially in the scholarly communication domain, are of a given type. For example, you collect from an institutional repository, so what you get are publications. You collect from figshare, and what you get is data. You collect from GitHub, and what you get is software. This is generally not true. I would not say 100%, but maybe very close to 100% of the repositories or archives out there are hybrid. We have several hybrid repositories, especially in the public domain. So when I collect from an institutional repository, I will likely find software, data, and other things, not only articles. What we are doing in OpenAIRE is to consider every single source as a potentially hybrid source. We do a fine-grained classification. When we collect an object, we know the original source and we know the resource type, and we try to map this resource type, given the original data source typology, into its specific class: a publication, a dataset, or software. This is quite challenging and we are improving it constantly. We have a shared mapping between ontologies, which we enrich and keep up to date every day by analyzing the original metadata, for example identifying the resource types we haven't yet mapped into one of the meta classes (publications, data, software, et cetera).
The other, let's say, methodology that we have put in place is the one that I mentioned in the first slide at the beginning. We are trying to dig into the research infrastructures, into the experimental workflows that scientists use there, and to bridge their thematic services, in order to make them seamlessly publish into OpenAIRE the digital products they produce. So here is an example.
On the left, you have a thematic service that takes a dataset and a method plus parameters and produces a dataset. Today, most likely, the datasets will be published manually by the author: if the dataset is mature enough, if the author is happy, the author will have to publish it somewhere in a data repository, I don't know, figshare, and link it to the article. In what we call a continuous publishing procedure, the thematic service has been modified in order to publish the whole experiment on behalf of the author. If the author is happy with the result, then by simply ticking an option, the thematic service itself will publish everything into Zenodo. That means a representation of the experiment: a file that represents the experiment itself, which states, for the service, what the parameters and the time of execution were, and points to the input dataset and method, if these are published (otherwise these will also be published), and points to the result. And the result, the dataset, will also be published as an independent object, with an inverse link to the experiment object. In some of the experiments that we are performing with EPOS, for example, the digital object representing the experiment can actually be fed back to the thematic service in order to re-execute the experiment. The thematic service is able to interpret this digital object and configure itself in order to repeat the experiment, because this is what they wanted to do. If the thematic service does that on your behalf, of course, it is much simpler for the scientist: the process is facilitated, and the likelihood of having everything published out there for the repetition of the experiment is higher. And of course, once we publish in Zenodo, Zenodo is in the graph, the OpenAIRE Research Graph, so the objects will be visible from there. Okay, that's the experiment I just mentioned.
Now, we're not collecting all sources in the same way.
That's because some of them are huge, and it does not make sense to collect them and include them in the graph all together. So we do a lot of preprocessing, especially on the side of links, article-dataset links, and Crossref, because Crossref is the largest collection we have. These two collections are available today. We publish them every six months, although this is a lie, because we didn't manage to publish the others every six months. That's not because we don't want to; with the deadlines and the kind of activities we are doing, it is really hard to live up to the expectations. But new releases will be out soon, this month.
ScholeXplorer is basically the largest collection of authoritative links between articles and datasets worldwide. We are collecting DataCite, the whole of DataCite, so the links in there; the links in Crossref Event Data, which are the ones coming from the publishers towards the datasets; and the links from EMBL-EBI. So we are collecting the whole collection of links, and we are putting them together, also with the ones from OpenAIRE, because some of the links actually come from the repositories or from the mining that we do. So we have 400 million links, bilateral links, so it's 900 and whatever, 60. It's a lot of stuff. And this is, of course, an open service. You can use the API to resolve links: you can send a DOI to the API and get back the list of objects linked to this DOI, or you can send a PDB identifier and get back the list of article DOIs linked to it. This is a service used today by all publishers, well, most publishers, and by Scopus, so it seems to be very useful for business as well. And of course, you can download the collection and play with it, do your experiments.
DOIBoost, instead, is the union of Crossref, Unpaywall, Microsoft Academic Graph, and ORCID. The idea is that we start from Crossref as a pivot collection.
We attach to it all the open access versions coming from Unpaywall, and all the richer information coming from Microsoft Academic, for example the subjects, the abstracts, the affiliations, the ORCID iDs, the Microsoft Academic IDs, etc. And from ORCID, we attach the ORCID iDs to the Crossref records. Since every ORCID iD has a list of publications, we basically build the inverted index: for each publication, we build the list of ORCID iDs linked to it. And we merge it with Crossref, obtaining quite interesting results: thanks to this process we increase, of course, the number of ORCID iDs in Crossref. We have around 85 million publication records, and that's due to the fact that some of the publications in Crossref are quite poor, so we exclude them from the processing. They are completely missing dates, or authors, or that sort of stuff.
Context propagation. Once we build the graph using what I just mentioned, the original links, the original metadata records, the deduplication, and so on, we're not finished yet, because there are other interesting things that we can do. We can actually propagate information in the graph. These are just three examples. Take the top subgraph that you can see: if I have a project, and I know that a product is funded by this project, then if the product is linked with a supplemented-by relationship to another product, I can easily extend the association with the project to the second product. For example, a publication linked to a dataset: I know the publication is linked to a project; if the publication is supplemented by the dataset, then the dataset is associated with the project. And I can draw the same kind of conclusions with organizations, for example. That's pretty easy, especially if I know that the authors are the same between the two products. I can propagate the country, and that's another interesting perspective.
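The project example above (a publication funded by a project and supplemented by a dataset) can be written down as a small Python sketch. It is a simplified illustration, not the OpenAIRE implementation: the `Product` class, the `propagate_projects` function, and the `decay` factor are assumptions made here. The decay stands in for the idea that an inferred association deserves less trust than the one it was derived from; the value 0.8 is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    pid: str
    # project id -> trust (0..1) in the product-project association
    projects: dict = field(default_factory=dict)


def propagate_projects(products, supplemented_by, decay=0.8):
    """Copy project links across supplemented-by relations, one hop at a time.

    products: dict mapping a product id to its Product
    supplemented_by: list of (source_pid, target_pid) pairs
    decay: hypothetical factor that lowers trust for inferred links
    Returns the (target_pid, project_id, trust) triples that were added or raised.
    """
    added = []
    for src, dst in supplemented_by:
        for project, trust in list(products[src].projects.items()):
            inferred = trust * decay
            # Never overwrite an association we already trust at least as much.
            if inferred > products[dst].projects.get(project, 0.0):
                products[dst].projects[project] = inferred
                added.append((dst, project, inferred))
    return added


# Made-up example: a publication funded by a project, supplemented by a dataset.
products = {
    "publication-1": Product("publication-1", {"OpenAIRE-Advance": 1.0}),
    "dataset-1": Product("dataset-1"),
}
added = propagate_projects(products, [("publication-1", "dataset-1")])
print(products["dataset-1"].projects)
```

After the call, the dataset carries the same project as the publication, with a lower trust than the original association. The same pattern would apply to other context, such as organizations or countries, by swapping the propagated attribute.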
I can actually tag every single product with a country tag, and I can propagate this country tag and therefore implicitly obtain an aggregation at the country level: the chunk of the graph that links all objects to the country. That's very interesting because it goes beyond the notion of national aggregators that we have today. Today, if you want to be the national aggregator, you collect content from the repositories. But there's much more beyond that. For example, in OpenAIRE you can find all products that are linked to a national project; these are implicitly associated with the country. And through propagation, you can really broaden the number of objects linked to a country. You can do the same with ORCID iDs. Following the example on top, if you know that a product is associated with a list of ORCID iDs and is linked to another product where the author names are the same, there is a very high probability that the ORCID iDs can be propagated, especially if all names match the names of the first product. That's typical of the relationship between publications and datasets in many domains: the author set is often the same. So thanks to these context propagation mechanisms, and of course including a level of trust, because the trust is not the same whenever we do propagation, it goes down, we can enrich the graph and make it even more useful for the world.
Today, you can see two graphs in OpenAIRE. One is the one in production, which is under explore.openaire.eu. This graph is smaller than the one we have in beta, and that's because we are showing the open access subset of the graph. This is due to the fact that OpenAIRE until recently focused mainly on open access content acquisition policies, so we were quite strict on that.
In beta, what you see follows the open science content acquisition policies, the new ones, published in Zenodo if you want to take a look at them; it's a document. There, everything is included, so the numbers are much higher because of that. In order to access it, you have to go to beta.explore.openaire.eu. Just add beta and a dot in front, and you get our beta services. We have a number of liaisons going on, with Microsoft Research, with Unpaywall, with ORCID, where we are becoming members. And we also have several activities in RDA, where we are discussing how these graphs, research graphs in general, knowledge graphs about research, should interoperate to exchange information. We strongly believe in this; I mean, this is the mission of OpenAIRE. It's very important for these graphs to interoperate, because behind each of them there are skills and knowledge that can contribute to each other's missions. It's not possible for one single initiative to achieve everything out there, so we need to cooperate. And I really believe that this is a very special season for scholarly communication, because we are moving to an open science framework. We are coming from, I think, a dark era where the giants, like the big publishers, were basically dominating whatever we were doing, especially in terms of sharing and evaluating scholarly communication, and I think today we have a chance to take it back into our hands. It's clear that the same big guys want to move over: they are moving towards data, they are moving towards reproducibility. I think we should not let this happen. So collaborate with these guys, because they are still important, but make sure that whatever we produce belongs to us, and make sure that we are the ones who validate what is valuable and what is not. These graphs are actually key for us, for the researchers, for the future of scholarly communication.
And we should treat them as such. We should be patient and build them together, improve them together. This is why we have now started an open consultation on the graph. Actually, it's November; we had hoped it would be October, so I have to update this slide. At beta.explore.openaire.eu you can find a link to a Trello installation, through which you'll be able to check several things. On the one hand, the roadmap: what is going to happen to the graph, what is happening, and what has happened to the graph. So the kinds of enrichments and improvements we are going to process in the near future and the ones we are currently processing. On the right side, instead, as you can see, you have ways to provide your feedback. We thought of organizing this in terms of the entities: you have the publications, the datasets, the software, the other research products, the organizations, and so on, going to the right. Under them, you can give us very detailed feedback on what you found wrong or bad. This will help us improve the graph and make it a better object for our own good. For example, you can write a comment on a specific activity; here it is metadata errors in publication metadata. Very, very important for all of us.
Okay, so thank you very much. My final remarks were the ones that I just mentioned. It's not that I have none, but I wanted to really recommend that, in general, as scientists we should pursue this logic of sharing and transparency. We should not rely on the mechanisms that exist already, or on new ones that publishers may provide. Okay, so thank you very much. So, if you have any questions. Gwen.
I think there is a question in the question and answer box. Okay. Yes, I can see one. Okay. So. Yes. That's a long story. They're not there today. They will be soon.
That's because what you can find today in terms of citations, because we do have citations, are the citations that we have inferred from the articles. In the articles, we identify the bibliography at the end, and we identify the elements of the bibliography that have a match in our graph. So you will find a link from the bibliography to the objects in our graph, but the bibliography is intended, let's say, as a property of the record. These are not intended as relationships like the rest, so a citation is not integrated as a normal relationship and cannot be browsed as such. These will become relationships very soon. We provide these relationships to OpenCitations, and we'll include in the graph the relationships that we already have from OpenCitations. Today, you can find links to OpenCitations in every single record: if you go to a record, you can find a link to its citations in OpenCitations. In the next phase we will merge in these contributions from our friends at OpenCitations, and we'll include our contributions in theirs. So the graph, yes, will include everything that is a citation and that is openly available. The question was whether the product-to-product links include citations, so that was my answer. We partly have them, but again, they are not represented as citations but as properties, and very soon we'll make sure that they become citations. You're welcome.
Any other questions? Don't be shy. I'll be here anyway, so you know my name. Okay: organizations, institutions as affiliations. What we keep are links between a product, say a dataset, and the organizations of the authors of that product. This is key, for example, to measure the research impact of an organization. And that's the next step. So why don't we include the author in the chain? For several reasons.
The first one is that metadata doesn't in general provide that. DataCite does, but the majority of the metadata we collect from regular repositories, like institutional repositories, thematic repositories, or from the publishers, doesn't include that. The second aspect is that via mining it's pretty easy to find an organization in the header of a paper; it's harder to find the link between the organization and the author of the paper. So we decided to go for this in the beginning, because it's pragmatically very useful across the different use cases, and it can be acquired as a piece of information.
Is the API already working for the beta version? I'll leave it to Alessia to reply. I think we only have dumps, which are being increasingly produced.
Is the beta API working? Yes, we have the OAI-PMH API of beta, which is working. I'm not sure if we have added the link in the documentation page for that. However, I would strongly suggest using the dumps that are available on Zenodo, because the graph is really huge. So I think it's better for developers and users to get the dump and work directly with the dump. But I will make sure that the link to the OAI-PMH API is available on the documentation page.
There's another question from a participant: do you plan to link organizations with funders? Well, funders are considered as organizations of a special kind. I'm not sure which kind of links you are referring to; maybe you mean projects rather than funders. Today, funders are linked to funding streams. And of course a funder, the Commission itself, is an organization. They have a jurisdiction, so we know they refer, for example, to a country, or to Europe, or to the US. We have 29 funders today, and for each of them we collect the projects. And for each project of each funder we mine the publications to find the links. And we find, I think, around 400,000 links.
So those mined links add a lot in terms of enriching the original graph. This information is of course not available in the metadata in 90%, even 99%, of cases. And it's also interesting, for example, to identify double funding. Well, quadruple funding, quintuple funding, because we have cases where we have like six or seven funders for one article. And it's quite interesting to investigate. Antonia Correa: yes, we use GRID. We use GRID identifiers, but we go beyond. Of course I couldn't go into every single detail, but the problem is quite challenging because we collect organizations from different sources and they tend to use different IDs. Okay, so the Commission uses PIC IDs. And since we are serving the Commission, we have to keep the PIC IDs. When we collect organizations from OpenDOAR, for example, they have their own understanding of what an organization is. And the same holds for other sources, and for all funders in general: every funder has organizations inside, and they have their own understanding. So we had to deal with the deduplication of organizations. And that's pretty hard to do. So what we did, and we are doing it at this very moment, is the following. Today, if you go to the website, you will find a deduplication, right? And the deduplication is not necessarily correct all the time, but the hardest part is that it's not stable. So every time we run the deduplication again, the result may vary, right? So what we're building today is a database that stabilizes the results of this deduplication. So we store the results of the deduplication somewhere and we curate them. And then the validated part of this database is fed back into the deduplication and always wins, right? So we can only enrich it. And we have a number of curators, namely the NOADs. We'll give them an account so they can work at the national level to fix, or actually to bridge across, different IDs. So ISNI is already there because it's part of GRID, and Ringgold as well.
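The idea that the validated part "always wins" and can only be enriched can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual OpenAIRE implementation: it assumes curated groups are non-empty and disjoint, that the automatic step emits pairs of equivalent IDs, and it uses a union-find structure that refuses any merge that would join two distinct curated groups. The function name and the ID strings are hypothetical.

```python
from collections import defaultdict


def stable_dedup(curated_groups, automatic_pairs):
    """Merge automatic equivalences into curated groups without breaking them.

    curated_groups: iterable of non-empty, disjoint sets of organization IDs
                    validated by curators (assumption of this sketch).
    automatic_pairs: iterable of (id_a, id_b) proposed by the automatic step.
    Returns a list of sets, one per resulting organization.
    """
    parent = {}
    label = {}  # component root -> index of the curated group it contains

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        la, lb = label.get(ra), label.get(rb)
        if la is not None and lb is not None and la != lb:
            return  # would join two different curated groups: curation wins
        parent[rb] = ra
        if la is None and lb is not None:
            label[ra] = lb

    # Curated groups go in first and are never split or merged afterwards.
    for index, group in enumerate(curated_groups):
        members = sorted(group)
        label[find(members[0])] = index
        for member in members[1:]:
            union(members[0], member)

    # Automatic results may only enrich; sorting keeps the run deterministic.
    for a, b in sorted(automatic_pairs):
        union(a, b)

    groups = defaultdict(set)
    for org_id in list(parent):
        groups[find(org_id)].add(org_id)
    return list(groups.values())


# Hypothetical identifiers, for illustration only.
curated = [{"grid:1", "pic:9"}, {"grid:2", "pic:7"}]
automatic = [("grid:1", "opendoar:5"), ("opendoar:5", "grid:2"), ("isni:3", "isni:4")]
result = stable_dedup(curated, automatic)
```

In this example the automatic step can attach `opendoar:5` to the first curated group, but its second proposal, which would transitively fuse the two curated groups, is dropped.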
The curators will be able to bridge these to PIC IDs, to OpenDOAR IDs, and so on. And this database, of course, will be public, so anybody can access it and take advantage of it. So, the format of the data in OpenAIRE. Well, we expose it today in different ways, ranging from XML to JSON, and it obeys a data model. This data model can vary depending on the format and is always available. We export the links in ScholeXplorer as JSON that obeys the Scholix format, which is a format defined in RDA for the exchange of links. And when we export the full XML of OpenAIRE, this XML obeys a metadata format that is of course made available and described together with the collection. Well, we also have Linked Open Data. Linked Open Data is available, but for the moment it's a subset of the graph. And we'll make the whole set available thanks to work that is being done in Athens, at the University of Athens, to provide a scalable technology for Linked Open Data. Because you may understand that the graph, with what we have today, almost 200 million objects, expanded into Linked Open Data can be quite demanding. We already have the SPARQL endpoint, so you can go to lod.openaire.eu. And as I mentioned, this is only a subset for the moment, because Virtuoso doesn't scale up to our numbers. But again, Maria, we are trying to do better. So we are actually developing technologies to make this fully available as Linked Open Data. Okay, so thank you all, really. It's really important that you showed up today and that you had all these questions. Really, keep on asking questions, and send us your complaints, send us your doubts, because it's actually very important. We want this to be ours, I mean, in the sense of all of us, including you. Okay.