Hi everybody. Can you see the presentation? Yes, we can see it and we can hear you. Great. Thank you everybody — I see lots of people attending the webinar today, so thank you for being here. As anticipated, I'm going to talk about the OpenAIRE Research Graph, and I want to start by saying that one of the many activities we do in OpenAIRE is focused on the construction of this graph. Since I'm not really aware of the background of the people attending today, let me just say that a graph is a mathematical way to model reality, or the aspects of the domain you are talking about. Essentially you have entities (nodes) and edges that connect different entities together. In our case the domain is scholarly communication, so the entities we deal with in OpenAIRE are the ones you can see on the left of the screen. Much like Facebook, where people are nodes and friendship is the edge connecting them, here we have different entities connected through semantic relations. One of the central, key entities we have in OpenAIRE is the research product. For this entity, and this one only, we do a sub-classification. We have literature on one side — publications, which are intended to be consumed and read by humans, researchers and scientists around the world. Then we have research data, which is anything that can be processed by machines; we intend it as machine-readable, not strictly human-readable, even though of course you can read through a CSV yourself. So, we were saying: research data.
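The entities-and-edges idea can be sketched in a few lines of code. This is a toy illustration of the model only — the node kinds, relation labels and identifiers below are invented for the example, not OpenAIRE's actual schema:

```python
from dataclasses import dataclass, field

# A miniature scholarly graph: typed nodes plus labeled (semantic) edges.
@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # e.g. "publication", "dataset", "project", "organization"

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add(self, node):
        self.nodes[node.id] = node

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbours(self, node_id, relation=None):
        # All targets reachable from node_id, optionally filtered by relation.
        return [t for s, r, t in self.edges
                if s == node_id and (relation is None or r == relation)]

g = Graph()
g.add(Node("pub1", "publication"))
g.add(Node("proj1", "project"))
g.add(Node("data1", "dataset"))
g.relate("pub1", "isProducedBy", "proj1")
g.relate("pub1", "references", "data1")
print(g.neighbours("pub1"))               # ['proj1', 'data1']
print(g.neighbours("pub1", "references"))  # ['data1']
```

The point is only that entities of different types are linked through named, semantic relations — friendship in Facebook, "isProducedBy" or "references" here.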
Then we have software, which is basically anything that can be compiled, interpreted or run in order to do some kind of elaboration. And then we have a catch-all category, the "other research product", which is intended to host all the research products that are not strictly any of the three categories I just mentioned — for example protocols, workflows, slides, and anything that might be very domain-specific and is not broadly accepted as a well-recognized entity across science domains and disciplines. Then you can see other familiar names: projects funded by certain funding bodies and funding streams, organizations participating in research, and communities, which is something we introduced because we work mainly with and for communities of researchers. These can be, for example, thematically driven — say, marine research or a biology community — or research infrastructures, which are communities themselves. For example EPOS, the European Plate Observing System, dealing with seismology, earthquakes and other aspects of geology, is itself a community. So, in order to build this graph: as researchers, we daily provide pieces of information that feed into it. Every time we deposit a publication in a repository, every time we release our data to be openly accessible, every time we claim that a certain result has been funded by a certain project, or that we are affiliated to a certain university rather than another one, we are seeding pieces of information that concur to building this graph. In order to materialize it, what we do is harvesting. We do that in a couple of ways. On one side, we go through what we call institutional repositories — any open access repository that is exposed by universities or other research centers.
On the other side, we have thematic repositories and repositories run by research infrastructures, or by other infrastructures that operate cross-community, like Zenodo, a catch-all repository developed by CERN, or EPOS, or DARIAH, which is for cultural heritage, and ELIXIR — just a few examples of research infrastructures we deal with. When we get all this raw material, this raw information from the harvesting process, we have a bunch of processes that concur in the progressive enrichment of the information space. We have deduplication, which tries — and succeeds — to put together records that relate to the same item, the same publication for example. It does so to have a better representation of all the information that is around: it doesn't overwrite anything, it just tries to put together all the different descriptions. For example, if a publication has multiple titles, or different author lists, or different identifiers, deduplication tries to put everything together and build a better representation of the record itself. We have end-user claims and user feedback: registered users on OpenAIRE come to our website and claim pieces of knowledge themselves — that this publication can be related to this funding, that a specific author participated in the writing of a certain paper, and so on and so forth. And we have a number of mining processes that try to infer missing links that may have gone lost in the original source of data, but that can still be inferred by running some NLP — sorry for the acronym, natural language processing — over the PDFs.
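The merge-don't-overwrite behaviour of deduplication can be sketched roughly as follows. This is a deliberately naive illustration — it only groups records that share an identifier, whereas the real OpenAIRE deduplication (documented in the references on the slide) also uses fuzzy matching on titles and authors:

```python
def deduplicate(records):
    """Each record is a dict with 'ids' (set of identifiers), 'title', 'authors'.
    Records sharing any identifier are merged; every variant is kept."""
    groups, index = [], {}  # index maps identifier -> position in groups
    for rec in records:
        pos = next((index[i] for i in rec["ids"] if i in index), None)
        if pos is None:
            pos = len(groups)
            groups.append({"ids": set(), "titles": set(), "author_lists": []})
        g = groups[pos]
        g["ids"] |= rec["ids"]               # accumulate all identifiers
        g["titles"].add(rec["title"])        # keep every title variant
        g["author_lists"].append(rec["authors"])
        for i in rec["ids"]:
            index[i] = pos
    return groups

merged = deduplicate([
    {"ids": {"doi:10.1/x"}, "title": "On Graphs", "authors": ["A. Rossi"]},
    {"ids": {"doi:10.1/x", "oai:repo:1"}, "title": "On graphs",
     "authors": ["Rossi, A."]},
])
print(len(merged))  # one group; both title variants and author lists survive
```

Nothing is discarded: the merged record carries all identifiers, all title spellings and all author-list variants, which is exactly the "better representation" idea described above.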
So that way you can infer missing links between articles and projects, articles and institutions, and so on. Once the graph is ready, it is used as fuel to serve certain applications under the umbrella of OpenAIRE, namely MONITOR, CONNECT, EXPLORE and DEVELOP — but this doesn't mean it cannot fuel other, third-party applications. In the OpenAIRE ecosystem, MONITOR is for monitoring, for example, research impact: it is used by funders to understand how the initiatives and projects they are financing are going throughout time. That's just in a nutshell — it's much more complex than that, but just to give a brief overview. CONNECT is for giving feedback to communities and providing a slice of the graph customized to their own view. As I was saying before, EPOS works with geology and seismology, so they can have a preferential view of the whole OpenAIRE graph with only the results pertaining to their community; it gives stats and customized search tailored to the needs of the community. EXPLORE is the search engine that you might know already: you just go there and it's like Google Scholar, but running over the OpenAIRE Research Graph. And DEVELOP is intended to serve, through API calls, the needs of all the developers that might want to query in a programmatic way the information space, the graph produced by OpenAIRE. The graph is also going to be part of the European Open Science Cloud as its scientific product catalogue. It's something we are pushing for, and hopefully in the near future the entire OpenAIRE Research Graph is going to take part in the EOSC core services as the main scientific product catalogue. So, these are the key properties of the OpenAIRE Research Graph. The main one is that it is intended to be open.
All the material we get, we release under CC0, so literally everyone can take it and do whatever they prefer with the data we expose. Some parts of the material we have and distribute cannot be redistributed under CC0; they are rather distributed under CC BY, and this is because we receive them under this more restrictive license. We couldn't possibly redistribute them under CC0, because that would mean relaxing constraints we are not entitled to relax. For example, all the information we get from the Microsoft Academic Graph is distributed to us under the Open Data Commons Attribution (ODC-By) license, so it cannot be redistributed under CC0, but it can still be redistributed by us, and you can use it provided that you reference the Microsoft Academic initiative back. The second property of the OpenAIRE graph is that it is complete — or rather, we have been trying for the last decade to have it as complete as possible. I'm pretty sure you can find plenty of logos here that you know. We rely on sources that are trusted in the scholarly world in order to get the information we need and construct the whole graph. You can see here the sources that are devised mainly for software; others that are mainly devised for providing scholarly metadata and bibliographic records; others that complete the information regarding authors; re3data and CORDIS — CORDIS mainly for projects, re3data for providing a reliable list of data sources and data providers — and so on and so forth. And also, on the right, thematic, community-driven, community-specific repositories and data sources.
As I was mentioning before, the graph is deduplicated, which means again that we try to provide a better representation of a single record by harvesting and putting together all the information that comes from different sites. I put a couple of bibliographic references that explain the deduplication process in detail, so if you are curious to see what we do and how we do it, there is plenty of information — everything is documented. One is a thesis and the other one is a poster, so there is plenty of material there disclosing the internals of the deduplication mechanism. We do deduplication mainly for scientific products and for organizations: publications that are merged together, and organizations, because an organization can appear with different names. We don't want, for example, our monitoring application to provide different results for the same university appearing under different name variations — we want all the indicators to merge into one. That's why deduplication is a key part of our monitoring infrastructure, for example. The graph is intended to be participatory, because anyone can take part in it, can be inserted in the OpenAIRE loop. Anyone that wants to provide material to OpenAIRE is welcome to do so, and in the same way anyone that wants to consume the OpenAIRE Research Graph is invited to do so. It is transparent, because for every piece of information we have in OpenAIRE we provide provenance information stating where the information comes from, along with reliability and trust indicators for the information we obtained from mining.
Because mining comes with a degree of uncertainty, we always label the inferred piece of knowledge with this trust information. So in principle you could prune the whole graph of anything below a certain trust threshold, restricting your observations to only what you consider of high enough quality for your application. It is decentralized, because our philosophy is that if OpenAIRE — hopefully not, but — ceased to exist, whatever we produced in the last 10 years should not disappear; that would just be a shame. The main idea is to redistribute and recirculate everything we do with other research initiatives that are building knowledge graphs and scholarly graphs, for example the three you can find at the bottom of the slide, and we also redistribute whatever we find back to the content providers that are providing us content. The brokerage service, which is part of the PROVIDE set of services of OpenAIRE, is one service intended to redistribute back to the content providers any kind of enrichment and missing information they might be interested in. A content provider can subscribe to certain events, and whenever, for example, we find that an original record was missing a DOI, or had a piece of information that has been enriched by our processing during the construction of the graph, we can notify it back in a very punctual way — every record is notified back. We can do that for every publication, for example, and send all this new, enriched information back to the original content provider.
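The trust-threshold pruning just described could look like this in practice. The field names and provenance strings below are invented for illustration — the actual dumps use their own schema:

```python
def prune(links, threshold=0.75):
    # Keep only links whose trust score meets the threshold;
    # provenance records where each link came from (mining, user claim, ...).
    return [l for l in links if l["trust"] >= threshold]

links = [
    {"source": "pub1", "target": "projA", "provenance": "mining", "trust": 0.91},
    {"source": "pub2", "target": "projB", "provenance": "mining", "trust": 0.40},
    {"source": "pub3", "target": "projC", "provenance": "user claim", "trust": 0.90},
]
kept = prune(links)
print([l["source"] for l in kept])  # ['pub1', 'pub3']
```

Because every inferred link ships with its score, the threshold is the consumer's choice: a high-precision application can raise it, a high-recall one can lower it, without OpenAIRE deciding for them.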
And again, it is also trusted, because users are kept in the loop. We have user claims, as I was saying at the beginning: anyone registered on the OpenAIRE portal can claim pieces of information. I can state that something has to be merged — two publications were not deduplicated properly and I want them together: I can claim that. If I discover that, on the contrary, two publications have been merged and they are not supposed to be, I can claim a split. If I know that a certain publication has been funded by a certain project, I can claim that too. All of this is, let's say, ground truth, and it can be fed into the creation of the graph by end users. Also, since December we are a member of ORCID, which means that at the moment you can only log in, but in the future we are going to link OpenAIRE and ORCID, in the sense that you will be able to send publications you find on OpenAIRE directly to your CV on ORCID. So you can curate, enrich and claim all the publications, datasets, projects, software and other kinds of products that are present in OpenAIRE, send them over, and curate your CV on ORCID.
Now, populating the graph. We do this in a way that is a bit different from what we've seen in other similar initiatives: every kind of repository we harvest from is treated as a hybrid source. At the beginning, when we harvested from an institutional repository, we assumed that pretty much everything was publications — and it turns out from experience that that's not really the case. We learned at our own expense that it is better to consider every repository we harvest from as a hybrid source. So we have mapping mechanisms in place that try to classify every single record that is harvested into one of the four categories you see here: publications, datasets, software and other research products. The mappings are public — you can have a look if you follow the documentation on the website. Another thing that is really peculiar, implemented by OpenAIRE, is the automatic bridging of research infrastructures and scholarly communication. What you traditionally have for a research infrastructure is what you see on the left: a research infrastructure running an experiment with certain methods and certain settings against certain datasets, with thematic services, possibly producing new datasets, and so on.
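The per-record classification step mentioned above can be sketched as a simple lookup. The type vocabulary and rules here are invented for the example — the real, public mappings are in the OpenAIRE documentation and are considerably richer:

```python
# Illustrative type vocabulary, only to show the idea of treating every
# harvested source as hybrid and classifying record by record.
PUBLICATION_TYPES = {"article", "thesis", "book", "preprint", "report"}
DATASET_TYPES = {"dataset", "collection"}
SOFTWARE_TYPES = {"software", "source code"}

def classify(record_type):
    t = record_type.strip().lower()
    if t in PUBLICATION_TYPES:
        return "publication"
    if t in DATASET_TYPES:
        return "dataset"
    if t in SOFTWARE_TYPES:
        return "software"
    # Catch-all: protocols, workflows, slides, domain-specific products...
    return "other research product"

print(classify("Article"))   # publication
print(classify("Workflow"))  # other research product
```

The catch-all branch mirrors the "other research product" category: anything a source exposes that doesn't fit the three well-recognized classes still enters the graph, just under the generic label.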
What we do in OpenAIRE, in strong liaison with the research infrastructures, is to plug a better scholarly communication process right into the infrastructure, so that whenever they run experiments, through OpenAIRE CONNECT they can automatically push all the parameters, input methods, datasets used and datasets produced — anything worth mentioning — into Zenodo. This means the experiment they have just run will be reproducible and transparent for the whole community. This is something that was missing: at the beginning, when only the left part was present, without the OpenAIRE CONNECT effort, the scientists of the research infrastructure had to manually sit down and push the records onto the pertinent repositories, which means that, because it wasn't automatic, they would seldom do it. With this automatic way instead, we realized — and the communities and infrastructures realized as well — that it was much better, and actually improving open science and the transparency and reproducibility of any research effort they are carrying out. Also, once any record that comes from a research infrastructure is fed into Zenodo, it is also collected by the OpenAIRE harvesting infrastructure and fed back into the Research Graph, which means eventually it will be processed and take part in the new version of the graph. This is just an experiment for the EPOS infrastructure, but it's exactly what I was telling you before: the research infrastructure automatically materializes every experiment and pushes the records onto Zenodo. Now, not every source that takes part in the graph is harvested: there are certain — a couple, actually — of big sources
that are pre-computed and pre-processed. One is Scholexplorer, which deals with publication–dataset links. This is done offline: while OpenAIRE has a harvesting service that runs periodically and updates everything, both Scholexplorer and DOIBoost are built offline, because they have to ingest huge quantities of data. Scholexplorer consists of around 480 million links, every link being between a publication and a dataset. It is actually one of the biggest collections of this kind — not the biggest, but one of this kind — and it is being used by Scopus: we have an API, so they literally hammer our endpoint in order to resolve DOIs and see, for example, which datasets are related to a given publication, and vice versa. The other one is DOIBoost, which is, in a nutshell, an enriched version of Crossref. Crossref, as I suppose you know, can be harvested from a public endpoint. What we have done in DOIBoost is basically to harvest a handful of other sources as well — namely the Microsoft Academic Graph, Unpaywall and ORCID — and inject that information back into the Crossref data, so that it is enriched with, for example, author IDs, links to open access PDFs, author affiliations, citations and so on. At the time being we have 85 million publication records, which, if you know the figures in Crossref, is a bit lower. This is because we discard certain records that do not match our minimal quality requirements — they don't have titles, they don't have authors, and in that case we trash them — because otherwise Crossref itself comes to something like 100 million publication records. Both datasets are pushed to Zenodo, so you can go there, look for either Scholexplorer or DOIBoost, and you will find dumps. We try to update them every six months, even though it's a
bit of a stretch, because the process is quite tricky — so it's more a promise than reality — but let's say that every six months we tend to release new updates. Still on the side of all the processing we do while constructing the graph: it doesn't end with mining, deduplication and user claims. We also have information propagation, which means that on certain subsets of the graph we use logic chains in order to propagate certain information from one entity to another. Very briefly, for example: if you have a publication that is related to a dataset — either because it reused the dataset or because it produced it — and you know that the publication is funded by a project, then you can say that the dataset produced by the publication is also funded by the project. It looks naive, but it actually improved recall a lot when you search for things and when you navigate through the portal. The same happens, for example, when you harvest a publication from a certain data service which you know is of interest to, or pertains to, a certain community: then you can say that the product is associated to the community as well, as if it had been produced by the community itself. And the same thing happens with countries: if you know that a certain organization resides in a certain country and participated in a certain project that funded a certain publication, then you can say that the publication is associated to the country as well. These are examples of the logic propagation we do for certain fields, certain pieces of information, around the entities of our graph. Here I just want to capture the difference in the figures. In OpenAIRE at the
moment there are two instances, two souls: one is the production one, accessible to everybody, and there is also the beta one. The main difference in the numbers is that the content access policies changed: we moved from a solely open access content policy to an open science content policy. This means the graph you see in production is much smaller, because the main premise was "let's collect open access material", while in the beta infrastructure we are moving to open science, which means we take anything from everywhere, because we want to have the best possible picture of the whole research landscape. As you can see from the numbers, we are harvesting from more than 10,000 data sources; we have 340 million records, and more are coming once BASE is integrated into our pipeline; we have roughly 12 million publication full texts and 960 million links between objects. We also count a number of liaisons. We have one with Microsoft Research, which is providing us the Microsoft Academic Graph — the agreement is finalized, so it is inside the loop, but it is not up to date at the moment; we are finalizing things in order to have monthly pushes of the Microsoft Academic Graph. We have ongoing liaisons and collaborations with Unpaywall and ORCID, we participate in an interest group of the Research Data Alliance on open science graphs for FAIR data, and we collaborate with a number of different projects, exchanging information, expertise and datasets back and forth. At the moment we are in an open consultation phase, which will run for some more time before the graph is promoted to production, which is due to happen in spring. If you go to this link here, beta.explore.openaire.eu, you will find a
button that leads to a Trello board, which is our main tool for gathering feedback from you. So if you have anything to suggest, please go there, comment, provide feedback, throw in ideas, and we will consider anything you write for the next release. This is how the board looks: there are different sections, different cards, and you can comment, add new cards and ideas, and so on. And that would be all. Thanks for listening — I hope I wasn't too long, it's been 40 minutes. If you have any questions, I'll do my best to reply.

OK, thank you, Andrea. There are two questions in the Q&A, so maybe we can start with that to get the conversation going. Just a sec — is everyone reading the Q&A? Best read it out loud, Andrea. OK. "Would it be possible to make it more transparent when data is harvested from different providers — what is used from the Microsoft Academic Graph, for example?" As for this question: the Microsoft Academic Graph takes part in the OpenAIRE pipeline when we build DOIBoost. There is an article deposited on Zenodo that describes in detail which pieces of information are merged into Crossref from Microsoft, ORCID and Unpaywall, so if you read that publication you will see plenty of details. Then, of course, once DOIBoost is constructed and fed into the OpenAIRE graph, all this information flows into the graph construction, so it induces ripple changes. The other question is: "Where can I find the provenance information and reliability indicators?" In a dump. For example, again, when building DOIBoost, when we say that a certain author has a certain identifier, we have a trust label and we
have a provenance label, both. So if you download the DOIBoost dump, you will see plenty of these labels for trust and provenance — where each piece of information comes from. If you download the OpenAIRE Research Graph dump, you will find this information also for links that have been inferred: at some point, if you see a link between, let's say, a publication and a project, you will see that it has been produced by a given algorithm and that the trust score for that information is, for example, 0.75. Again, it's something you see in the dump. OK — congrats, thank you — "Which technology is the graph deployed on? A graph DB, for example, a triple store? Have you flattened the graph onto a text search server for better search experiences?" I know the term "graph" can to some extent be misleading, because when I talk about graphs people tend to think about RDF, triple stores and that kind of thing. Actually, it is a graph in the sense that the web is a graph: the web is a graph of web pages pointing back and forth, and the OpenAIRE Research Graph is a graph of XMLs pointing back and forth. Every entity — every publication, every project — is an XML, and it contains links to other XMLs, so through identifiers you can move back and forward and explore the graph. There is no triple store, really. As part of OpenAIRE there is an effort about Linked Open Data, but that's a projection, not the entire graph. There was a task in OpenAIRE in the last years intended to transform the graph — our information space — and expose it as LOD, but this was not for the full information space; it wasn't mapping everything, just a subset. So at the moment what you can
find there is not 100% of the graph, but just a projection. OK: "Why do you have community as part of the graph and not researcher?" Very good question — also because I'm very interested in affiliations, and affiliations work primarily with researchers. The problem with researchers is a long story. At the beginning, when we started 10 years ago, we were harvesting mainly institutional repositories, and we couldn't get any sense from author IDs because the IDs simply weren't there. We only had a name and surname in the best case — mostly just a string for the authors. So modeling authors as a standalone entity wasn't very viable, and it hasn't been viable until recently. Now that we are feeding the Microsoft Academic Graph into the construction — best effort, I would say, because they are not solving all the problems, but they do some work towards that — they have authors as entities and try to assign IDs to authors, even though it is largely perfectible. For example, I was looking, and the approach changes with seniority: I have something like six different identifiers in Microsoft, because I changed affiliation a handful of times, while I noticed that senior people are reconciled better, so they tend to have just one identifier, whereas early career researchers can have more than one. So there are author identifiers in Microsoft, and we could at some point try to model authors thanks to that, but it's not something we are doing at the moment. We need to take a stance on that and see whether it will be possible in the forthcoming future. OK, Frank is asking: "Did I hear correctly — is the graph going to integrate BASE?" Yes, it will integrate
BASE. Even though I'm not taking an active part in that whole initiative, I know it's going to happen. I don't know in what terms, or what the deadline for this action is, but it's something I've been hearing about for the last couple of months. So yes, it should happen within the next year, within 2020. OK, I see the Q&A box remains empty for now. Thank you very much, Andrea, for this very useful presentation. As I said at the beginning, this is definitely not the end of the consultation session and of the work we are doing on the Research Graph, so what I would suggest is that you stay tuned and follow us on social media: you'll hear about any upcoming webinars and consultation sessions, and you'll all receive an email with a link to the recordings once they are available. If you have any additional questions, feel free to contact us at webinars@openaire.eu, or Andrea directly, whose email address is now on the screen. So thank you very much for attending, and I hope to see you soon in one of the next webinars. — Yeah, thank you. Thank you, Andrea.