So, welcome. Thanasis Vergoulis will introduce today's topic. Thanasis is the Development and Operations Director of OpenAIRE, and he is a computer scientist working at the Athena Research Center. Today he is presenting the OpenAIRE Graph: open research infrastructures for analysis. So, Thanasis, the floor is yours.

Thanks, Julia, for the introduction. I hope that you can hear me well. Okay, so let me also share my screen and go to presentation mode. Okay, I think that now everything is set up. Can you confirm that you can see my screen? Yes. Okay.

So, first of all, thank you for the invitation to talk about this today. As Julia mentioned, I have a role in OpenAIRE as the Development and Operations Director, but I am also a researcher in computer science at the Athena Research Center. Today I will try to present the ways that the OpenAIRE Graph can be useful for people who would like to analyze research data. And as you will see in my presentation, there are a lot of different types of users for these particular use cases. There are a lot of different use cases out there: not only researchers who would like to study a phenomenon related to research production, but also other stakeholders.

Okay, so I will first try to provide some background before going into the discussion regarding the Graph. I think that most people who are involved in research activities can agree that in recent years we are experiencing a significant increase in the volume of produced research output. And I am not talking only about publications, which, as we know, are the most widely known outputs of scientific research, but also about other types of contributions. Researchers are performing a wide variety of activities related to research, and all of these are producing different types of outputs.
Some indicative examples: data, which can be raw or processed and structured; research software that can be used by researchers to analyze data, get useful insights, or reveal some hidden knowledge; presentations that provide more information about particular research activities; and, of course, grey literature, which includes project deliverables, white papers, and other similar resources.

And this increase is indeed significant, for various reasons. The domain of scientific publishing has become significantly larger; the number of journals that are currently available is significantly larger than before. We have more research institutions and researchers out there working on different domains. We also have this culture, the notorious publish-or-perish culture, that applies a lot of pressure on researchers to publish more. The result is that we are experiencing a large increase in the volume of scientific output.

Of course, this creates some challenges on how to manage all these data and metadata in a way that someone could use this huge amount of information. But it is also an opportunity, as you can understand, for people to analyze this wide variety of data and get useful insights for informed decisions, or to just study interesting phenomena in the scientific publishing domain. But this was only in theory, because when someone tried to perform this kind of analysis in the past, the first impediment they would experience would be the paywalls. So, okay, researchers are producing a lot of different outputs, and useful knowledge is included there, but until recently most scholarly data and metadata were confined within the scientific publishers' data silos. Only those who were paying for access to these data and metadata could make use of the respective datasets.
This was a phenomenon that prohibited the exploitation of these data, and all the opportunities and the potential for analyzing them to get the insights that we mentioned. It was essentially slowing down the advancements in fields like scientometrics, which is a field that tries to analyze the outputs of research. It was also resulting in non-negligible costs and a waste of resources for research performing or research funding organizations and other stakeholders that were interested in similar types of analysis. And another important problem was that this situation was also hindering the transparency and reproducibility of research, which is one of the cornerstones of scientific integrity. This is considered especially important during an era where, as we said, the scientific output is becoming larger and larger.

But then something started to change, and we have been experiencing this change during the last years. In two words, the change that we are experiencing is open science. Open science and all the related initiatives derived from this movement, like, for instance, the Initiative for Open Citations or the Initiative for Open Abstracts, started to improve the situation. If you don't know about these two initiatives, the first advocates making citation metadata openly available to everyone, and the second is a similar initiative for the abstracts of research papers. So in the previous years we started experiencing a kind of change; there was interest and popularity around initiatives like these, which advocated making a lot of metadata publicly available so that everyone could use them for analysis purposes. And it was not long before we experienced the results of these initiatives and of the open science movement in general.
All these initiatives succeeded in creating a cultural change that enabled and catalyzed the creation of a lot of different open research resources or, as I have named them here, open science graphs: scholarly data sources that include useful information about research outputs and related entities like researchers, funding, organizations, and so on. You can see on the slide an indicative list of different open research data resources: OpenCitations, DataCite, The Lens, Microsoft Academic Graph, Dimensions, Semantic Scholar, OpenAlex, PubMed, and so on. And, of course, the OpenAIRE Graph, on which we are going to focus for the rest of the presentation, because it is one of the largest collections of open research data out there and the de facto graph that contains research information for EOSC.

Since all these resources became available, like the OpenAIRE Graph, containing very rich information that can be valuable for analyzing research data, it was a matter of time before people started thinking: okay, we now have all these resources. Should we use them? Should we try to use them instead of restricted sources like Web of Science or Scopus to perform our analyses, for instance scientometrics analyses? Especially in 2023, I think a lot happened in this direction and this discussion was very active. As an indicative example (there is a typo here, it is ISSI 2023, not 2024): at ISSI 2023, which was the International Conference of the International Society for Scientometrics and Informetrics, I had the pleasure of giving a joint presentation of a relevant tutorial with Andrea Mannocci from CNR about transitioning towards open scientometrics by using and exploiting open science graphs. This was a very interesting tutorial, having two parts.
The first was an introduction to the field of open science graphs, and the second was a hands-on session showcasing how someone could perform simple scientometrics tasks using the OpenAIRE Graph. If you are interested, you can find more information on this slide, and all the slides of this presentation are available on Zenodo. The important thing to mention is that this was just an indicative case of this discussion, about moving towards open research sources for scientometrics and other types of related analysis, which was happening throughout 2023. Of course, I am sure that you have heard pieces of this discussion in other places; you have heard about organizations stopping their subscriptions to Web of Science or similar services and advocating the usage of open sources instead.

In previous years this was not an option. Why? First of all, one related problem was the coverage of these sources. But recent studies have shown that the open resources are now very close to the closed ones in terms of coverage. Other possible issues are related to the quality of the data included in the open science graphs, but there are a lot of activities regarding that; I think that in the first community call it was mentioned by Paolo that, for instance, in OpenAIRE we have a line of work involving various activities through which we are trying to improve the quality of the data inside the Graph. In general, the situation in 2023 started to be very good, and even before that, which is why a lot of people started thinking of abandoning the closed sources in favor of using open ones like the OpenAIRE Graph.

And, to be honest, I would really like to speak about this because of the timing as well: yesterday, on April 16th, the Barcelona Declaration on Open Research Information was officially announced. This is an initiative focusing on this specific problem.
The problem that, at this point, we feel there is no significant reason not to use open research infrastructures and open research resources instead of closed ones. Maybe you have seen the various posts and tweets about it yesterday, because this was the official launch. This declaration has already been signed by a lot of organizations, and the research performing or research funding organizations that have signed it commit to some simple things: for instance, to make openness the default for research information and metadata; to help and support systems and services that are offering open research information; to support the sustainability of the respective infrastructures; and to support collective actions to accelerate the transition to openness of research information, leaving behind closed options like those that have been used in the past. If you are interested in this declaration, maybe also in signing it as an organization, you can follow the link on this slide to the website that was officially launched yesterday.

Here I have included the different highlights, if we consider the reasons why an organization would like to sign the Barcelona Declaration: because fair assessment of researchers and institutions requires transparency, which is why we should use open data instead of closed data; because key decision making requires inclusive data, and as you know the open research data sources are pretty inclusive, with mechanisms to collect as much content as possible; and, of course, because open science requires open research information. It is very important to have resources and infrastructures that offer open research information and metadata so that open science can happen in any case.
As I said, about 45 foundational signatories are already there, and most of them are research performing or research funding organizations. There are also 15 supporters, which are organizations providing data services or infrastructures that can be used in this mission, and, as you can imagine, OpenAIRE is one of them.

Okay, so based on all this, you can understand that there is a momentum and a cultural change here. This is the situation right now: if you talk with people in this domain, you will see that they all agree that, from now on, people should invest more in open research infrastructures and resources. One of them is the OpenAIRE Graph; and not just one of them, as I explained, but one of the largest and most inclusive, and the de facto research metadata graph that is used by EOSC.

So what is the OpenAIRE Graph? I am sure that those of you who have participated in the previous community calls already know the main contents that you can find in the Graph. The most important entity covered is that of the research products, and when we talk about research products we are talking about publications, datasets, and software, but also other types of contributions like those I presented in the first slide. You can find a lot of such entities represented in the Graph: more than 175 million publications right now, more than 59 million datasets, and more than 360K software packages that are related to research.
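As a small aside, here is a minimal sketch of how one might look such records up programmatically through the public Search API that is mentioned later in this talk. This only composes a request URL and sends nothing over the network; the endpoint, the `doi` parameter, and the example DOI are assumptions for illustration and should be checked against the current API documentation.

```python
# Minimal sketch: composing a request URL for the OpenAIRE Search API.
# The base endpoint and parameter names below are assumptions and should
# be verified against the current OpenAIRE API documentation.
from urllib.parse import urlencode

BASE = "https://api.openaire.eu/search"

def build_search_url(entity: str, **params: str) -> str:
    """Compose a Search API URL for an entity type such as
    'publications', 'datasets', or 'software'."""
    query = urlencode({"format": "json", **params})
    return f"{BASE}/{entity}?{query}"

# Example: look up a publication by a placeholder DOI (no request is sent).
url = build_search_url("publications", doi="10.5281/zenodo.1234567")
print(url)
# → https://api.openaire.eu/search/publications?format=json&doi=10.5281%2Fzenodo.1234567
```

In practice one would pass such a URL to an HTTP client and parse the JSON response; keeping the URL construction in a small helper like this makes it easy to switch entity types (publications, datasets, software) when exploring the Graph.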
You can find useful metadata for them, like persistent identifiers, citations, fields of science, connections to Sustainable Development Goals, and a lot of other things. But you can also find connections of the research products to funds, grants, and projects; we have more than 3 million of them in the Graph already. There are people, so, based on the authorship relation, we have the connection of researchers to the research products and other entities. There are the data sources that someone could use to get this type of metadata. There are organizations, for which we also have PIDs, like the ROR identifiers, and other metadata, like the countries they come from. And there are communities: groups of researchers, like people participating in a research infrastructure or in a particular project or domain, can create their own community and connect research products and grants to it.

Okay, so this is, more or less, what you can find in the OpenAIRE Graph. Why does the OpenAIRE Graph matter? In any case, it matters because it is, or tries to be, an open and global map of science. And what do we mean with that?
We mean that the OpenAIRE Graph tries to cover not only publications, as most of the other initiatives out there do, but many types of research outputs and activities that are currently overlooked, like datasets, software, projects, maybe peer reviews, project deliverables, and other things. The OpenAIRE Graph is housing and meticulously representing all these types of contributions and activities, something that you cannot easily find in another data source.

Also, the OpenAIRE Graph is inclusive and tries to contain content from different disciplines. It is not focused on one particular discipline, as, for instance, PubMed is; it tries to cover all disciplines, but also all languages. In some disciplines this is very important, because a lot of the research work that people are doing is published in languages other than English. We have a lot of text in Spanish, in French, and in similar languages, for particular domains where researchers are used to publishing in their mother tongue.

In the OpenAIRE Graph you can also find open science material from institutional repositories and similar sources, like, for instance, open access journals, that you cannot find elsewhere. These resources are available because OpenAIRE has on board a lot of different open-science-related data sources, and it is important to know that, because of this, a lot of the content that you can find in the Graph cannot be found elsewhere.

Finally, it is completely open and transparent. It is based and built on open data, it uses open sources, and its production is completely transparent: you can find details about how every piece of information is produced on the documentation site of the Graph and in the respective publications that we publish from time to time. Another important thing that makes the Graph unique is that it tries to follow the community standards for interoperability, so a lot of effort is given so that we
can make it easier for people to consume data from our Graph and also combine them with data from other similar graphs or, for example, from domain-specific knowledge bases. There is an ongoing effort regarding that to support the SKG-IF model, which is being produced by a working group of the RDA, the Research Data Alliance, as well as other relevant specifications.

Now, something that I would like to clarify, because in many cases, when I am talking with people who are not very familiar with the contents of the OpenAIRE Graph, they have this misconception in mind: although OpenAIRE's main mission is to promote open science and, of course, make the respective content easily accessible, this does not mean that the OpenAIRE Graph contains only open research. As you can see on the right of this slide, although the Graph focuses on making sure that the open science sources are considered and included, it also incorporates data and metadata from other sources, like, for instance, Crossref, Microsoft Academic, DataCite, etc., which do not cover only open science and open research. So in the OpenAIRE Graph you can also find research that is closed or restricted access; the OpenAIRE Graph collects metadata for everything, and this is something important that people sometimes do not have in mind. For someone who would like to perform any kind of analysis, that analysis will not be restricted to open science content only.

And, on top of that, OpenAIRE provides some important added value. You can see on this slide the variety of different sources that are used for the production of the Graph and, as you can assume, a lot of these provide content that is overlapping and has redundancies. OpenAIRE collects everything from all these sources and performs the very difficult task of aggregating and deduplicating the respective information. This is very
important, as we will also see later in this presentation. This deduplication and aggregation also happens in a way that makes sure that the provenance of information remains, so that someone could delve into the details, see from which source each piece of information was produced and, if needed, even disregard the part of the information coming from a source that they do not trust that much. So deduplication and provenance are the norm in the OpenAIRE Graph, and this is an important added value.

Moving back to the main subject of today's presentation, which is the analysis that someone could do with this Graph, and more specifically focusing on scientific metrics: what can someone find in the OpenAIRE Graph? They can find basic metadata for research products, of course: the title, the publication date, the venue, and all those things that are important, but also things like the access rights of the respective research product. Is it open? Is it under embargo? All these are included in the respective metadata that we keep for research products.

Also, someone can find citations that come from multiple sources: from Crossref, from OpenCitations, from Microsoft Academic Graph (the last version of it, before it was discontinued), etc. There are also other citations that are extracted by text mining algorithms that OpenAIRE runs directly on open access publications. And the important thing about citations is that they are deduplicated. Because OpenAIRE has a deduplication algorithm that makes sure that multiple versions of the same product are grouped together into one entry, we are also able to produce a unique set of citations, removing any duplicates. When someone is using citations, for instance as a proxy for scientific impact, which is a very common use, there is a problem if you do not do the deduplication: for the same publication that exists in multiple versions, for instance a preprint and a version finally published in the
journal, each citation to a particular other publication will count twice or multiple times. In OpenAIRE we deduplicate both the articles and the citations, and this is an important improvement, making sure that citations from multiple versions of the same paper count only once. I have also included here a blog post from Ludo Waltman, whom I am sure many of you know, about the particularities that preprints, more or less the multiple versions of papers, introduce to bibliographic databases. If you follow the link you will see that there are some important features that bibliographic databases should have to avoid some well-known problems, and you will see that, among the sources covered in this particular blog post, like, I think, Web of Science and various others, no one offers this deduplication of citations, although the article identified it as an important condition. OpenAIRE provides this, and I think this is very important.

Now, moving away from citations, the OpenAIRE Graph also contains some pre-cooked, let's call them, indicators. Each entry of a research product has a set of calculated citation-based impact indicators, like citation count, influence, and popularity. We get these scores from a database that calculates this type of indicators, which is called BIP! Services. Then, of course, we also have indicators of the usage of the research product in terms of downloads and views; there is a relevant service of OpenAIRE called Usage Counts that calculates this type of indicators for the various research products, and these indicators are also incorporated into the Graph. Someone who downloads the Graph or uses the APIs has direct access to all these pre-cooked indicators without the need to calculate anything; everything is ready. There are also very
interesting connections to the fields of science (FOS) of the publications, but also to the Sustainable Development Goals: if a publication is related to one of the SDGs, the OpenAIRE Graph provides this type of connection. There are also connections to research funding and specific grants, so which was the project that funded a particular publication, but also connections to affiliated organizations and even countries. All these are very useful pieces of information for someone who is analyzing a research domain, for instance, or trying to perform a scientometrics analysis.

And speaking of that, here is an indicative set of examples, but of course you can imagine others. Someone could use these data to perform longitudinal studies on research production; to perform domain analysis for a particular scientific domain, analyzing a particular field of interest to see how many publications or datasets are related to this field; to perform citation-based impact analysis for particular grants or organizations; to produce institutional reports, like the annual reports that research performing organizations produce each year; to perform trend identification for different topics, for emerging topics for instance; or to monitor the open science uptake. As I said, these are indicative examples; based on this variety of data, you can also find other ways to exploit them.

How could someone access the Graph? There are two main ways. The first one is downloading our full dataset; I have included the links on this slide to help you find all the relevant information. The dataset is pretty large, so to analyze it you will need access to a very powerful machine or cluster, but all the data that we provide in this full dataset are open. We also have some APIs, and the main API that we provide is the Search API, which provides some nice ways to download useful
information and content from the Graph. Keep in mind that we are currently in the process of updating this Search API, the main API of the Graph, to make it easier to use, to simplify, for instance, the responses, and to extend it to cover missing pieces of information, based on the experience that we had with some of the users we have approached in the past.

For someone who would like to get the data of the Graph and start working on them, there is a wide range of supporting material to help them become familiar with this resource. I am sure that in the previous community call you have heard Miriam presenting the beginner's kit of the OpenAIRE Graph, which is a smaller version of the full dataset that can be more easily handled on a local machine; someone could use this to become more familiar with the Graph data model. There is a full documentation website that you can search, even using keyword-based search, to find the information you would like to know more about. And, last but not least, we have a user forum, and our intention is to build a community, to increase the number of people interested in the technical developments of the Graph; if you have any questions or suggestions, you have a place where you can go and provide them. So, in any case, keep this slide; it contains a lot of useful pointers for people who are trying to use the Graph for the first time.

If all this sounds interesting, feel free to perform local experiments on your computer based on the beginner's kit. You can find more information about the beginner's kit in the previous Graph community call that I mentioned, where Miriam presented everything in a lot of detail; you can find the recording and the respective slides online, and I have included pointers to them in my slides for your convenience. In that particular community call you will
also see various examples of how you can write some queries in a notebook and perform an analysis on the Graph data using the beginner's kit. Miriam presented this in a lot of detail, so it is a very good starting point for you to see different examples of interesting queries on the Graph that can be used as they are, or as variations, in particular types of analysis. Something final to say: the APIs are good for someone who would like to get focused information about a couple of publications or things like that, but for someone who would like to perform a full analysis, using the full dataset would be preferable.

Before closing this presentation, in which I tried to provide different perspectives on how the Graph can be useful, to help you identify use cases that you could test, and before giving the mic to you for discussion, I have included here two indicative applications of the OpenAIRE Graph that are already providing useful services to researchers and other stakeholders. The first one is, of course, OpenAIRE MONITOR. I am sure that most of you already know this service. The MONITOR uses all these data that we mentioned from the Graph to produce even more indicators, graphs, and visualizations that can provide useful insights for research performing organizations, for funders, and for initiatives like, for instance, research infrastructures: insights about the production that they have, their uptake of open science practices, the citation-based impact of their work, and other indicators as well. Similarly, there is another application, again based on another OpenAIRE service, which is called the Open Science Observatory. Focusing on Europe, there you can monitor different facts about the scientific production of the organizations that come from a particular country, so this
is also calculated and supported based on the Graph. These are two of the OpenAIRE services; all of them have the OpenAIRE Graph under the hood. For instance, there is also OpenAIRE EXPLORE, which focuses on helping people in scientific discovery. So these are real-life applications of the Graph being used in practice. And with that being said, I would like to thank you for your interest so far. I will be open for discussion about anything that you would like to ask, but if you think of something at a later point, you can also contact me via my email or via my social media. Thank you again.

Thank you, Thanasis. We already have some questions. Robert is asking: is it, or will it be, possible to keep in sync with the full Graph, for example through a last-modified-date call on the API for regular snapshots, as is possible with Crossref?

Okay, so currently the Graph data are updated more or less once a month; once each month, more or less, we have a new version of the Graph that is produced based on updated information that we get from the input sources. The full Graph dump, the full dataset, is released openly once every six months, so we do not provide the full dataset every time we calculate a new version of the Graph. However, first of all, every new version is accessible via the OpenAIRE services and the APIs, so someone who is interested in getting the most recent information can use the APIs. And for people who would like to use the full datasets more frequently, OpenAIRE, as far as I know, is open to providing these datasets more frequently in a service mode, after a particular agreement; I think this is something that is possible to happen.

Now, the second part of this question is related to the fact that, every time we publish the dataset, there is no easy way for someone to identify the differences. Currently, for technical reasons, we are not supporting diffs. Also, our mission, let's say, is to try to provide at each point a good representation of the information that is available out there, not to keep the whole history of the different objects of the different data sources, etc. So we keep the different versions of the Graph out there, but we are not providing a diff from version to version right now, and I am not sure that something like this will be supported very soon. Of course, if we have a lot of such requests, we can discuss it again, maybe inside the forum, and see if we can prioritize this as a possible change in the future. I hope that I have answered every aspect of this question.

Thanks, Thanasis. Yes, it is also confirmed in the chat that you answered. We have another question, coming from Ivo: do you also have field-normalized citation indicators, like the FWCI of Clarivate or Elsevier?

This is a good question, and this is something that we have in our plans. Currently we are not offering, for instance, the FWCI, which is the Field-Weighted Citation Impact. It is possible to provide this type of indicators if there is an interest in them, and I understand why someone would like to use such an indicator, but, at least for the time being, what we have selected is to provide something similar to that, not exactly the same. We are using the different fields, and we are calculating the percentile into which each publication falls for a particular indicator. Let's say for citation count: we provide a class together with the indicator, which says that, okay, this publication, based on citation count, is in the top 1%, or the top 10%, or something like that. I do not remember if we have included this in the Graph already, but at least in the source that we are using for this type of indicators, the BIP! database, a field-based percentile is included, and maybe also in the Graph. This means that, although we are not providing the respective number normalized by the size of the
respective domain we are providing an indication so that someone could understand could get an insight about what does this number mean inside the respective domain based on the percentile that you can find for this article now again I should check again if we include this type of field based classes inside the graph they are sure part of the BPDB but even if they are not currently part of the graph we could easily arrange to include them in a next release and of course as I said also calculating a field weighted citation index is something that is not very difficult for us since we are already doing similar type of analysis on the whole citation network it is easy relatively easy to normalize this course based on the respective field so if we have a lot of interest for that this is something that we can provide during the next period and again before closing this answer I would like to mention again the importance of the forum so if you have requests like this and you feel like such addition would be really nice to have you have a place to write about them we are there to hear you and to prioritize for example the development of the graph based on what the community seems to need Thank you Tenasis we have other three questions in the chat can openair be considered as an alternative to web of science for searching research metrics this is the first one let me answer that yes also this is the whole point behind the Barcelona declaration that you can use open resources like the openair as alternatives of course we have done a lot of discussion with people that are experts in the field and there are various things to consider web of science in terms of coverage I think that openair is very close to the respective closed sources in some cases you can also find things that you cannot find there based on different types of contributions for instance or based on grey literature that is not included there but what sometimes is important for people that are performing 
analysis like syndometrics is to exclude for instance to make sure that the data that they are using are not for instance including publications from predatory publishers and things like that so one point is that one difference that the openair graph and other open initiatives not only the openair graph all the open initiatives do not have a policy on what is being included in terms of publishers so in theory in web of science someone could take also into take to have the benefit to know that whatever was included in the web of science was from a publisher that at least had a particular set of guarantees given to web of science now this has a lot of discussion and this is an ongoing discussion relevant to initiatives like the Barcelona declaration the fact that you have a gatekeeper that is a company to me is something counter-intuitive for something that should be a community driven approach we are discussing about how we can have something similar based on the community of researchers that there is out there but also since everything that we are providing is transparent and someone could see the different sources it is very easy very easy it is possible for people that are performing the respective analysis to apply some filters before doing their stuff and actually this is the way that this thing worked in the past for different types of use cases maybe you have heard that there is the Leiden matrix the Leiden rankings out there for universities and they have recently released a version that is not using closed sources but open ones and to do so they had to perform of course a first filtering step to make sure that based on the requirements that they have for their analysis that can be different based on the use case and only the researcher, the analyst knows about what to use they could filter out some content that was not relevant for them or that they could not trust this much so to summarize the answer in general yes the opener graph is a very rich provider 
of scholarly information they contain a lot of information it is transparently produced and it provides provenance so you can do any gatekeeping you want when you are using it as an input to your analysis in terms of coverage also it is very close to the closed sources it provides a lot of things pre-computed like Scopus or Web of Science is doing as well and the default version is free which is also important the full graph and the APIs are free of use and you can have access to this information without paying something and of course you can find inside there a lot of stuff that you cannot find Web of Science because it is focused on publications and also publications based on journals for some domains this is a problem because for instance from this computer science domain that I am coming from a lot of people are publishing original research peer-reviewed original research in conferences this is the publishing system that we have and in many cases such resources you cannot find this we need to wrap up the community call I didn't know we had time limit yes yes it is one hour time so we will have we will answer the remaining questions that are here in the chat in our note that was published at the beginning of the chat but you can also find the information in the open-air website you go to support community calls and you search for the open-air community call of the graph here you will find the recording of the previous calls and the notes including also the questions that are here in any case you can use the open-air graph user forum that was posted here in the chat as well to ask more questions and have our answer so the next community call will be the one of our monitor next week and we will speak about the specific indicators that we have in the open-air graph and in the open-air dashboard so see you for all the people that are looking for the open-air graph specific calls next month or next week if you are more interested in the indicators thank you so much 
and have a nice day, bye thanks
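The within-field percentile classes described in the FWCI answer above (ranking each publication against its own domain and reporting a class such as "top 1%" or "top 10%") can be sketched as follows. The field names, citation counts, thresholds, and class labels here are illustrative assumptions, not the actual scheme used by the Graph or the BIP! DB.

```python
# Sketch: assign within-field percentile classes to publications by
# citation count. Fields, thresholds, and labels are illustrative only.

def percentile_classes(citations_by_field):
    """citations_by_field: {field: {pub_id: citation_count}}.
    Returns {pub_id: class_label}, ranking each publication only
    against the other publications in its own field."""
    classes = {}
    for field, pubs in citations_by_field.items():
        ranked = sorted(pubs, key=pubs.get, reverse=True)
        n = len(ranked)
        for rank, pid in enumerate(ranked):
            # Fraction of the field at or above this publication's rank.
            top_share = (rank + 1) / n
            if top_share <= 0.01:
                classes[pid] = "top 1%"
            elif top_share <= 0.10:
                classes[pid] = "top 10%"
            else:
                classes[pid] = "rest"
    return classes

demo = {
    "computer science": {
        "cs0": 250, "cs1": 40, "cs2": 12, "cs3": 5, "cs4": 3,
        "cs5": 2, "cs6": 1, "cs7": 1, "cs8": 0, "cs9": 0,
    },
}
print(percentile_classes(demo))
```

The point of the design is that the same citation count can land in different classes depending on the field it is compared against, which is the intuition behind field normalization without computing a full FWCI.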