All right, hi, good morning everyone. We're really excited to have you joining us for the third OpenAIRE Graph community call. Today we're going over the beginner's kit and how it can be used for data analysis. But first we'll start with a brief introduction of the new OpenAIRE Graph user forum by Stefania; you can find the link in the chat, which I'm sending now if you'd like to follow along or sign up.

We kindly ask you to keep your microphones muted during the presentation to avoid background noise. At the end we'll have about 25 minutes of discussion where we'll give you the floor to ask questions. You can write all your questions in the Google Doc, which can also be found in the chat; any questions we don't get to today we'll answer in the document, so we make sure we get to everything.

Following today's community call there's the informational webinar on the recently launched Irish National Open Access Monitor. This is a huge project using data from the OpenAIRE Graph to support the nation's shift towards a 100% open access landscape. As it starts at 12 CET, we must end five minutes early to allow those interested to head over. If you'd like to join, you can still register via the link in the chat — it's a really big project with OpenAIRE in Ireland. With that said, let's get started. I'll pass the floor now to Stefania, who is going to present the new OpenAIRE Graph user forum.

Great, thank you. Let me share the screen. Okay, hopefully everything is good on your side. So yes, we set up a user forum to have a place for users to take an active role in improving the OpenAIRE Graph. This is a space for sharing knowledge, skills and interests through an ongoing conversation. So let's have a look together. You can access it at openaire.flarum.cloud, and this is the homepage. As you can see, we've created different categories and subcategories; this organization is meant to keep the forum tidy and make it easier for everyone to navigate the discussions.

The goal here is to have conversations on the aspects of the graph that are significant to you, so we encourage you to share your point of view or express your needs based on your role. For example, one great opportunity is to continue discussing topics that come up during our community calls. Usually, when there's a follow-up, it's a one-on-one interaction with an OpenAIRE staff member; using this forum, on the other hand, the entire community can participate, and we can keep track of the points raised and follow up on them.

This is a public forum, so discussions are open for viewing, but of course, if you wish to contribute, you have to create an account linked to an email address and sign up. And how can you contribute? First of all, you can browse the different categories. For example, today we're going to talk about the beginner's kit, and the beginner's kit is under Graph Datasets.
If you follow the link, you can see the discussions already there that relate to this topic, and you can follow them. For example, before the call we posted this message about what the beginner's kit is — we will find out very soon with Miriam — and we asked whether you have already made some tests; the beginner's kit is for testing the graph and your data analysis code on a smaller dataset. If you're interested in following this topic, there's a Follow button you can choose, which means that every time there's a new reply to the topic, you'll receive an email and you won't miss anything. You can reply directly here, or if you want to start a new discussion, you can simply start one and it will appear in this category.

Going back to the homepage: if you're unsure where your idea fits, you can simply start a new discussion. You'll be asked to choose a tag, but this is not mandatory — you can bypass the tag requirement, press OK and move on with your post, and the moderators will have a look at it and put it in the right category, creating a category if needed. So that's a very quick tour of the forum. One important note: when you post a message, it doesn't go live straight away. There's moderation by one of our members, which helps keep the forum tidy and all messages in the correct category. If you need personalized assistance on a specific technical issue, the forum is not the place to go — we have a dedicated helpdesk for that. So join the forum to ask questions the community can help answer, discuss the graph's functionalities, share your experience and give feedback. All of this will be taken into account and will help us a lot to enhance the graph and shape its future. And that's it — I'll let you play with it after the call, because now we have Miriam's presentation. I'll stop sharing.

Okay, great. Thanks, Stefania. So again, we invite you all to sign up and get the forum rolling with your questions, tips and feedback. And with that, we'll head over to our presentation. Guiding us through the beginner's kit today is Miriam Baglioni, OpenAIRE Graph data curator, scientist and engineer. I'll now pass the floor to her.

Thank you, Alain. Good morning everyone. I will share my screen. Today we will see a set of examples that we hope will help you start working on the OpenAIRE dataset to answer your research questions. A brief outline: we are going to see example queries on the OpenAIRE beginner's kit. We will see why we decided to provide this kit and how we build it out of the graph; then how to get it and run it; then some examples directly in the Jupyter notebook that we use to analyze the data; and finally what OpenAIRE is planning next for the kit, with space for discussion.

Last time, when Claudio presented at these community calls, he showed the very first part of the pipeline that OpenAIRE uses to build the graph: the collection of the data. Today we are at the opposite end of the pipeline, when everything is done and the graph is ready to be provided by OpenAIRE to the users. But this graph is really, really big: we have a lot of research products, a lot of projects, a lot of organizations, a lot of entities, and many, many relationships.
So it can be very difficult for a user to get the data and analyze it, and that is why OpenAIRE thought to develop the beginner's kit. The beginner's kit is a small subset of the graph that can be used on regular computers. This is the main reason we decided to provide it: it can be downloaded and run on a personal laptop, so it is easier to explore your research questions using the OpenAIRE Graph — for example, which organizations contributed to which projects, and so on. We think of it as a training ground, because it has the exact same model as the complete dataset. Even though it is a smaller subset, you can start understanding how the entities are modeled, how they are connected, and which semantics can be exploited to link, for example, results and projects, or projects and organizations. For this, the documentation linked from the notebook is also very useful. This training ground becomes very valuable when a user decides to delve into the full graph. It is also a safe space to test and refine your code: you can experiment on the same data model without paying a price for mistakes, try as many queries as you want, get things wrong and refine them, so that when you move to the complete graph you maximize your success rate.

So what is in the beginner's kit? For this release we selected the results — publications, datasets, software and other research products — that were published between 30 June 2023 and 29 February 2024. Everything starts from this selection: once we have defined this subset of results, we follow the direct links to the other entities, that is, organizations, data sources, communities, projects and so on. All the entities linked via a direct link to at least one result are selected to be part of the beginner's kit. Then this set of relations is expanded by including all the relationships between these entities: the relations between the organizations in the set, between the organizations and the projects in the set, between the results, and so forth.

As for the numbers, here is the list of entity counts compared to what we find in the full graph for the same period. We have enough entities to make the analysis meaningful, but we do not have, for example, the 173 million publications that are in the graph: we have almost four million publications in the dataset, 8K datasets against the 60 million in the graph, and so on. All the entity types are represented, with all the relationships between them, but at a smaller volume. This is also to stress that the beginner's kit is meant to be an example of how the OpenAIRE Graph can be used for analysis; it is not meant for analyses over the entire graph. Just to give some numbers describing the laptop the beginner's kit runs on: I have a Macintosh with a processor of a couple of gigahertz, and so on — these are the specifications of my laptop, so it is a regular computer.
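To make the selection step concrete, here is a minimal sketch of the idea — not OpenAIRE's actual extraction pipeline — using PySpark to keep only results published in the kit's date window. The file path and the publicationdate field name are assumptions for illustration; the model documentation gives the exact schema.

```python
# Illustrative sketch only, not the real OpenAIRE extraction pipeline.
# Assumptions: publications are JSON files under data/publication and
# each record carries a string field "publicationdate" (YYYY-MM-DD).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subset-sketch").getOrCreate()

publications = spark.read.json("data/publication")

# Keep only the results published inside the kit's window.
window = publications.filter(
    (publications.publicationdate >= "2023-06-30")
    & (publications.publicationdate <= "2024-02-29")
)
print(window.count())
```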
So how can we get the beginner's kit? We have it deposited as a release on Zenodo, where you can download the zip, or you can go directly to GitHub and clone the repository. On GitHub you will see the README, which explains the requirements for running the kit: all you need to do is install the Docker engine and check that the disk space requirements are met. We decided to use Docker because this way everything needed to run the notebook is already bundled; the user doesn't need to install anything other than the Docker engine. There are instructions that guide you through all the steps. If you get the kit from GitHub, you clone the repository; if you get it from Zenodo, you just download it and unzip it. Then you go into the folder containing the unzipped repository and run the command that builds an image from the Dockerfile — meaning that everything needed is collected and made available in the image. When the image is built, you can run the container, which creates an instance of a virtual machine containing everything you need; you can assign it a name, and you have to pass it the port it should connect to.

You can also do this using the Docker UI. There you see the images that have been built; you click the action to run one, and you'll be asked for some configuration, namely the port number and an optional name for the container. You also get a view of the running containers, and you can stop and start a container by clicking the arrow. If you stop it, everything you've done — including the data you have downloaded — remains in the container. But if you delete it with the trash icon, you lose everything: if you run it again, you won't find anything you had done up to that moment.
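Putting those two steps together, the typical invocation looks something like the following; the image tag beginners-kit and the 8888 port mapping are assumptions for illustration — the repository's README gives the exact commands.

```bash
# Build an image from the Dockerfile in the unzipped repository folder,
# then run a container from it, giving it a name and mapping the port
# on which Jupyter is served. Tag and port here are illustrative.
docker build -t beginners-kit .
docker run --name beginners-kit -p 8888:8888 beginners-kit
```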
Now let's look a bit at what we have in the kit. This is what you see when you open the link to the JupyterLab. You would normally see the notebook with its paragraphs not yet run; I ran it already, because you never know what can happen during a live demo, so that everything is displayed. The first part of the kit is the import of the data. Here we have written the URL of the Zenodo deposition of the latest beginner's kit dataset. You can download a different deposition by just changing the number here, but you have to make sure the model of that dataset is compliant with the one used in the kit. Running this paragraph automatically downloads all the files contained in the deposition, untars them and puts them where the notebook expects them to be, so you don't need to write any code to do this — the notebook does it for you. While it is downloading, it shows what it is downloading and how far along it is with the various files.

Then there is the part of the notebook that imports the needed libraries. You might ask: why do we need the Spark engine to analyze the data — why not just use pandas? Because it is not doable with pandas, due to the memory requirements. You can load one part of the dataset with pandas and see some of the information it contains, in this case publications. But if you try to load all of the publications, you will get an error and will need to restart the kernel, so I haven't done that; if you want to try afterwards, you can.

Spark can help us analyze the data. First we create a session with this command, and then we define the schema of the results and of the other entities — that is, how the entities are modeled in OpenAIRE. This step is not strictly required: if you change which deposition of the beginner's kit you download, you can skip this part, but then you have to comment out the schema argument when loading the data. Providing the schema makes the data load faster, but it is not mandatory.

Once we have read the data, we can analyze it in a very friendly way by creating temporary views. Spark lets us treat them much like real SQL tables: we write SQL and Spark executes it. With the createOrReplaceTempView instruction, the publications we have read will be available under that name when we use it within an SQL query. Just to see some numbers, we get the entity counts by writing publications.count() and so on, for the publications, the datasets and the other entities. You don't have to write Python: you can express the same instruction in SQL and display the result nicely using pandas. You could display it with show(), but it is not as nice to look at, so we use pandas.

The next paragraphs show an example publication, and with it the model. We have the author, the bestaccessright, the container — which is the journal. Within author we have a list; each author has a name, a rank, a surname and, if present, the ORCID. In the bestaccessright for this result — in this case, closed — we have the code and the scheme. For the container we have the name of the journal, the EISSN and so on. So we can see how a publication is structured and which information we can find. We can do the same for the data sources, and every one of the entities has a link to the documentation, where you can see the exact model; when writing your own queries it is very, very useful to always refer to the documentation. We also have a look at the organization: you can show all the PIDs that we have for an organization — this one has several identifiers, and so on and so forth. So that is a very general view of each of the entities that we have in the graph.

Now we can start to analyze the data. We called these exercises, but they are really very common research questions.
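As a concrete sketch of that pattern — with view names and file paths that are my assumptions, not necessarily the notebook's — registering a temporary view and counting entities looks like this:

```python
# Minimal sketch of the temp-view pattern; continues the SparkSession
# from the earlier sketch. Paths and view names are illustrative.
publications = spark.read.json("data/publication")
publications.createOrReplaceTempView("publications")

# The same count, once through the DataFrame API and once through SQL.
print(publications.count())
spark.sql("SELECT COUNT(*) AS n FROM publications").toPandas()
```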
For example: which relationships are in the dump, grouped by their semantics? This one is quite simple, because the relation has a flat schema: we have the source, which is the identifier of the entity; the sourceType, which specifies which type of entity it is; and the semantics, which is the name within the relType element. So to execute this query we just group by the relType name and count over the relations table. You can see that isProvidedBy is the most used relation, then isHostedBy — which links results and data sources — then Cites, and so on and so forth.

The next one is a query that shows information about publishing venues. As I said before, the publishing venue is inside the container element, which contains the ISSN, the EISSN and the name. Just to see what's inside, we select this information directly from the publications, because we do not have a venue entity for now — but we will in the near future. So you can see the journals, with the printed ISSN and so on.

This one is a bit more interesting: count and sort publications by citations. We need to join the publications with the relations, because we want to count the number of citations per publication. We select the relations whose relType name is IsCitedBy and, since in this case the relation is between result and result, we select the ID of the publication that is cited; then we group by the publication ID and its PID values and, as we did before, count all the publications that cite it and order by that number. We see that the most cited publication is this one, with this DOI and this PMID; it has three identifiers because it is a dedup record, merging records collected from sources that can provide that kind of PID — for example, Crossref and PubMed. Since the relationships are so far bidirectional, but we plan to expose them in just one direction, we can perform the same query using the other relation, Cites, by changing the element on which the join is done: instead of the relation source, we use the relation target for the join, and we get the same result as before.

If you are interested in the trends in research, you can see which subjects are most present, in this case in the publications. The subject in the OpenAIRE model is an array of subjects, and the actual value is in the element within the array, subject.value. First we define a view, called terms, containing one row per subject per publication — not just one row per publication. The explode operation repeats each row as many times as there are values in the array. Then, as before, we group by term, count, and order by the count. We see that general medicine is the most popular topic, then humans, then electrical and electronic engineering, and so on and so forth. We just show this, but you can think of expanding these kinds of queries by exploring how the terms split by organization or by country — by following other relations, of course.
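Here is a sketch of that explode pattern, assuming — as in the earlier sketches — that the view is called publications and that each element of the subject array exposes subject.value; the exact nesting is documented in the model reference.

```python
# Explode the subject array so each publication/subject pair becomes a
# row, then count term frequencies. Field paths are assumptions.
spark.sql("""
    WITH terms AS (
        SELECT explode(subject) AS s
        FROM publications
    )
    SELECT s.subject.value AS term, COUNT(*) AS n
    FROM terms
    GROUP BY s.subject.value
    ORDER BY n DESC
    LIMIT 10
""").toPandas()
```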
We can make a similar query showing the journals with the highest number of publications. The journal is in the container element, and we collect it from the publications table: for every publication we extract the journal name, then we group by name and count, so we have the number of publications per journal and we can display them. We see that Scientific Reports is the one with the most publications, then the International Journal of Molecular Sciences, PLOS ONE, and so on and so forth.

This one is a bit more complex: the number of projects per organization. The organization can have an empty legal short name, so we use COALESCE, which takes the first value that is not null and displays it. What do we need to do? We join the organizations with the relations, selecting the isParticipant semantics: we know that the source side of the relation is the organization and the target side is the project, because between organization and project the only allowed semantics is isParticipant. From this we have the set of organizations and the set of related projects; we select the legal short name, or else the legal name, as the name of the organization, group by name and count as before. We see that the University of Cambridge has the highest number of projects, then the University of Oxford, and so on and so forth.

We can do the same on the project side: searching for the projects with the highest number of associated results. For some funders, OpenAIRE uses the "unidentified" project: we may have a link to the funder without knowing exactly which project it should be linked to, so the unidentified project still gives us a link between the publication and the funder, even if not directly to a real project. To find the projects with the highest number of associated results, we join the projects with the relations and select the "produces" semantics — a project produces results, so everything on the target side is a result. We select the funder short name, the code and the title, and by grouping by these three values we can see how many results are associated with each project. We can also do it by manipulating strings, concatenating the values to have just one column instead of three — just to show that it is possible to manipulate strings on the fly.

This one is again about subjects, but this time the subjects come from a controlled vocabulary: we don't want the scheme of the subject to be "keyword", so we remove the subjects with scheme keyword. As before we explode the subjects, but this time we join the table we created with itself, because we want the subjects that co-occur in a publication; the join element is the publication ID, and we get pairs of subjects. For example, "business" and "business, industry" often go together in publications, as do "business" and "medicine, medical specialty".

This one shows the number of research products per organization. As we did before, we join the organizations with the results, exploiting the isAuthorInstitutionOf semantics of the relation. We again use COALESCE on the legal short name and legal name to get the first string that is not null, and we count per organization. The top organization is the one with the most results, followed by the University of Cambridge.
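The organization-side join just described, sketched with my assumed view and column names (organizations, relations, legalshortname, legalname, and a nested reltype.name); the notebook's actual names may differ.

```python
# Join organizations to relations on the source id, keep only the
# isAuthorInstitutionOf semantics, and count results per organization,
# preferring the short name via COALESCE. All names are assumptions.
organizations = spark.read.json("data/organization")
organizations.createOrReplaceTempView("organizations")
relations = spark.read.json("data/relation")
relations.createOrReplaceTempView("relations")

spark.sql("""
    SELECT COALESCE(o.legalshortname, o.legalname) AS org,
           COUNT(*) AS n_results
    FROM organizations o
    JOIN relations r
      ON r.source = o.id
     AND r.reltype.name = 'isAuthorInstitutionOf'
    GROUP BY COALESCE(o.legalshortname, o.legalname)
    ORDER BY n_results DESC
""").toPandas()
```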
Here we show how to count the number of research products per type, per organization. Again we need to perform a join, this time between the results, the organizations and the relations: we take the relations whose relType is isAuthorInstitutionOf, with an organization as source and a result as target. This time we also need to select the results themselves, because we need their type in order to count per type, and we can do it by exploiting the COUNT_IF variant: we count a row towards publications only if the element's type has the value "publication", and the same for dataset, for software and for other. Then we group by organization as before and show the result. We can see that the top organization has mostly publications, then datasets — just 88 — then software and then other, and similarly for the other organizations. And now that you know how the products are distributed across research types, you might be interested in the access type per organization.
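A sketch of that per-type breakdown with COUNT_IF, building on the views from the earlier sketches; unioning the four result sets into a single results view with a type column is my own shortcut, not necessarily how the notebook does it.

```python
# Union the four result sets into one view with a "type" column, then
# pivot the counts with COUNT_IF. Paths and field names are assumptions.
from functools import reduce

parts = [spark.read.json(f"data/{t}")
         for t in ["publication", "dataset", "software", "otherresearchproduct"]]
results = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), parts)
results.createOrReplaceTempView("results")

spark.sql("""
    SELECT COALESCE(o.legalshortname, o.legalname) AS org,
           COUNT_IF(res.type = 'publication') AS publications,
           COUNT_IF(res.type = 'dataset')     AS datasets,
           COUNT_IF(res.type = 'software')    AS software,
           COUNT_IF(res.type = 'other')       AS other
    FROM organizations o
    JOIN relations r ON r.source = o.id
                    AND r.reltype.name = 'isAuthorInstitutionOf'
    JOIN results res ON res.id = r.target
    GROUP BY COALESCE(o.legalshortname, o.legalname)
    ORDER BY publications DESC
""").toPandas()
```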
Sorry to interrupt, Miriam — we have 10 minutes left. Okay, okay, sorry, I completely lost track of the time. No worries at all.

Great, so I guess we'll start with some questions. First, we actually have a poll for the next community call that covers the beginner's kit — not necessarily the one happening next week or next month, but the next one on the beginner's kit — asking what you would like us to cover. In the Google Doc I created a section where, if you choose "other", you can elaborate on what you would like to see.

I don't see any hands raised, but we did have a question in the Zoom Q&A: "If some curation is made after my analysis that can be helpful for me — for example, more links made by users, better affiliation metadata, ORCID claims, etc. — will I be notified, and how much time will pass between the curation and the APIs that I can find on the OpenAIRE Graph website?" It's anonymous, so I don't know who posed the question, if they'd like to elaborate. Can you repeat it? I'll put it in the chat so you can see it as well.

Okay, so on the question about curation made after the analysis: I want to stress that the beginner's kit is just a very small subset of the graph. It is there to get confident with what can be done, what kinds of analyses and queries can be applied by exploiting the graph dataset — but it is a toy example, to start working with the model, to understand how the entities are modeled and how you can put them together to get the answers you want. It is not something that can be exploited for real analyses. OpenAIRE plans to provide a copy of the graph dataset on a cloud platform that can be used to perform analytics, so, let's say, the beginner's kit can be used to learn how to query the data, and the analysis can then be done on the real graph on another platform.

Right. If you have any other questions, feel free to raise your hand. I saw a couple; one has already been answered by Claudio — thank you, Claudio — in the chat and in the Google Doc. Maybe somebody would like to elaborate if they chose "other" in the poll? First of all, is there any problem with seeing the poll — does everyone have access to it? No issues? Yes, Baptiste.

Yes, I chose "other", probably because I'm more on the data centre side. Maybe that falls under organizations, but from a data centre's point of view: citations of datasets, the types of datasets that are cited, the types of journals, things like that — dataset-centric questions. I don't know how many ORCIDs are in there, for instance. Okay, some dataset-centric examples — thank you.

I had another question, actually, from earlier: when I logged into the forum, I created my account and it works, but I wonder if there's a plan to have a federated login, like there is in all the EOSC tools, so that we can log in through our university and be authenticated remotely. I don't know if that's in the plan or not. We haven't done this yet, because this is an experiment, but yes, we hope that the forum can grow, and we'll invest more and make it more aligned with all the services that we have in OpenAIRE and in EOSC, of course.

Miriam? Yes? I was reading the note from Baptiste, that he would like to have some examples in Spark. Oh yes, that's another note — go ahead, please. No, just that we haven't done it yet, but we need to improve the kit, and this is a great idea for making it better. Regarding Spark: it is understandable that you would like to have that, and we can think of doing it in a different way for the next release, to have a version that exploits other techniques. Well, the first examples that you showed in SQL are much shorter when you write them in Spark, and much easier to read, even from an external point of view. I know both SQL and Spark; any time I write SQL with joins I get lost, whereas with Spark I don't, because it is straightforward — the join is transparent. So I think we can do some paragraphs both ways, maybe one in one style and the following one doing the same in a different way. Thank you for this.
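As an illustration of that point, here is the earlier count-by-semantics query rewritten as a DataFrame-API sketch, using the same assumed view and column names as the previous sketches:

```python
# Same "count relations by semantics" query as the earlier SQL sketch,
# expressed with the DataFrame API; reltype.name is an assumed nested
# field. The chain reads top to bottom, and the join is explicit.
from pyspark.sql import functions as F

(relations                      # DataFrame read earlier from data/relation
    .groupBy("reltype.name")    # nested struct field access
    .count()
    .orderBy(F.desc("count"))
    .show(10, truncate=False))

# A join reads linearly too: organizations to their project relations.
orgs_projects = (organizations
    .join(relations, organizations["id"] == relations["source"])
    .where(F.col("reltype.name") == "isParticipant"))
```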
Great. And for those interested, I'm sharing the poll results now — I'm just typing them into the chat, as there's no way to share them directly. Are there any other questions? So this means that in three months we'll have the echo of this community call, based on examples of how to analyze data related to organizations? Right, that's the idea, yes. Great. And then we'll also take into consideration the comment about dataset-centric examples — it will be noted, and maybe become a new poll option for next time. Perfect.

Okay, great. All right, well, thank you everybody, and especially thank you, Miriam — sorry again for the interruption. No worries, it's never bad to have more information. So we're going to wrap up now; those interested in the Irish webinar can head on over. Thank you everybody again, and we hope to see you at the next community call. You can find all the notes, recordings and presentations on the OpenAIRE Graph website; they'll be uploaded by the end of the week. Have a good day, bye! Thank you, thanks a lot everyone, goodbye!