 Good afternoon, everyone, and thanks for coming to today's webinar on tracking the footprint of research data across infrastructures using the Research Graph API. The speakers are Dr. Ben Evans, Associate Director at NCI, and Dr. Jingbo Wang, a Collection Manager at NCI. With that introduction, I'll hand over to Ben to start the talk. Thanks. We're going to be talking about work under way to help track research data and how it's used in a broader setting. I should mention that NCI has a lot of partners who have backed and worked with us on this, including NCRIS, the Bureau of Meteorology, Geoscience Australia, CSIRO, the ANU, and a host of other partners and collaborators, ANDS in particular for this work. Some of the motivating questions, beyond just getting data management in place, are these: once you publish datasets into a public arena, how is the research community actually connecting with that data, and how do you track the impact of that investment, and of the research data, in derived products downstream? Those are challenging questions that you can't answer fully from within a single centre; you're really in an international setting, and that motivated us to work on this project, which is part of the solution. I should say that this piece of infrastructure, Research Graph, started with a fairly small partnership, but it has now grown quite a bit: the Research Data Alliance (RDA) has picked it up as the Data Description Registry Interoperability Working Group, and it has a number of players. 
You can see some of the players who have strongly supported this work over a period of time listed there, and you can follow the link on the RDA website to track it. Furthermore, through Amir's good work and others', the European Commission has picked this up and said it should now be pushed into an ICT specification. All of that is to say that this work is on a strong pathway and well worth paying attention to as it goes forward. There are four types of what we call nodes in this graph network when you're publishing and using data: researcher, dataset, publication, and grant. There could be other node types as well, but at the moment these graphs are built up from those fundamental four. When we get down into the tool, you can see the attributes in the graphic on the right-hand side: researchers are always in green, datasets in orange, publications in blue, and grants in yellow, with some of their attributes listed there. The other thing is that this graph network understands well-known metadata standards such as ISO 19115, which a lot of geospatial data fits into, but also RIF-CS, which is used in the library world and inside Research Data Australia; that catalogue uses RIF-CS, and MARC 21 and others are supported as well. So the graph system already supports that framework. At NCI, we make a number of major national reference datasets available, curated into a consistent form. They come, in principle, from the science agencies, the Bureau of Meteorology, Geoscience Australia, and so forth. 
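The four node types and their colour coding can be sketched as a minimal schema. This is an illustrative sketch only; the field names here are assumptions for the example, not the actual Research Graph schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative node types for the graph; field names are assumptions,
# not the actual Research Graph schema.
@dataclass
class Researcher:
    local_id: str
    full_name: str
    orcid: Optional[str] = None      # filled in during augmentation

@dataclass
class Dataset:
    local_id: str
    title: str
    doi: Optional[str] = None
    standard: str = "ISO 19115"      # metadata standard of the record

@dataclass
class Publication:
    local_id: str
    title: str
    doi: Optional[str] = None

@dataclass
class Grant:
    local_id: str
    title: str
    funder: Optional[str] = None

# Colour coding used by the visualization described in the talk
NODE_COLOURS = {Researcher: "green", Dataset: "orange",
                Publication: "blue", Grant: "yellow"}
```

Other node types could be added later without disturbing this structure, which is one reason a graph model suits this problem.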
Sometimes they also come from our research community itself, but they've been classified as the major national reference collections associated with NCI, and you can see some of them listed there: climate, weather and satellite imagery, bathymetry and elevation, all of this Earth-systems geospatial data in particular. As an example dataset, we have the Bluelink ReANalysis dataset, and the left-hand side gives a short summary of what it is. On the right-hand side, many people are familiar with catalogue systems; we use GeoNetwork as part of our core catalogue, so you get the title (that's the blue circle you can see on the right-hand side), an abstract, and points of contact, all recorded as part of the ISO 19115 standard, along with how to get hold of the data. The question you then have about something like this is: which researchers are working on this or related datasets, how is it being published, and what else is connected to it? You end up with a little graph of things, and down on the bottom right-hand side, just off this basic diagram, you can see Peter Oke, the main contact for the dataset, who is associated with the Bluelink ReANalysis dataset even in our local information. So you can find out a bit more about Peter: we have other information systems with Peter's details, the projects he's working on, publications linked to him, his contact details, and a picture of Peter looking very sprightly. We have that information at NCI. On the left-hand side, inside the dotted line with the NCI logo around it, we know a fair bit about Peter; that's the number one with the green. There he is. 
And there he is as a researcher, with an identity and attributes inside our local information, and we know various things about the datasets Peter is associated with. But other things live outside NCI. In particular, on the right-hand side, out in the external world, Peter Oke has what's called an ORCID iD, as many of you know, and associated with that ORCID iD we know things about his publication record. The trick in all of this is to associate our internal information with the external information, and there are a number of steps we go through. Step one: record the information inside a local graph, which we'll go through in a second. Then we augment the graph by connecting it up with the ORCID iD, and then we can find further information, in particular external records such as his publications. To describe the same thing in a more fundamental way: we have a GeoNetwork catalogue holding a lot of this information. The utilities in the Research Graph system harvest that and put it into Neo4j, a type of graph database, which is simply the one we happen to be using, hosted in the cloud. That holds our information; it's just a recasting of the local records into this system. Then we go out into the broader research graph in the outside world and augment the local graph database with that extra information, and we can visualize it in various ways. That's what this image shows, and there's a graphical tool that comes with it, so you can start seeing a whole set of connected things to do with this data that can then be exploited. 
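The harvest step, recasting a catalogue record into graph nodes and edges ready to load into Neo4j, can be sketched as follows. This is a minimal sketch under stated assumptions: the input field names and the ASSOCIATED_WITH relationship label are illustrative, not the Research Graph utilities' actual API.

```python
def record_to_graph(record):
    """Recast a catalogue record (already parsed from GeoNetwork's
    ISO 19115 XML into a dict) as graph nodes and edges.

    Field names and the ASSOCIATED_WITH label are illustrative
    assumptions, not the Research Graph utilities' API.
    """
    nodes, edges = [], []
    ds_id = record["identifier"]
    nodes.append({"id": ds_id, "label": "dataset",
                  "title": record["title"]})
    # Each point of contact becomes a local researcher node linked
    # to the dataset node.
    for contact in record.get("points_of_contact", []):
        person_id = "local:" + contact["name"].lower().replace(" ", "_")
        nodes.append({"id": person_id, "label": "researcher",
                      "name": contact["name"]})
        edges.append((person_id, "ASSOCIATED_WITH", ds_id))
    return nodes, edges
```

Loading the resulting nodes and edges into Neo4j would then be a matter of emitting one MERGE statement per node and per edge, so repeated harvests stay idempotent.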
If we had only the local information about the various datasets, all we would have is the left-hand side of this. Through that extra augmentation, querying the international research graph and folding the results back into the local data, we end up with a much richer set of information about each individual dataset and researcher, what they're doing, and their associations. That's essentially what's going on. The Research Graph system that's been put in place by the partners, with Amir in particular driving it, interoperates with a whole range of services: ORCID, DataCite, Scholix has come on board, and other major data centres such as GASIS, and so on. So there's a growing list of information being fed into an interoperable graph system, and richer, deeper details we can start harvesting. We showed the simplest augmentation in the description on the previous page, but you can actually run several levels of augmentation, and we're still exploring the best way to augment the data for the questions we're trying to answer. With that, I'll hand over to Jingbo, who will take us through more of the details of Research Graph and where it's going. Hi. From this point, I'm going to go through a couple of slides in the next ten minutes to demonstrate how we implement the Research Graph pipeline, report on what we're currently working on, and outline some future plans. This slide shows the input and the output. The input is NCI's metadata database: as you saw in Ben's slides, our datasets are available in GeoNetwork in various formats, which could be CSV, XML, or JSON. A Jenkins server takes that input from GitHub and builds the NCI graph. 
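One level of augmentation, as described above, can be sketched as a merge keyed on ORCID iD. Here `external_index` stands in for a query against the wider research graph; its shape, and the AUTHOR_OF label, are assumptions for this sketch.

```python
def augment(local_nodes, external_index):
    """One level of augmentation: for each local researcher with an
    ORCID iD, pull in that iD's external records (here, publications)
    and link them to the local node.

    `external_index` maps ORCID iD -> list of publication dicts; it
    stands in for a query against the international research graph.
    """
    new_nodes, new_edges = [], []
    for node in local_nodes:
        orcid = node.get("orcid")
        if node.get("label") == "researcher" and orcid in external_index:
            for pub in external_index[orcid]:
                new_nodes.append({"id": pub["doi"], "label": "publication",
                                  "title": pub["title"]})
                new_edges.append((node["id"], "AUTHOR_OF", pub["doi"]))
    return new_nodes, new_edges
```

Running a step like this again over the already-augmented graph (e.g. resolving the co-authors of the newly added publications) gives the second and further levels of augmentation mentioned in the talk.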
The output is the NCI graph. On the right-hand side, the bottom screenshot shows how easy it is to maintain and update the database with a single click of a button. The five modules in green show the step-by-step stages inside the Jenkins server that build the NCI graph and run the augmentation against other databases such as ORCID. What we get eventually is an NCI GraphML file. There are different ways to visualize the graph. One way, not shown here, is to use the Gephi software, but the more popular way is to present the graph in a web-based format. If you click that link, or type it into your browser, you can see it online. I'm going to show you three screenshots of this web page, followed by a short live demo. This is the interesting part: once we have the graph, we analyze it and try to tell the story it contains. The first screenshot gives an overview of how many publications, datasets, and researchers are in our augmented graph. Now I'll run a little live demo to repeat the story Ben told you about Peter Oke. If you type researchgraph.org/nci into the web browser, you see a web page about the NCI graph, and clicking the orange button opens a new tab that shows what the actual graph looks like. If I find Peter Oke as a researcher and click on him, only the connections to this researcher are shown. In the colour coding of the dots, this one is the dataset, the Bluelink ReANalysis data associated with Peter Oke. And if you notice, there is another green dot over here: that is the part augmented from ORCID. The blue dots represent the publications associated with this researcher. 
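The GraphML file mentioned above is plain XML, so a minimal writer needs nothing beyond the standard library. A sketch, keeping only node ids and edges and omitting attribute keys for brevity:

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes, edges):
    """Serialize nodes (dicts with an 'id') and edges ((src, dst)
    pairs) into a minimal GraphML string of the kind Gephi or a
    web-based viewer can load. Node/edge attributes are omitted
    here for brevity.
    """
    root = ET.Element("graphml",
                      xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for n in nodes:
        ET.SubElement(graph, "node", id=n["id"])
    for src, dst in edges:
        ET.SubElement(graph, "edge", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")
```

A real export would add `<key>` declarations and `<data>` elements to carry the node colours and titles into the viewer.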
This really demonstrates that, through augmentation, the datasets and researchers in our own database are connected to the rest of the world. Let me go back to my presentation. I should say that we did play around with different analytics, and this is the most interesting part. We demonstrate a few cases we think people are interested in, for example: which researcher, always identified by an ORCID iD, has the most publications, and which researcher has the most datasets associated with them and their affiliation. On the right-hand side, if you're still in the web browser, you can hover the mouse over one of the names and it will show only the connections between that researcher and other researchers, so it's an interactive mode. I should also say that this augmentation is still a work in progress: we can augment with further databases, such as DataCite or the European data repositories, and make our graph bigger and bigger. The last screenshot shows the number of publications per year, and as I said, this is not a static graph, because we can always augment with other databases and introduce more publications if they are not in the ORCID database. Behind the scenes, we use a Jupyter Notebook to generate this interactive web format, and we plan to go further by providing predefined queries, so that people can enter a person's name or ORCID iD to find the connections between that researcher and publications and datasets, and in the future even grants, if they're available in our database. Next: we think the research graph can be useful for a number of different groups of people, and we think providing the research graph in a linked data format would benefit people who want a more machine-searchable and actionable approach. 
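The predefined query idea, and the interactive highlight behaviour described above, both come down to walking the graph outwards from one node. A small sketch, assuming the (src, relation, dst) edge format used throughout these examples:

```python
from collections import defaultdict

def connections(edges, start, max_hops=2):
    """Starting from one node (e.g. a researcher found by name or
    ORCID iD), walk the graph up to `max_hops` hops and return every
    connected node id. Edges are (src, relation, dst) triples and are
    treated as undirected for the traversal.
    """
    adjacency = defaultdict(set)
    for src, _rel, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    seen, frontier = {start}, {start}
    for _ in range(max_hops):
        frontier = {n for f in frontier for n in adjacency[f]} - seen
        seen |= frontier
    return seen - {start}
```

In the deployed system this would be a query against Neo4j rather than an in-memory walk, but the shape of the question, "what is within N hops of this researcher?", is the same.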
What we've done is a bit of proof-of-concept work, extending the current JSON format of the research graph to JSON-LD using schema.org, to enhance the semantic features of the research graph. We have a publication from last year describing the approach and the ideas; the reference is at the bottom of the slide. The other thing is that once we build the research graph, there is a lot of interesting analysis we can do, so we are currently exploring new ways of analyzing the information in the research graph and trying to pick out the good stories it can tell us. Also, because we are a national data repository, we encourage people to do cross-disciplinary research on our high-performance platform. If we can show that when different types of dataset are available on the same platform, more research, more publications, and more funding follow, that will go a long way towards demonstrating the impact of our data management practice. In summary, I think a research graph means different things for different groups of users. For a user of the data repository, they can understand the dynamics of research integration through these analytics. I remember that when researchers submit an ARC grant, they sometimes show their publication citations improving year on year; with the research graph, they can show more than just publications, including their contributions of datasets and the additional funding they have been awarded. For the higher-level executives and board of a data repository, we can demonstrate the value of good data management practice and provide interoperable data services through these more advanced services. We also advance science by enabling more publications and more impact in the metrics. 
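The JSON to JSON-LD extension can be illustrated with a small conversion. This is a sketch in the spirit of the proof of concept, not the published mapping itself; the input field names are assumptions, while `@context`, `@type`, and the schema.org property names (`Dataset`, `name`, `identifier`, `creator`, `Person`) are standard schema.org vocabulary.

```python
def to_jsonld(dataset):
    """Wrap a plain JSON dataset record in JSON-LD using schema.org
    vocabulary, along the lines of the proof of concept described in
    the talk. Input field names are assumptions for this sketch.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": dataset["title"],
        "identifier": dataset.get("doi"),
        "creator": [{"@type": "Person", "name": n}
                    for n in dataset.get("authors", [])],
    }
```

The payoff is that generic consumers (search engines, linked-data tooling) can now interpret the record without knowing anything about the repository's internal JSON layout.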
Finally, for the funding bodies: since they have invested a good amount of money in the data repository, we can demonstrate the impact of that investment by showing quantitative analysis of the impact metrics within the research community. If you want to learn more about the graph, we have the source code on GitHub and an interactive demo of the graph, and there is a Twitter account as well, if you want to socialize it. I think that's it. Okay, thanks, Jingbo. I'd like to thank Ben and Jingbo for giving this talk, and thank you, everyone, for attending the webinar. Thank you.