Good afternoon everyone, welcome to the webinar this afternoon. My name is Amir Aryani, I'm working for ANDS, and with me today I have Dr. Jingbo Wang from NCI, who will be the co-presenter in this talk. We are going to talk about the Neo4j technology that we use as part of the Research Data Switchboard. I will give some background, we will talk about some technical aspects, and then Jingbo will talk about the NCI implementation of this technology. So the agenda for the talk is: the background on the Research Data Switchboard and the Research Graph, which is the data model behind it; then the Neo4j queries, for which I have allocated about 10 to 15 minutes on the technical side; then the NCI implementation; and at the end we will have time for questions and answers. The background of this work is that it started from the challenge of cross-platform discovery of research data. It goes back to 2014, in the Research Data Alliance working group, when we had the problem of finding the related connections. Some of you might have seen this slide; it's one of the earlier slides in this work, and I keep using it because it shows the early stage of the problem. Although we have solved this in the Research Data Switchboard to a great degree, many repositories still have this issue. If you have a dataset and you want to know what else in the scholarly communication can be linked to it, a keyword search is usually not an efficient way. So in this case, for example, on the ANDS page in 2014 we had a dataset, and on the page for that we had a cross-link to a dataset, and we were doing a cross-keyword search for the title and related keywords. The problem was that the queries that came back included a lot of false positives. 
In this result we had more than 1,000 records connected to the dataset, and that was supposed to be a recommendation for the researcher, saying these are related to this dataset. In practice that is not very useful when you have so many recommendations. So one of the initial ideas behind this was how Amazon or other retail stores do this with a "see also" service. If you look at a book, they tell you: do you want to look at these five other books by the same author or the same publisher, or books that people who purchased this book also purchased? Those recommendations are usually very precise and very limited: you recommend three or five options to the end user, not 1,000 options. In this context we started a working group in the early days. These are the initial partners that joined the group at the time to address this cross-platform discovery challenge. We had significant contributions from Dryad, CERN and DataCite, and later on OpenAIRE and other partners joined us. In Australia we had the collaboration of NCI and also the University of Sydney. There are other universities that have been involved in this project, like RMIT, or VIVO Cornell in the earlier stages of the working group. Now, this is about the Research Data Alliance, and usually when I get to this part of the presentation I talk about the structure of the Research Data Alliance. I think in the scope of this talk we don't have that much time for it, but just as a brief note: the Research Data Alliance is a joint venture by major funders that invest in data infrastructure, and the main goal is that people who work on different projects coordinate and collaborate. There are different working groups, and each working group has a different area that it works on; they are almost like projects. 
Now, in this environment of the Research Data Alliance we had a working group which started in 2014 and concluded its main deliverables in 2015, and at the moment we are continuing to maintain the work and extend the platform. The working group's main recommendation, after multiple prototypes, was that datasets can be connected using the co-authorship model. Although in principle it's a simple idea, in practice it requires connecting information across different infrastructures. When we were doing this, the first stage of the process was looking at how it can be done by librarians. That's where we went through the process of asking a librarian, or actually a case study group, to look at this process and tell us what they would do, so we could learn from their practice. So in this context, this is the new version of Research Data Australia after a couple of updates; unfortunately I didn't keep the screenshot of the same dataset from 2014, so I had to take a new one. This dataset is from the University of Sydney, and when you, as a researcher or someone who works in a library, look at this dataset, you can identify the researcher who has worked on or contributed to it. You can search for that person on Google and find the researcher and their profile page, including publications. When you go to the publication list, you can, if you want, read every single paper, and in the content of those papers you will find datasets in other repositories being cited or mentioned. So in this case we have a dataset from the Dryad repository. In the Dryad repository you can actually see the same researcher, although with a different name abbreviation, and the publication is basically disconnected from the University of Sydney. 
The logo of the University of Sydney is just put on this slide to emphasize the connection, but in practice, when you are in the Dryad environment, you don't know that this work is connected to a researcher at the University of Sydney. So if we follow the chain of all of these connections, we go from a dataset in ANDS, to a researcher at the University of Sydney, with an article in PLOS ONE, to a dataset in the Dryad repository. Now, the first activity of the group was to demonstrate that this works, so we took about 250 collections at the time, did a study around them, and established the links. But obviously that is not a scalable model; you cannot do this for every single dataset and every single researcher. So we decided to use machines. The goal here was to have a solution without spending too much time on research and inventing standards. So we adopted all the standards that we could from other groups and platforms, and we tried to implement something that is simple for others to adopt and easy to maintain. The overall structure has three different layers. The first one is a harvesting layer; it's basically OAI-PMH, and it reads a couple of different formats. Dublin Core is the most obvious one, but it also reads RIF-CS, MARC 21, DLI, DDI and a couple of other ones from international repositories. The list of these is in the working group documentation, which I believe I have a link to later in the slides. The harvesting layer puts the information into a set of machines; in our case we implemented them on Amazon, but you can do this on Nectar or any other high-performance computing platform. The main function required from those platforms is that they should be able to run Java programs, because everything is implemented in Java. 
What those programs do is read the information from all of those endpoints and connect them together when a connection is possible. It resolves the identifiers; it uses the Crossref integration to get the metadata for the DOIs, and the same for DataCite; it does the same thing for ORCID; it has a Google API integration so that when we have a grant or paper, we can search certain university domains to find the profile pages. We use this for some level of disambiguation, and then we do something called node linking, which goes across the graph and links the nodes; there is an inference component for those connections to happen. Again, I cannot go into the detail of this; there is a document in the working group recommendation that talks about a relationship called "known as". If you find two different elements that either have the same identifier, which includes any kind of URI, or have the same title with similarity in other elements, we link them as "known as" elements, and then in the node-linking stage they actually get linked together. Now, all of that information comes together in one database, and in the case of our project we used Neo4j. The main reasons for that were, one, the simplicity of implementation: it was very easy to hire programmers who could code Java and write programs for Neo4j; and also the performance. The speed of querying the database in Neo4j is much, much quicker than if you implement the same thing using an RDBMS or a triple store. So Neo4j was the main point of aggregation for all of these connections for us. 
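As a rough sketch, the node-linking stage described above could be expressed in Cypher along these lines. The `knownAs` relationship name follows the working-group recommendation mentioned in the talk, but the labels and the `doi` property are assumptions, and a real implementation would batch this rather than compare all pairs:

```cypher
// Hypothetical sketch of the node-linking stage:
// connect any two records that resolve to the same DOI.
MATCH (a:publication), (b:publication)
WHERE a <> b AND exists(a.doi) AND a.doi = b.doi
MERGE (a)-[:knownAs]-(b)
```

The undirected `MERGE` creates the relationship only if it does not already exist in either direction, so re-running the linking stage is idempotent.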
In the data access layer, after doing something we call metadata harmonization, which is basically harmonizing the property names as much as possible toward Dublin Core metadata, we ended up with a graph model that we called Research Graph at the time. There is one characteristic of this meta-model which is different from the other schemas you may be harvesting. First of all, it's driven mainly by URIs: we have a URI for most of our elements, wherever it is possible to convert information to a URI. Examples of that are converting the DOI to a URI and converting the grant ID to a PURL. Everywhere we had the option to convert something to a URI, we have done it, and that enables the scalability of this graph to a distributed graph across multiple platforms. The other thing is that we have a separate relation object. If some of you are familiar with RIF-CS: we have the party record, we have data which is a collection, we have services, we have activities. In this model we also have a relation object. The main advantage of doing this is that it enables connecting the nodes to a URI which doesn't actually exist in our ecosystem. An example of this: if you are looking at the data repository in a university, you might have an ORCID record as a related identifier. In this model we just put in a relation object that says this record is linked to ORCID, and we don't have to resolve it. It sits there, and later on, in our inference model, we can resolve the record to that ORCID identifier, and then the graph system handles this as a bi-directional relationship between nodes. Now, I'm not supposed to talk too much about architecture, because we have a lot to talk about with the actual Neo4j queries, so I'll go through this topic quickly. These are examples of multiple degrees of separation; I'll just go through one of them. 
So two datasets can be connected if, for example, a dataset has a contributor who is an author, and that author published a paper, and that paper cited another dataset. This is what we call three degrees of separation, and we have multiple of those based on different scenarios. This is a link to the Research Graph model; in the interest of time I will skip this slide without talking about the individual elements. There's a link to the schema, and if you have time at the end of the webinar and have questions about it, I can come back and talk about the schema later. The slides will be available, so you don't have to write these links down; you will get the slides and can click on the links. The next part of the talk is about Neo4j, and that was actually the main motivation for this, another review of the Switchboard project if you like. We implemented the Research Data Switchboard in multiple institutions in Australia: NCI has adopted it, the University of Sydney has adopted it, ANDS is using it, and in Europe we also have multiple partners who are using this technology. Now, we came up against the same questions again and again, and these are the different queries that people ask: how can I do this? How can I find my dataset using a DOI? How can I find all the datasets from this particular publisher? So in this part of the presentation I'm going to walk you through some of these scenarios. One thing about Neo4j is that it has a graph browser, and for the purpose of this presentation, and also for the Research Data Switchboard, we have an extended version of this browser which has some built-in functionality for queries related to scholarly works. You can think of it as an extended graph explorer for Neo4j. In this environment, one of the things you can do, as an example of the Cypher queries, is search for a dataset. 
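The three-degrees-of-separation pattern just described could be written as a Cypher pattern match along these lines. The labels and the `full_name` property here are assumptions based on the demo, not the exact Switchboard schema:

```cypher
// dataset -> researcher (contributor) -> publication -> cited dataset:
// three hops, i.e. three degrees of separation.
MATCH (d1:dataset)--(r:researcher)--(p:publication)--(d2:dataset)
WHERE d1 <> d2
RETURN d1.title, r.full_name, p.title, d2.title
LIMIT 10
```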
What I'm going to do is search for the same dataset from Dryad that we just looked at in our example. So here I can say I want to see the DOI where, well, let's steal the DOI from the slides. What I said is: give me a dataset from Dryad where there is a DOI and that DOI equals this. When I do that, it comes back with the kind of red/orange dot that represents that Dryad record. You can look at the content here: it tells you this is a record from Dryad, this is the title, and these are the list of authors. Now, in this environment we can query this further, by either double-clicking on it, or doing one click and then clicking on the expand button. Here you get the other information: there are four other datasets in the Dryad environment linked to this; there is a paper, the PLOS ONE article, which was also on one of the earlier slides; and this is a researcher. In this environment I can keep expanding the nodes, and they call this traversing through the graph. Here I can see all the publications for that researcher, and then all the grants they got. For these grants, some of them have further connections, and I believe one of them is also connected to another dataset. Now, in this environment, if I want to look at the metadata of the record, I can expand this panel at the bottom of the page, and here I can see the title; this is the first dataset that we started from. So, back to the point of multiple degrees of separation: here I have a dataset that is linked to a researcher and a grant, and goes all the way back to the initial dataset. In this domain you can do a set of queries, and I have a list of things for us to look at. 
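The lookup shown in the demo is roughly this; the DOI value below is a made-up placeholder, not the one from the slides:

```cypher
// Find a Dryad dataset by DOI; in the Research Graph each node carries
// both a source label (dryad) and a type label (dataset).
MATCH (n:dryad:dataset)
WHERE n.doi = "10.5061/dryad.xxxxx"   // hypothetical DOI
RETURN n
```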
So we are going to look at: how to find a dataset; how to find a publication, a grant and a researcher; how to find links to an ORCID record; how to find datasets that have a DOI; how to find DOIs using a prefix; how to find highly connected datasets, using the number of edges on the graph; connections with multiple degrees of separation; and finding the shortest path between two researchers. Now, this might end up being a bit overwhelming, going through all of these Cypher queries. If some of these things feel complicated, the slides will be available online, so you can always go and try them, and then send me an email and we can have an offline conversation about the syntax of the queries. We already covered finding a dataset by DOI, but we can also find a dataset by title. The way it works is that any node in our Research Graph model has a property called title, and in this case the title for the dataset is the one we get from the metadata record. This is a simple query. You can do the same thing for a publication: you can get the publication record. This is a publication query against a particular database. I can go here, paste the same query, and get this record from the database. One thing I want to point out: the query you run can take longer than usual, because the graph is a big database; in this case we have about six million nodes in the graph database, and what we have just run is a string search in a graph database. For those of us who are familiar with the architecture of databases: a graph database is not designed for string search. Now, there's a trick to this, and when I was preparing this I thought it would make a very good example. 
In this case the query goes through the six million nodes to find the related element, but we could do much better just by making the query more precise. If you know it is from CERN, we can just add the CERN namespace here, and we can also add a limit at the end. What that does is say: the first one you find, come back with it, don't go looking for the rest; I know there is only one instance of it. So when I hit the enter button now, it comes back immediately. The way you write the query has a direct impact on the performance you get from the graph database. This is another example of how you can find by title. This is the same as for a dataset, so there's no complexity there: I want to find a record from CERN that is a publication and has this title. Now, for grants we have another property that is useful, and that is the persistent URL, or PURL. In Australia, for all the ARC and NHMRC grants, we have a PURL, which is basically this namespace with the grant ID at the end: you have purl.org/au-research/grants/ and then, if it is ARC, ARC dash grant number; if it is NHMRC, NHMRC dash grant number. I can copy the same query, go here and paste it. But before I paste it, I want to show you another thing: for every query you type, you can hit this button and add it to your favourites. I have a list of favourites with all the queries I want to run, just in case I forget how to type one of them. So I can go to this list and click on it, it's the same query; I can run it, and just to make it quick I say give me just one. This is our grant, and you can look at the content of the grant here. In the same way, you can obviously search for a grant by title. Now, for researchers there are a lot of options. You can search for a researcher by first name or last name. 
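The two tricks just described, narrowing the search by a source label and stopping at the first hit, plus the grant-by-PURL lookup, might look like this. The title, grant number and property names below are illustrative placeholders:

```cypher
// Narrow the string search to the CERN subgraph and stop at the first hit.
MATCH (n:cern:publication)
WHERE n.title = "Some exact publication title"   // hypothetical title
RETURN n
LIMIT 1

// Find an ARC grant through its persistent URL (PURL).
MATCH (g:grant)
WHERE g.purl = "http://purl.org/au-research/grants/arc/DP0000001"   // hypothetical grant number
RETURN g
```

The label restriction lets Neo4j scan only the matching subgraph instead of all six million nodes, and `LIMIT 1` lets it stop as soon as a match is found.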
You can search for a researcher by ORCID, and you can also search for a researcher by Scopus ID. In this system we actually harvest the Scopus IDs that are linked to ORCID and index them, so if you are looking for a researcher with a Scopus ID, you can find them, and the query looks like this: I want a researcher with a Scopus ID equal to this; give me the researcher. Now, I think we are getting to the queries that are more complicated. What we are going to do here is find a connection: we are not only looking for a particular node, we are looking for nodes that satisfy specific criteria. In this case, we are looking for records from Dryad that are linked to records from ORCID. One thing: if you ever try to copy these Cypher queries into a PowerPoint presentation, Word or another text editor, they fiddle around with the format of the dashes and other special characters, so just be mindful that when you copy and paste, the queries sometimes get distorted. Back to the topic. In this case we have a dataset from Dryad, and this syntax says: I want only the nodes from Dryad that are linked to ORCID, and then I want the count; I want to know how many there are. This introduces another piece of syntax, the count element: rather than returning the nodes, you can return the number of nodes. In this case we have 1,231 Dryad records, datasets and publications actually, that are included in ORCID profiles. And to see what they are, you can just say return n, give me 10 of those records. I get these ones; I can expand one of them randomly. So, this publication is a good place to explain another thing: when you look at a node in the Research Graph model, there are multiple labels on each node. 
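The two queries just described, a researcher lookup by Scopus ID and the count of Dryad records linked to ORCID, could be sketched as follows. The `scopus_id` property name and the ID value are assumptions:

```cypher
// Researcher by Scopus ID.
MATCH (r:researcher)
WHERE r.scopus_id = "7004212771"   // hypothetical Scopus ID
RETURN r

// Count Dryad records that have a relationship to an ORCID record;
// DISTINCT avoids counting a record once per ORCID link.
MATCH (n:dryad)--(o:orcid)
RETURN count(DISTINCT n)
```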
The labels identify the source. In this case we have one publication which came from both ORCID and Crossref. That means we had two different sources of information with metadata for this record, and we merged them. In one of the early presentations someone asked me: what happens if there's a conflict in this metadata, if the title in ORCID is different from the title in Crossref? The way we manage this is that there's a priority: different sources have authority over different information. For example, for anything related to a DOI, if that DOI is registered with Crossref, the information from Crossref will override other nodes. So let's say the title in the ORCID record has been modified by the researcher. When we ingest this and do the integration with Crossref, we overwrite the information from ORCID with Crossref's when it comes to the DOI. I know this might lead to false negatives, but for us this was the most practical way to manage it. Now, we can expand this publication; it goes to the ORCID record. And if we expand the ORCID record we might get some information attached to it, which in this case we didn't, so that is just one ORCID record linked to one publication. Going back to our slides. This is another type of query that might be useful for many of us: how do I find the records in ANDS that are linked to ORCID? In this case I want to look for records from the University of Sydney that have been contributed to Research Data Australia and have a connection to ORCID. The only difference here is that I use a new element called the ANDS group. Now, there is another characteristic of our graph model: unlike some schemas that have a concept of a profile, in the graph model we do not actually have a concept of a profile. 
What we do is have a standard set of fields, but repositories can also bring their own domain-specific fields, and we can ingest them into the system. One example is what we did for NCI, ingesting the organization nodes into the graph structure as they had them in the GeoNetwork data model. And it works because the system is agnostic to the metadata elements in the individual nodes, so in the graph database you can have a hybrid model. In this case we have the ANDS group, which is a metadata element from ANDS, and we say, okay, that is equal to the University of Sydney; give me 10 records. These are records from the University of Sydney in ANDS which have a connection to ORCID. For example, this is a dataset; I can expand it. These publications are Crossref publications. I don't see any ORCID records here... actually, yes, there it is: this is our ORCID record. Okay, here is another useful thing: one of the things you can do in the graph database is search by property. Say I want to know all the datasets that have a DOI. I can copy and paste this into our graph database system, hit the enter button, and see how many datasets we have with a DOI registered. The answer is that 57,000 records in our database have a property called DOI. You can replace this with other properties: say I want to see all the grants that have a PURL, and it's 45,000 grants. Now, another example, one of the queries we had especially from our European partners, was finding datasets by the prefix of the DOI. The goal of this is to find the datasets from a particular journal or a particular publisher. This uses a syntax called a regular expression: in Neo4j you put a tilde after the equals sign, and then you put dot star at the end, which means everything after that point is acceptable. 
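The property-existence check and the DOI-prefix match described above might be written as follows on Neo4j 3.x (in newer Neo4j versions `exists(n.doi)` becomes `n.doi IS NOT NULL`; the prefix below is a placeholder):

```cypher
// How many datasets carry a doi property at all?
MATCH (n:dataset)
WHERE exists(n.doi)
RETURN count(n)

// Datasets whose DOI starts with a given publisher prefix:
// =~ introduces a regular expression, and .* accepts anything after the prefix.
MATCH (n:dataset)
WHERE n.doi =~ "10.5061/.*"
RETURN n
LIMIT 10
```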
When I do that, it returns a lot of data, so I put a limit of 10, and it gives me 10 records that have this DOI prefix. You can look at the metadata individually, or I can just click on rows and look at the records that way; all of them have a DOI that matches my criteria. Another example is how we can find the highly connected datasets. The queries are getting more and more complicated here, but the good news is I have only two of these complicated queries to go. Here what we have done is say: I want to see all the datasets ranked by the number of connections they have. Actually, this query is for ANDS, so I want to see all the ANDS datasets with the most connections. So I say: give me the key for the record, give me the title so I know what it is, count the number of connections it has, and sort by the number of connections in reverse order. Run that, and it goes into the system and comes back quickly. So here we find all our datasets ranked by the number of connections linked to them, and if you look at this particular node, we end up with 757 nodes linked to that particular dataset. Now, this is the last of our queries... actually no, sorry, I lied, there is one more. This is an example of finding the links that go through multiple degrees of separation. The application of this query is a scenario like this: I am working for ANDS, and I'm wondering how many connections we have to Dryad. We might have direct connections, but we might also have indirect connections. An example of an indirect connection would be a dataset or paper in the ANDS registry that is claimed by a record or profile in ORCID, and that ORCID profile also has connections to Dryad. 
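The highly-connected-datasets ranking could be expressed along these lines; the `key` property and the `ands` label are assumptions based on the demo:

```cypher
// Rank ANDS datasets by their number of connections, highest first.
MATCH (d:ands:dataset)--(other)
RETURN d.key, d.title, count(other) AS connections
ORDER BY connections DESC
LIMIT 10
```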
The syntax for this says I want to get all the connections between one and three hops. If I run the query like this, returning the title and the key, and only 25 of them, it returns the node title, so this is a link from ANDS, and then we have the dataset key from Dryad. Now, the last of our queries: I've found this one interesting for a lot of people in the publication domain, when they are looking at two researchers or two different datasets and wondering, are these two actually connected? In a graph database you can search for something called the shortest path. Oh, that's an example of why copying from the browser is not a good idea; I don't know how these characters ended up here. I'll go back to my list and pick this one as the last query; I don't have this one saved, okay? So we have to fix it. What this query does is say: I want a dataset from Dryad with this DOI, and I want a dataset from ANDS with this DOI; find me the shortest path between these two. And it looks like this: these two datasets are linked through a researcher and a publication. In this case you can replace a dataset with a researcher and, instead of a DOI, use an ORCID; you can replace a dataset with a grant and, instead of a DOI, use a PURL. Now, if you use the extended version of Neo4j with the Research Graph model, you get this last tab here, which you can open, and it has templates for all of these queries. For example, for the shortest path I can click here and it fills in the box automatically for me, and I only need to fill in the DOIs I need. We are extending this further: at the moment we have about 10 example queries here, and we are planning to add more and more queries to this template. So this is basically the last slide for the Neo4j queries. 
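The one-to-three-hop traversal and the shortest-path query might be sketched like this; both DOIs below are made-up placeholders:

```cypher
// Direct and indirect links from ANDS to Dryad, within 1 to 3 hops.
MATCH (a:ands)-[*1..3]-(d:dryad)
RETURN a.title, d.key
LIMIT 25

// Shortest path between a Dryad dataset and an ANDS dataset.
MATCH (d1:dryad:dataset {doi: "10.5061/dryad.xxxxx"}),
      (d2:ands:dataset  {doi: "10.4225/yy/xxxxxxxx"}),
      p = shortestPath((d1)-[*]-(d2))
RETURN p
```

Swapping the labels and identifying property, for example `researcher` with an `orcid`, or `grant` with a `purl`, gives the variants mentioned in the talk.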
I hope at the end of the presentation we will have time for a Q&A discussion. The next part of the talk is by Dr. Jingbo Wang about NCI. I believe I did not properly introduce Dr. Wang: Dr. Jingbo Wang works as a data collections manager for NCI, which is located at ANU in Canberra. Her background, and this is the tricky part we practised before the presentation, is that she got her PhD in seismology, which is in the geosciences, and her mission at the moment at NCI is connecting information across different platforms, basically providing this ecosystem for researchers and making the research environment more efficient. With that introduction, I will hand the presentation over to Jingbo. Hello everyone. I will use about 10 to 15 minutes to share my experience as a user of the RD-Switchboard, and I want to show the graph connection experience using our NCI metadata. This talk was also presented two weeks ago at the first reproducible science workshop in Hanover, Germany, and it got positive feedback. For people who don't know much about NCI: NCI is short for National Computational Infrastructure. We are the national-level supercomputer centre, physically located on the Australian National University campus. From 2013 we received a large amount of funding to store research data. The motivation is that some of the data is getting bigger and bigger, to gigabytes, terabytes and much larger, especially in domains such as the environmental sciences; it's growing so fast that an individual PC or hard disk cannot store large-scale data, and transferring and sharing the data became a problem. That's why we got this funding as one of the eight storage nodes to support the research data infrastructure. 
At the moment we have more than 10 petabytes of research data. As you can see in this figure, our data ranges from space, astronomy observations and satellite images, to climate models and climate change research, ecosystems on the ground, geophysics exploration, and even deeper, mantle and core geodynamic processing data. With the funding of being one of the research data infrastructures, we take advantage of the data we have collected to make seamless connections across different disciplines. As you can see here, we care about data formats, because we want to make use of the HPC facility, and we care about providing open access to researchers, because large-scale data is impossible for them to download to their local machines for processing; it's better to provide some kind of virtual environment for them to log in to and do the big processing at our centre. Because we have so much data, we need to organize the catalogue so people know what datasets are available at NCI; this is one of the common questions researchers care about. We built the catalogue on the relational links between researcher, data, grant and paper. For example, the first line says researcher A used data 1, supported by grant A, to generate papers 1 and 2. Similarly, for each line we have one record. However, the obvious thing here is the redundancy: researcher B appears twice, data 1 appears twice, grant B appears twice. If every single node is in our database like this, it creates a lot of redundancy. So the idea of adopting the RD-Switchboard was to use identifiers: the same identifier for a researcher, like an ORCID; the same identifier for the data, like a DOI; the same identifier for a grant, like a PURL; et cetera. After we merge the different nodes with the same identifier, each entity of researcher, data, grant and paper is connected through a graph relationship. 
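The merge-by-identifier idea can be illustrated with Cypher's `MERGE`, which creates a node only if one with that identifier does not already exist, so each row of the relational table reuses the same researcher, data, grant and paper nodes. All identifiers and the relationship name below are made-up placeholders:

```cypher
// Ingest one catalogue row: researcher A used data 1, funded by grant A,
// producing paper 1. Re-running this for another row that mentions the
// same ORCID, DOI or PURL reuses the existing node instead of duplicating it.
MERGE (r:researcher  {orcid: "0000-0002-0000-0000"})
MERGE (d:dataset     {doi:   "10.4225/example.data1"})
MERGE (g:grant       {purl:  "http://purl.org/au-research/grants/arc/DP0000001"})
MERGE (p:publication {doi:   "10.1371/journal.pone.0000001"})
MERGE (r)-[:relatedTo]->(d)
MERGE (r)-[:relatedTo]->(g)
MERGE (r)-[:relatedTo]->(p)
```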
I think that is my understanding, as a user, of how the RD Switchboard can help us make the connections. With this graph view, I can answer questions such as: what is the usage of NCI's datasets? It translates directly into an RD Switchboard query, like the one Amir showed you a little while ago: how many datasets published at NCI are being referenced in research journal articles? Another question, such as what is the awareness of the available datasets within the research community, translates into a query asking how many researchers and institutes are connected to the datasets, and so on. A third question, which is even more specific: if I would like to know more about this dataset, who should I contact? Who generated this data? Who used the data to publish a paper? And what previous research has been done using this dataset? I believe these are very common questions for researchers: when they start a new topic, they do this kind of search, as I do myself with a Google search first. If we provide this kind of infrastructure, it will make the literature review much easier. Now I will use two slides to explain exactly how we organise our catalog and then adopt the RD Switchboard technically. We organise our catalog in a hierarchical structure. At the top, as you can see here, is an NCI GeoNetwork node, which holds only the top collection level, the high-level summary of each data collection; at the moment, we have more than 200. At the middle level, every single project has its own GeoNetwork catalog. GeoNetwork is our metadata display interface, but you could use other interfaces as well. At NCI, each individual project might have thousands of records at the file or granule level. It is not appropriate to put all these different granularities of catalog in a single node, because then it is hard to separate them, and harder to aggregate them by research domain, for example.
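The three questions above each map onto a simple graph query. In Cypher this might look something like `MATCH (d:dataset)<-[:references]-(p:publication) RETURN d`, but here is a hedged plain-Python equivalent over a toy edge list, so the mapping is concrete; the node names and relationship labels are hypothetical, not NCI's real identifiers or the Switchboard's actual schema.

```python
# Toy edge list: (source, relationship, target). Illustrative only.
edges = [
    ("doi:paper1", "references", "doi:dataset1"),
    ("doi:paper2", "references", "doi:dataset1"),
    ("orcid:researcher1", "uses", "doi:dataset1"),
    ("orcid:researcher2", "uses", "doi:dataset1"),
]

def datasets_referenced_in_papers(edges):
    """Q1: which datasets are referenced in journal articles?"""
    return {dst for src, rel, dst in edges if rel == "references"}

def who_to_contact(edges, dataset):
    """Q3: who generated or used this dataset (candidate contacts)?"""
    return {src for src, rel, dst in edges
            if rel == "uses" and dst == dataset}
```

Question 2, awareness within the community, is the same pattern with the count of distinct researchers and institutes connected to each dataset.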
So we use this structure; it gives us the flexibility to do more aggregation at a later stage. You can check out our main GeoNetwork website using this link. What the RD Switchboard does is this: every single GeoNetwork instance has its own individual database, we dump those databases into the RD Switchboard graph database, and the connections are made. I do not show exactly how it does this here, but the magic happens in this box, where the identifiers are used to merge the different nodes so that the connections are made. This screenshot shows the current status: using NCI metadata, we find some connections, for example between datasets, researchers, and institutes. But I also noticed that some nodes are disconnected. In my follow-up processing I found out they are actually connected, but our metadata lacks some critical information, so when I present the database in a graphical view, they appear disconnected. It means I have to correct some of the metadata in our database. So far, as I explained, the RD Switchboard has helped me identify missing critical metadata entries, which I can now provide to make the records more complete. Sometimes it also helps me identify errors in the catalog, which I can then easily fix. Without the RD Switchboard it would be almost impossible, because we have hundreds of thousands of records; it is hard to check them manually, but the RD Switchboard can tell me immediately. The RD Switchboard graph view also provides an analytical view of how research data has been used so far. This is a very common question, asked many times by our users, because they care about who uses their data and how to make the data more public, to make more connections to the external world. The RD Switchboard is an ideal tool to make that happen.
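The completeness check described above, spotting disconnected nodes that usually signal a missing identifier in the source metadata, can be sketched in a few lines. This is only an illustration of the idea, not the Switchboard's implementation; the record names are invented.

```python
def find_disconnected(nodes, edges):
    """Return nodes that take part in no relationship at all."""
    connected = {n for edge in edges for n in edge}
    return nodes - connected

# Toy merged graph: data2's metadata lacks the ORCID that would link it.
nodes = {"doi:data1", "orcid:A", "doi:data2"}
edges = {("orcid:A", "doi:data1")}

missing = find_disconnected(nodes, edges)
# `missing` lists the isolated nodes whose metadata needs fixing.
```

Running such a check after each metadata dump gives a short, actionable list of records to repair, instead of eyeballing hundreds of thousands of entries.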
It can also help me evaluate the impact of datasets, researchers, and institutes, based on the assumption that the more connections an entity has, the bigger its impact, if you like. Finally, as you saw in Amir's demonstration, if a researcher does not have an ORCID, those queries will not work. So it is a good motivation or encouragement for researchers to register an ORCID, for data managers to mint DOIs for their datasets, and for data repositories to provide persistent identifiers, to increase the accessibility of the datasets. So these are our experiences so far. I would like to end my presentation with a real example of how helpful this is from a data repository point of view, and that is the basic question: which datasets are connected to each other? We had a group from the Bureau of Meteorology who downloaded climate reanalysis data from the US, because it is too large and a number of people wanted to use it. They were approved by NCI to store it at NCI so they could use it. After a little while, another climate research group, from CSIRO, downloaded climate reanalysis data from the same source, but a different portion, a different subset, wanting to do some research. Those two groups did not know each other, but they both came to me wanting to find storage at NCI to support their research. I suggested that since they share a common interest, why not talk to each other, so that if group A has already downloaded some data that group B can use, and vice versa, they can share it; and they started talking to each other. After a few months, a third group, also from the Bureau of Meteorology but a different branch, asked something very similar about using and sharing reanalysis data, and I suggested the same thing. However, acting as a human communication hub is difficult, and it is time consuming.
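The impact heuristic mentioned at the start of this passage, more connections implies bigger impact, is essentially degree centrality, and a minimal sketch looks like this. The entity names are illustrative only, and real impact assessment would of course weigh relationship types, not just count them.

```python
from collections import Counter

def degree_ranking(edges):
    """Count connections per node and rank by degree, highest first."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree.most_common()

# Toy graph: one dataset linked to two researchers and one paper.
edges = [("orcid:A", "doi:data1"),
         ("orcid:B", "doi:data1"),
         ("doi:data1", "doi:paper1")]
# degree_ranking(edges)[0] is the most connected entity in the graph.
```

On this toy graph the dataset comes out on top with three connections, matching the intuition that a widely linked dataset has had the most visible use.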
I can see a good chance for the RD Switchboard to now play my role as a communication hub, presenting those connections automatically in the graph database. People can go there anytime, 24/7, check the connections of their datasets, and start talking to each other without talking to me. It might also motivate collaboration between different groups when they see the connections. That is my hope for the RD Switchboard as NCI adopts it: we can offer those kinds of services to our users. In summary, the RD Switchboard is a great tool for creating linkages among researchers, datasets, and publications. The Neo4j graph view is eye catching and makes the complex interconnections within the research community straightforward to understand. I also feel that data management is a joint effort by the whole research community and the librarian community. That is the end of my talk; I will hand back to Amir as presenter. Thanks Jinbo, thank you very much. What will happen after this webinar? We will have a BoF session at the eResearch conference; that will be one place you can find us. I believe Jinbo will be there as well, and we also have Nataniel from the University of Sydney. We will talk about this technology further at the BoF. For those of you who are interested in getting the Neo4j database I was using for the demo, it is actually quite a big file; I can give it to you on a USB disk. So that is one quick way: next week, if you come to the eResearch conference and find me, I can give you the file. If that does not happen, Jinbo and I are putting that database on the NCI platform as an open access repository, so other people will be able to download it. That is a publication process that we are in the middle of, so I do not know exactly how long it will take for the dataset to appear, but it should be available soon. Regardless, the slides will be available online.
If you have any further questions about this Neo4j technology, you can send me an email at this address. The entire code, for both the Research Data Switchboard and the Research Graph database structure, is all on GitHub; the links are on an earlier slide. I can also bring up the screenshot of this. The first thing you would probably need to do, if you want to create a graph database, is go to Neo4j and download the Neo4j source code and compile it. But the easiest way is the Neo4j repository we have here, which is already compiled: every plugin, everything has been built. You can just download it and run it, and that is your database, ready to go. Your schema is in the schema repository here, and there is also a page on the web that explains how the schema works. There are some crosswalks in the repository that show how you can do the crosswalk if you want to import data. In the Neo4j structure, if you remember, there was a harvesting point: when you do the harvesting, the information needs to go to the Switchboard, and the Switchboard code is all on the GitHub repository under the Switchboard name. We also have multiple instances here, so multiple repositories. And the code that, as Jinbo said, makes the magic happen is in the inference repository. Now we have about five minutes that we can allocate to questions and answers. We have one question from Christopher: what processing power is required for the queries? The answer is that it very much depends on two things. The graph size is the obvious element. The other one is your indexing of the properties. In the ideal world, if you index everything, that obviously makes queries much, much quicker; it reduces the computation power needed, but requires more storage.
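The indexing trade-off in that answer, less compute per query in exchange for extra storage, can be made concrete with a toy sketch. This is not Neo4j's internal index structure, just the general principle: an index over a property turns a full scan into a direct lookup.

```python
# Toy dataset standing in for nodes with a "key" property. Names invented.
records = [{"key": f"doi:{i}", "title": f"dataset {i}"} for i in range(100_000)]

def scan_lookup(records, key):
    """Unindexed lookup: examine records one by one (slow, no extra storage)."""
    return next(r for r in records if r["key"] == key)

# Indexed lookup: build the index once (extra storage), then each
# lookup is a single dictionary access instead of a scan.
index = {r["key"]: r for r in records}

assert scan_lookup(records, "doi:99999") is index["doi:99999"]
```

In Neo4j the equivalent is creating a schema index on the property you match on, so that `MATCH` clauses on that property do not have to touch every node with that label.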
The trade-off is that you can index fewer properties, and then you will need more computation power allocated to the queries. But the example I showed was actually running on my MacBook Pro, which has an i5 processor. It is a six-million-node graph database, and I am running the GoToWebinar software plus a PowerPoint presentation, with Neo4j running in the background. So that is not expensive to run. The thing that is expensive to run is the inference engine. That one requires high performance computing and a lot of memory, because it is multi-threaded processing. In our case, on Amazon, we have a machine, which I believe was on one of the earlier slides, with 36 cores and 60 gigabytes of RAM, and it takes about 72 hours to complete the pipeline. The way the pipeline works, if you have a machine with half of the power, it does not take twice the time; it actually takes about eight times longer. So for running the inference engine on a large graph database, you need a very powerful machine, or a set of powerful machines as a cluster. This actually opens a conversation about something called a distributed graph. I briefly mentioned that, and it is why graph projects are now adopting the idea of having a cluster of graph databases running on different platforms. That is something we probably need to take up in another webinar or another technical meeting. I have one question here: for the RD Switchboard, is a form-based search option planned, rather than the search query string? Yes, the answer is yes. We are working on a couple of different options for this. One is that, for institutional repositories, we are at the moment exploring the idea of integration right into repositories like DSpace, which enables the whole platform to work like a plugin.
Also, on the Research Graph website, we are working on the idea of providing, as the question said, a form-based search, so you can type your queries and get to the graph without actually loading the whole of Neo4j. Okay, the next question: is there any study comparing the Neo4j graph with other traditional views? There are two different comparisons here, so the question really is whether there is any study comparing Neo4j technology with other technologies. There are a lot of studies on the web: search for "Neo4j compare" on Google and it brings up lots of options. There are studies comparing Neo4j with traditional databases like SQL databases, there are comparisons of Neo4j with NoSQL databases like MongoDB, and there are comparisons between Neo4j and triple stores. In this context, the first one is quite obvious: SQL databases are tuned for string-based search and have a different structure, and the problem with those databases for this kind of scenario is that finding a chain of relationships is a very, very expensive process. As for NoSQL databases, Neo4j is one of those, so you are comparing different items in the same category. There are a number of options: for example, there is a similar product in this category called OrientDB, and another called TitanDB, so there are a couple in this group. You can search the web for the performance differences; the main differences are performance, simplicity of use, and interoperability with other tools and platforms. On the semantic web side, the main difference between Neo4j and a triple store is the inference model. I would say Neo4j is far less capable of expressing complex logic, but at the same time it gives you simplicity of implementation and query performance, so it is much quicker to get the data out of Neo4j.
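The "chain of relationships" point above is the crux of the SQL comparison: in a relational database each hop in the chain is another join, while a graph store simply walks adjacency lists, so chains of arbitrary length stay cheap. A minimal breadth-first traversal over a toy adjacency map illustrates the graph-side operation; the entity names are hypothetical.

```python
from collections import deque

def reachable(adjacency, start):
    """Breadth-first traversal: every node reachable from `start`,
    however long the chain of relationships is."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy three-hop chain: researcher -> dataset -> paper -> grant.
adjacency = {"researcher": ["dataset"],
             "dataset": ["paper"],
             "paper": ["grant"]}
```

Expressing the same three-hop chain in SQL would take three self-joins over a relationships table, and the cost grows with every additional hop, which is the expense the answer refers to.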
However, there are different triple stores, so different technologies, and I remember in 2014, when we were doing the project for the first time, we did some case studies at the time. Our experience suggested that triple store technology requires more computation power, but that might have changed in the last two years. Okay, the next question: does the complexity of the inference improve the search quality? The short answer is yes, it does. The last question: is the RD Switchboard implemented within the Research Data Alliance? In a way, yes; it was a collaborative project that was initially started by ANDS, Dryad, and CERN, and then other partners joined. We had infrastructure contributions, data contributions, and coding contributions from different partners. Overall, we can say this is the implementation of what the working group recommended: the working group came up with a recommendation for different connections, and then we implemented that in the Switchboard model. And there was a question about Research Data Australia; the actual question was whether the RD Switchboard is implemented in Research Data Australia, which I misread. The answer is that we use the Research Data Switchboard to enrich Research Data Australia. It is one of the linking capabilities that ANDS has, but at the moment we do not have the graph visualiser. That is one of the items in our pipeline, which we have already planned into our development cycle, and in future versions of Research Data Australia we are working to have a graph visualiser that provides some of this information. Okay, thank you everyone. I believe I am four minutes over the time of the webinar, so I would like to thank everyone for attending.