Okay, so this talk is about the Research Data Switchboard. This is a collaborative project that came out of one of the working groups in the Research Data Alliance. In the next 20 to 30 minutes, I will give an overview of what it is, how it came about, why it came about, how it works, and how you can benefit from it.

The project came out of a working group in the Research Data Alliance called Data Description Registry Interoperability. This group was created by a number of partners, initially to enable cross-platform discovery between different services, and the goal of the project was to create a platform so we could connect data between the systems that these providers were hosting in their own infrastructures. The project is now at the final stage of the working group. Before I get to the project itself, I will talk about the Research Data Alliance and how these things come together.

The Research Data Alliance is a collaboration of international partners who are all involved in research data infrastructure. It is funded through a coalition of the Australian Commonwealth Government, the European Commission, and the U.S. National Science Foundation. It started in 2013, and since then there has been rapid growth in the number of groups inside the Research Data Alliance. The structure of the Research Data Alliance is basically interest groups and working groups. Interest groups are groups of people who are interested in a topic and discuss it, but the actual work and the projects happen through the working groups. The working groups have a timeline: we start with a specific proposal and work for 12 to 18 months to complete it.

When the group started, we had only ANDS, Joya and CERN as partners, but later on we got a number of other partners in this group. You can see the names of those partners on the slide. The group initially started with the idea that we could do much better than just keyword searching across our platforms. One way of thinking about this: we were looking at Amazon, and we realized that when you look at a book, it tells you what other books are by the same author, or what other publications in the same category are tightly connected to it and have been purchased by the same customers before. So we thought, okay, we can implement a similar infrastructure: we can connect the data sets across our platforms based on joint collaboration, based on authorship, based on a shared grant or connected publications. Beyond just vocabulary search inside the content of a data set, we were planning to connect information based on the other information that, in scholarly records, links the data to publications, grants and other systems. In other systems you get a similar connection. So we will have a bridge between two data sets based on, for example, a joint grant; or if a professor collaborates with someone else in another country and they also work on the same or a connected data set, you can see all of those connections made visible.

What I am going to do in the next slide is show you what the problem is in practice, because everything that I explained you can actually do today using a manual process of reviewing the literature.
This is a data set in Research Data Australia, and this data set has been curated and published by the University of Sydney. It has a link: you can click on it, go to the page and review the data set. Now, the question I have is this: we have two authors on this work, Professor Katherine Bloch and Dr. Emily Wong. Can I find anything else on the web from these researchers? Have these authors published any other data set in this domain? So this is a kind of problem statement: can I find other data sets by the same author?

If I search for these authors on the web, I will find a page for Professor Katherine Bloch. This is from the University of Sydney, and when you look at this page in detail, you can see different information about the author, including grants and publications. If I scroll down, you can see the list of journal publications, conference papers, grants and book chapters. There are 105 more publications in this list, and we could technically search every single one of them on the web and do the literature review to find out exactly what they are and whether there is any data set connected to them. If you do that, you will find that one of them is a paper in PLOS ONE, and in the body of that paper there is a link to a data set in Dryad. So I went through the whole trail of links from the original data set in ANDS to the author, to the paper, to the data set. And this is another data set which we discovered in the body of that paper, by the same author.

The process I just explained fulfills the requirement of the initial exercise: we found another data set by the same author. And this is a much more accurate way to find related data sets than doing keyword searches on Google for the same topic. I remember when I was doing my PhD, I did a similar thing with the literature to find other books by the same author. It gave me a cohesive view of the activities and discoveries in that domain. The problem is that this is not a scalable process.

From that, we got to the question: can we automate this process? And that is how we got to the metaphor of the switchboard. In the past, in the telecom companies, there was a group of people who connected the lines: they connected one person to another person manually from the address book. These days, these sorts of activities happen completely automatically using computers and automated systems. The goal here is to do the same thing for the research data environment.

To make this happen, we built a system. One of the goals of this working group was not to invent a new metadata schema and not to work on a new standard. The goal was to work purely on a software development project using existing technology. The reason for that is we had only 12 months, later extended to 18 months, so we had limited time and we wanted production-level infrastructure. To make that happen, we used existing data mining techniques and existing software development technology with the existing schemas. We also made the platform as agnostic as possible to the schema format.
So it can digest many different formats. The actual architecture of this system contains three layers. We have the harvester, and we have the graph creation layer, which does the bulk of the job: it has different integrations with external systems, it uses the Google API, and it uses different search technologies to make all the connections happen. Then we push this information into the API consumer layer, where you can ingest this information into your own system using our API, or use a browser interface to explore and look at the connections in this environment.

To explain exactly how this works, I am going to show you the graph behind it. This is almost looking under the bonnet, but it is a good exercise to see the concept. Behind the Research Data Switchboard there is a graph database called Neo4j, which aggregates all the connections and information in the system. In this environment, we can write queries to search for different records and information. In this example, we look at the data set from Dryad. If I look at the slide I had here, this was my DOI, and I can search for this DOI. So in this environment I can write a query; another note here is that you do not need to do this yourself, it is just for demonstration purposes. What will happen is that you get a data set, which in this case is a Dryad data set. Now I can click on this, and this gives me another data set from Dryad and a researcher. This researcher was discovered by the computer: there is no manual editing and no authority control behind this infrastructure. It is all done by the harvesting and automated systems.

The researcher who is discovered and connected to the data set has a URL here. I can transfer this URL to the browser and check the researcher, and this is the same page for Professor Katherine Bloch that I have on my slide, with the list of papers. Now, in this environment, we can query the system further. I can click on the researcher and get all the publications by this researcher, I can get the grants, or I can get another data set, which is also from Dryad. This information, again, was all collected through our harvesting system.

When I look at the different grants here, these grants can also be investigated further. In this case, I need to zoom out a bit to give myself some space, and then I can move this one here, grab it and query it further. In this scenario, I have other connected records, all coming from ANDS, and this is a party record for Professor Katherine Bloch, who is the manager of this grant, or a participant in this grant. If I click on this, I can see further records from the same person, which in this case is another data set. The same as I did with the Dryad record, I can grab this URL, go to the web and look at the record. And this is where we started from in the first place.

So the process is that the graph database collects all of this information at the core, and we then expose this information in the environment that you are looking at right now. This is a browser interface, which is based on the ANDS software package.
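To make the "under the bonnet" demonstration a little more concrete, here is a minimal sketch of how such a lookup could be written against a local copy of the graph using the official Neo4j Python driver. The connection details, the node label "dataset", the "doi" property and the example DOI are assumptions for illustration only; the actual switchboard graph may use different names.

```python
# A minimal sketch (not the switchboard's actual code) of querying a local
# copy of the graph with the official Neo4j Python driver. The connection
# details, the node label "dataset" and the "doi" property are assumptions
# for illustration; the real graph may use different labels and properties.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def related_records(tx, doi):
    # Find the data set node for a DOI and return everything one hop away:
    # researchers, grants, publications or other data sets connected to it.
    result = tx.run(
        "MATCH (d:dataset {doi: $doi})--(related) RETURN related",
        doi=doi,
    )
    return [record["related"] for record in result]

with driver.session() as session:
    # Placeholder DOI -- substitute the DOI of a data set in the graph.
    neighbours = session.execute_read(related_records, "10.xxxx/example-dataset")
    for node in neighbours:
        print(dict(node))

driver.close()
```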
So here, when you go to the homepage, similar to the ANDS Research Data Australia site, you have the option to search for different records, and when you find one, you can get further information about that particular data set and its connections to other records. One thing to note is that we have two versions of this website. This is an American-hosted version, which can be a bit slow. What I will do here is post the Australian version, which is currently hosted in Australia, in case you need to access this locally and play with the environment. I am posting this to the chat; I can only post it to Susanna, but I think Susanna can post it to everyone else.

Okay, there is one more thing I can show you here. In this browser environment, when you go to a particular data set, and this was the one we looked at in the first place, underneath there is a widget that visualizes the same connections, and we have this widget for every single data set, grant, paper and researcher in this environment. If I look at another data set or a grant in this environment, I can double click on it and go to that page, and if it is a data set, you will get the graph. That is because, at its core, the job of this system is connecting information to data sets, so we produce and visualize information for data sets in this environment by default; at this stage, our graph visualization is limited to connections with a data set as the starting point.

Now, the widget that you can see here is open source, and you can not only install this widget on your own website, but also, through the API system that we have, pull this information, pull the graph, and put it in your page. So you can have a graph of your data set connectivity on your homepage. The source code for this is available on GitHub under the MIT open source license.

This is another slide about how these things are connected and what we mean by degrees of separation. This is an example of three degrees of separation for the connection between different data sets. We have a data set which is linked to a researcher; the researcher is also an author of a paper; and that paper is connected to another data set. We can have a data set which is linked to a paper; the paper has an acknowledgement to a grant; and the grant is linked to another data set. These are the sorts of connectivity that we get a lot of from our international partners. For example, for this kind of connection between papers and grants, we started to work with our international partners, and there is a lot of information coming from Europe and the US about these kinds of connectivity, which all gets aggregated into the body of the switchboard. In the last example, we have a data set which is linked to a researcher, the researcher is a participant in a grant, and the grant is also linked to a data set. So this is another example of three degrees of separation. In the current switchboard browser, we visualize everything up to four degrees of separation, and that is the limit of the system's exploration at this stage, but we might extend this later based on further demand.
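To give a sense of what the degrees-of-separation expansion looks like as a query, here is a small sketch against the same kind of Neo4j graph, again with assumed labels, properties and a placeholder DOI. It follows paths of up to four relationships from a starting data set, mirroring the four-degree limit of the browser; the real inference engine applies its own accuracy filtering on top of this.

```python
# A sketch of the "degrees of separation" idea as a variable-length Cypher
# query. Node labels, properties and the starting DOI are placeholders; the
# switchboard itself applies additional filtering before showing a connection.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Follow any chain of up to four relationships from the starting data set and
# return the other data sets reached at the end of those chains.
query = """
MATCH (d:dataset {doi: $doi})-[*..4]-(other:dataset)
WHERE other <> d
RETURN DISTINCT other
"""

with driver.session() as session:
    result = session.run(query, doi="10.xxxx/example-dataset")  # placeholder DOI
    for record in result:
        print(dict(record["other"]))

driver.close()
```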
The point is that the more degrees of separation we follow through these connections, the harder the inference engine needs to work, and getting the data out of the system also becomes more complicated, because more information is connected to every single node.

If you want to access this information, you can use our browser interface, which is currently a beta release. We are planning the final software release for probably late September, but the date is not confirmed yet. The API is available right now for beta testing. So if you want to participate in this project as a beta test site, you can send me an email and I can give you a key, and you will have access to our API at this stage; a rough sketch of what such a call might look like is shown at the end of this section. The code is also available on GitHub.

Now, if you want to include your data in the system: the previous slide was about getting data out of the system, but if you want to put your data into the system, currently the switchboard platform harvests records from the ANDS repository, from Dryad, from CERN, from figshare, from ORCID and a number of other partners. By default, we have the capability of reading information in the RIF-CS format. So if you already have your data in Research Data Australia, then we are already reading that information. If you have an external repository that you want to add to this data site in particular, you can do that: there is a harvest platform that can read this information independently of Research Data Australia.

If you want to see improved discovery of your data set connections, these are the things that we suggest including in your RIF-CS objects, so that the inference engine produces better results, or a larger number of connections. One of them is connectivity between data sets and researchers: when you have the researcher connected to the data set, we have more information with which to go and crawl the web. When you have researchers connected to grants, the level of accuracy improves. The same applies to connectivity between researchers and publications, and when you have data sets connected to grants and publications, both of these sorts of connectivity also improve the accuracy of the results. The system, by default, is geared toward high accuracy: if we identify a case that is not 100% accurate, or not close to 100% accurate, the system drops it. That is why including this information helps to find more connections; there might be a connection already in the system that the inference engine looks at and says, I do not have enough evidence, there is some name ambiguity here, and for that reason I cannot establish the connection.

The most important point of all of this is to use persistent identifiers: the DOI for data sets and publications, the PURL for grants, and ORCID or ISNI for researchers. In some cases these let us identify the records unambiguously, and if there is a connection between the ISNI records and international publications, that also helps. The National Library identifier is already built into the RIF-CS model, so we also have that connectivity in our environment.
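Putting the API access together with the persistent identifiers just mentioned, a call to the API could look roughly like the sketch below. The base URL, the path, the parameter name, the authorization scheme and the response shape are all placeholders made up for illustration; the actual endpoints and the key come with the beta access.

```python
# A rough, hypothetical sketch of pulling the connection graph for one data
# set from the switchboard API. The base URL, path, parameter names, auth
# scheme and response shape are placeholders, not the documented API.
import requests

API_BASE = "https://example.researchdataswitchboard.org/api"  # placeholder URL
API_KEY = "your-beta-key-here"  # obtained by emailing the project team

def fetch_graph(doi):
    # Ask the API for the records connected to the data set with this DOI.
    response = requests.get(
        f"{API_BASE}/graph",
        params={"doi": doi},
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

graph = fetch_graph("10.xxxx/example-dataset")  # placeholder DOI
print(len(graph.get("nodes", [])), "connected records")  # assumed response shape
```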
Now, to get more information about this, we have the researchdataswitchboard.org website, which provides abstract information about the project. There will be new updates on that website very soon about the release date and about other infrastructure components around the system. I realize I have actually missed the GitHub repository link on this slide, so I will post that into the chat box, or actually I can do better than that: after this slide, I will bring up the whole GitHub repository so you can copy it from there. There is also a Twitter handle for this project, where we send out updates.

The RDA working group is a core part of the activities that formed this project. You can join the group, and I highly encourage you to. Close to the next plenary, which happens in September in Europe, there will be more presentations and talks about the progress of this project.

This is the GitHub repository for the core part of the system; you can see the harvester here. We have some other repositories connected to our system. All the code here is open source and licensed under open source licenses, so you can use it. Since at this stage we are still in the development pipeline, some of the repositories contain code that you may not be able to compile. Part of the reason is that this code is integrated with other code, so it is not meant to be deployed by itself. If you have any questions about this, please contact me directly by email and I can guide you through how to compile the code and how to use it.

Okay, with that I can go back to our final slide, and from this point we can open the discussion. If you have questions, I can answer them. Thank you very much.