Hello, everyone, and welcome to today's ANDS webinar on linking data and publications: the Scholix initiative. So let's get started. My name's Natasha Simons, I'm from the Australian National Data Service, or ANDS, and I'm your host for today. I'd like to introduce our speaker for today, Dr Adrian Burton. Adrian is the Director of Services for ANDS, and he's based in Canberra. So I'll now hand over to Adrian. Thanks, Natasha. So today we're talking about the Scholix initiative, as Natasha indicated. This is about linking data and literature. By literature we mean published scholarly communications: journal articles, books, reports. And by data we're thinking quite broadly: data sets, data services, models, software, et cetera. So today is actually a story. It's the story of a lonely data access portal way off in the Southern Hemisphere, in Australia. It holds a very important data set, data set D, a very important research asset, and the Data Access Portal is pretty pleased to have made it available as an asset for further research. As it happens, half a world away, in the Journal of Studies on the other side of the planet, someone has published an article, article A, and it mentions the famous data set D from the Data Access Portal. That's really good news, because that's why they started the Data Access Portal: to build new research on old research, to return investment on the investment in data, and to spark innovation in new research. So this is really good news, that a journal article has been written based on the Data Access Portal. The problem is, it's a very long way away, and in fact the people over at the Data Access Portal have no idea that this journal article has been written. So all their good work has come to fruition, but there's no way for them to know where and in which articles their data may have been referenced.
So the manager of this data center says, we really do need to know what's happening with our data. We really need to know when it's being used in research. So he puts a couple of information professionals to work and says, okay, we need to sort this. I want you to search the internet. I want you to get access to all these journals, get the full text, and mine through them for the title of our data set D. We know people are using it; we need to find it somewhere in the scholarly communications, and then we can build tools. And so for a number of years they've got three information professionals scouring the internet, building tools, building up a view of what's out there. Now, there are lots of journals. That's the proviso here: at least tens of thousands of journals. To get an idea of how many journal articles there are, Crossref has over 75 million DOIs, so that's at least 75 million published journal articles, and there's a lot more scholarly literature beyond that. So it's quite a big job. And over five years they've really put in a lot of work. They've got full-text access to the particular journals where they know their scientists have published, but it's not a terribly satisfactory result, because it's still not really comprehensive and they can't be sure. It's really a bit of a coat-hanger-and-string job rather than anything terribly robust. The other problem is that there are lots and lots of data centers as well, and they're all putting in exactly the same effort all over the place: scouring, searching, trying to get access, trying to build tools. So this chapter of the story comes to a close on a bit of an unsatisfactory tone, with everyone looking for needles in haystacks all over the world. In a parallel chapter, there's another story: a different journal, off in some remote part of the world.
They have published another article, article A here, which is good; that's what journals should do. But people have said to them, wait a minute, this is really good research. We'd love to be able to get our hands on the data that underpins the findings. We'd like to look at the models, we'd like to see the software, we'd like to see the data, but there's no mention of it in your journal. So the journal thinks, right, we need to do something about this; we need to start contacting data centers. As it turns out, there is a data center in Australia where a data set has been deposited, and the description even mentions the fact that this data set underpinned the research that was published in that journal article A. So that's the cruel and bitter irony: the data had actually been deposited somewhere, but the journal doesn't know about it. So the journal starts to think, right, we need to establish bilateral relationships with a number of important data centers, so that we can find out whether the journal articles we have published are mentioned in the descriptions of any data sets anywhere in the world. Again, this is even more of a cottage industry, because each data center has a slightly different interface and expresses the information about a link between a data set and literature in a slightly different way. This really is a hard slog, and there are lots of data centers, obviously, so a lot of these individual bilateral arrangements need to be made to try to find, again, this needle in the haystack of which data center might have data that underpins our journal. And as we saw, there are lots of journals and lots of other publications around the place, and they're all either not doing this because it's just too hard, or, if they do, they're replicating all these bilateral arrangements separately again.
So the second chapter of our story again comes to a rather unsatisfactory end, where we've got a little bit of a view of the links between data and literature, but not really a very upbeat ending. Enter, stage left: Scholix, trying to move into that center ground, a whole set of players in the scholarly communications world trying to see whether we could do this a little bit better. It's a working group sponsored by the Research Data Alliance and the World Data System. A number of publishers, peak publishing bodies, data centers, service providers and infrastructure providers have all come together in this working group to say, look, we really should be able to do something a little bit better here. The first step is to say, let's all have at least a common idea of what's happening here and some common language. Really, what we're talking about is quite simple. There are two objects: one is a data set and one is a piece of literature. They are linked, and there is a relationship between the two, and we get that information from some of the players in the scholarly system. So the working group starts to build up at least a set of common language so that we can start to attack this problem in a shared way. Now, going back to that very messy exchange of information, as it turns out, some of the members of this working group are natural hubs for this kind of information. Crossref collects all sorts of references from journals all over the world; thousands of journals are actually providing information into Crossref about their references. So there's already a kind of natural community hub there in Crossref. DataCite was another of the members of the working group, and they were already receiving information about data sets. One of the pieces of information that you can provide to DataCite is a related identifier, which in lots of cases means a related piece of literature.
So DataCite already have this relationship with hundreds and hundreds of data centers around the world, and they're collecting that information. OpenAIRE is another example: a global aggregator of information from institutional repositories. These institutional repositories do contain data and literature, and sometimes they know about the link between those data sets and literature, or vice versa. So we already had some natural community hubs that could at least tidy up a lot of that crisscrossing information. The idea was that if some of these natural hubs could simply exchange information between them, that would simplify all those one-on-one relationships that we saw earlier in the story. So that was what was proposed, at least as a starting point. Now, of course, these are not the only communities in the world, and the Scholix initiative is open to new hubs and new communities who can bring their information in. But the idea is that the thousands of data centers and all the different journals in the world don't all need to be exchanging information with each other when there are hubs; if the hubs can just exchange the information, that makes life easier for everyone. So these communities did agree, and there's now a Scholix link information package that they agreed on, a way for these big community hubs to exchange information. I won't go into all the details of this; if you would like to become a hub, then join the working group and you can get all this information. But basically it's very minimal information about the two objects, the source object and the target, one, for example, being a journal article and the other being a data set. You give some basic information about the two objects, and then a little bit of information about the link itself. So that was part of the work of this working group: to agree on an interchange format. Now, it can be interchanged in lots of different ways.
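To make the shape of that interchange concrete, a link information package of the kind just described can be sketched as a simple record with a source object, a target object and a relationship. This is only an illustrative sketch in Python; the field names, identifiers and values here are invented for the example, and the exact package format is defined by the working group's specification.

```python
# Rough sketch of a minimal Scholix-style link information package as a plain
# Python dict. All names and identifiers below are illustrative placeholders;
# consult the working group's schema for the exact, current specification.
link_package = {
    "Source": {  # the object the link is asserted from, e.g. a journal article
        "Identifier": {"ID": "10.1000/example.article", "IDScheme": "doi"},
        "Type": "literature",
        "Title": "An example journal article",
    },
    "Target": {  # the object the link points to, e.g. a data set
        "Identifier": {"ID": "10.4225/example.dataset", "IDScheme": "doi"},
        "Type": "dataset",
        "Title": "An example data set",
    },
    "RelationshipType": "References",  # how the source relates to the target
    "LinkProvider": "Example Hub",     # which hub asserted this link
}
```

The point of the minimal design is that a hub only has to know how to fill in these few fields, regardless of how its own internal metadata is structured.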
Currently, the hubs have agreed to exchange that information through some very simple open APIs using JSON. Once that information could flow between the hubs, we were able to establish an aggregation of it. It's called the DLI service, which stands for the Data-Literature Interlinking service. Our colleagues at OpenAIRE kindly did all the development for this. It's the first such aggregation; all this information is open, so we're encouraging lots of people to aggregate it, potentially for a domain or for a community. But the DLI service is the first global aggregation of this information, and it's there really to push forward with a number of test-bed implementations. So the DLI service aggregates the information from those hubs, and now we have a much tidier kind of architecture. So what does that mean for those two rather unsatisfactory stories that we started with? One of those stories, do you remember, was that there was a data center that knew about the link between a data set and a journal article that had been published by a journal, but the journal didn't know about it. So what would that mean in this new world? Here's a real-world example. This is a page from Scopus. Scopus is not a journal, but it is an abstracting and indexing database that pulls in information about journal publications. So here they've pulled in some information about a particular journal publication. Previously, nobody had any idea what data was linked to this publication; the journal certainly didn't know, so there's no way Scopus could have known either. So I've just added a new entity to our story here: an abstracting and indexing database that has indexed the information about that journal article, but still has no idea whether there is any link to the underlying data.
So now what's possible is for them to fire off a query to that service and find out that, actually, there is a link to a data set somewhere, and now we can provide that link. So now, when you are in Scopus, as they load this page up, they fire off that little query to the DLI service. It's based on the persistent identifier of the journal article, the DOI. They just send it in and ask: do you know of any data sets that are related to this journal article? And as we saw, the response came back and said yes. There it is. So now there's a little information panel in Scopus that has the title of this data set, and you can click on it and go back to the University of Adelaide. So that's a much happier ending to that little story. And as you can see, it's a new little panel that's on every appropriate page within Scopus. Do you remember the second story? That was where the journal published an article, there was a reference to some data, but the original data center didn't know about that. So how would that look under this new arrangement? I'll give the example here of a data center, GenBank. I've put a big arrow there because they're very modest about how they brand themselves. So this is a GenBank page for a gene sequence. That's as much as I know about it, being a linguist; it all looks like gobbledygook to me, but it's something about mice. So there's a gene sequence here for something about mice. Now, the important thing to note, even if we don't understand what it's all about, is that there is an identifier for this data, this sequence. It's the reference sequence number, NM_010186.5. So under our new arrangements, the data center that I've portrayed over here has this data about this gene sequence, but they're not 100% sure which published literature has a reference to it.
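In both directions, the lookup just described is the same operation: given a persistent identifier, whether a DOI as in the Scopus case or an accession number as in the GenBank case, ask the aggregation service which research objects are linked to it. A minimal sketch of building such a "links from PID" query follows; note that the base URL and parameter names are placeholders, not the real DLI API.

```python
from urllib.parse import urlencode

# Sketch of a "links from PID" lookup: given an identifier, build the query
# URL that asks an aggregation service for related research objects.
# The base URL and parameter names here are illustrative placeholders only.
def build_links_query(pid: str, pid_type: str = "doi",
                      base: str = "https://dli.example.org/api/links") -> str:
    """Return a query URL asking which objects are linked to this PID."""
    return base + "?" + urlencode({"pid": pid, "pidType": pid_type})

# A journal article's DOI, as in the Scopus example:
url = build_links_query("10.1016/j.example.2016.01.001")
```

A data center would issue the same kind of query with its own identifier scheme, for example `build_links_query("NM_010186.5", pid_type="refseq")`, and inspect the response for linked literature.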
So they can now send that identifier across, and the reply comes back saying, yes, actually, we do have a journal article that has referenced that sequence. There it is: it's in the journal Veterinary Immunology and Immunopathology, and there are two references to that sequence in there. So that, again, is a much happier ending. So if we think that is a happy ending, then we're now encouraging people to get involved. I should pause here and say that these have been pathfinder projects and test beds as part of the working group, but I think there's a pretty good model there that can work. And the first step is to get a bit more coverage into the Scholix information ecosystem. So how do we get information in there? The good news is, do you remember we said that the thousands of individual data centers and the thousands of journals, et cetera, don't need to change what they're doing? If you have, for example, a relationship with DataCite, then you simply need to add this little piece of information to your DataCite metadata: a related identifier carrying the DOI, and that will be included in the Scholix ecosystem. The information on that is in the DataCite schema, and the link is down at the bottom of the slide. If you're a journal, again, you don't need to do anything different from what you're doing now. You're already giving information to Crossref in all likelihood, and there are a couple of different ways in which journals can give references to Crossref. The bottom one is a standard sort of citation format. The top one is a newer thing called a related item, which allows you to give a slightly richer view of what the related item is, and as you can see, it's got the identifiers and a little description. But the thing to note here is that this is all standard Crossref exchange metadata.
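For the DataCite route just mentioned, the extra piece of metadata is a relatedIdentifier entry in the data set's DataCite record. With a placeholder DOI, it looks roughly like this; check the current DataCite Metadata Schema documentation for the full list of allowed relation types:

```xml
<relatedIdentifier relatedIdentifierType="DOI"
                   relationType="IsReferencedBy">10.1000/example.article</relatedIdentifier>
```

That single element, deposited through the pathway you already use, is enough for the link to flow into the hub.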
There's no new language and no new pathway; you just use the existing pathway that you have to Crossref. Again, there's more information on this from Crossref: "How do you deposit your data citations" is a very helpful blog post on that, and the link is at the bottom of that slide. The same thing applies for OpenAIRE and the institutional repositories, so I won't go over that. What I will just mention is that if you're in Australia, there is a shortcut method, because ANDS has been part of this working group. The ANDS Research Data Australia service is a mini Scholix hub, and all the data set collection descriptions that ANDS has in Research Data Australia come into the Scholix information world if they have a related publication. So here's a very easy way for Australian repository and data center managers to do this: just include a relatedInfo element of type publication, using the standard RIF-CS exchange that we use with all the data centers in Australia, and you can include the identifier, titles, notes, et cetera. So you can get some pretty nice information in there, absolutely free of charge, by the exact same route that you use for populating Research Data Australia. The URL for the information about relatedInfo and how to add it to your feed is down on this slide. And just to go back, do you remember the data set that appeared in Scopus? It was this particular one from the University of Adelaide, the molecular simulations of proteins and peptide absorption. That is exactly what is currently being displayed on the Scopus page for the publication, and the information has come from this Research Data Australia page. You can see at the bottom there's a related publication, and ANDS has a system where we just take that information and push it into Scholix.
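In RIF-CS terms, the relatedInfo element just described looks roughly like the following, with placeholder values; see the RIF-CS schema documentation for the exact element definitions:

```xml
<relatedInfo type="publication">
  <identifier type="doi">10.1000/example.article</identifier>
  <title>An example journal article</title>
  <notes>Publication based on this data set</notes>
</relatedInfo>
```

This sits inside the collection description you already send to Research Data Australia, so no separate feed is needed.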
So I don't know whether there's anyone from the University of Adelaide watching today, but I'm hoping they're pleasantly surprised that this information has just been syndicated in there. And that goes for all the providers to Research Data Australia. All right, so if you're not covered by any of those pathways, so potentially you don't have a relationship with Crossref or DataCite or ANDS or OpenAIRE, et cetera, then there's always the option of becoming a hub yourself, and we'd encourage you to join the working group. So if you're a specialist astronomy data center, for example, you may not have a relationship with DataCite and you've got your own identifiers; you can just join the working group and start to expose that information. It will be aggregated and become part of this new ecosystem. So that's another option. Now, how do you get information back out of the Scholix ecosystem? That's another interesting question. Really, this is being developed as we speak, so it's probably better that you join the working group if you want to do this, but I'll quickly go over it so you've got the broader idea. There is the DLI website; you can go and type queries in. It also has a set of APIs that are being developed, and this particular Swagger site will give you the information about the different methods you can use. The ones that I showed here, and that Scopus and other people are using, are these "links from PID" queries: you provide a PID and it will return all the research objects that are related to that PID. Please join the working group if you're thinking of using that information, because it's being optimized as we speak, and your use case can help to shape what those APIs and libraries will be. There's also some stuff upcoming from both Crossref and DataCite: they're exposing their event data using this Scholix model. Those will be more community-focused queries, which will only cover the DOI side of things.
So that's where things are. I think that's a little bit of a happier ending to the story of linking literature and data. I should make it clear that at the moment these are pathfinder implementations; the aggregation of information is not comprehensive yet, and it's not fully established. But we are leveraging some very established global infrastructure in DataCite, Crossref, OpenAIRE and a number of operational data centers and journals around the world. So I think there is a really good model there that can be made comprehensive and can become an established part of the scholarly communication system. If you'd like any further information, there's the Scholix website, and ANDS also has some information about working with Scholix. So I will pause there and hand back to Natasha. Thanks very much, Adrian. That was really great.