Welcome to today's webinar, which is on the topic of identifying and linking physical samples with data using IGSN, the International Geo Sample Number. So let's get started. My name's Natasha Simons and I work with the Australian National Data Service, and I'm going to be your host for the webinar today. My colleague Susanna Sabine is in Canberra behind the scenes and co-hosting the webinar with me. This webinar will look at how you can reference physical samples online using a world-standard, globally unique persistent identifier scheme, the International Geo Sample Number, as well as discuss the international Linking Environmental Data and Samples symposium, which was held last week at the CSIRO Black Mountain Laboratories in Canberra. This webinar is the second in a series examining persistent identifiers and their use in research. In the first webinar we looked at citing grey literature using DOIs, and the recording of this is available on the ANDS YouTube channel. The third in the series will look at how to link publications and data through the international Scholix initiative. I would also like to acknowledge the Commonwealth Government for their support of ANDS under the NCRIS program. I'd like to introduce our speakers for today: Dr Lesley Wyborn, who's an adjunct fellow with the National Computational Infrastructure facility and the Research School of Earth Sciences at the Australian National University in Canberra, and Dr Jens Klump, who's OCE Science Leader, Earth Science Informatics, for CSIRO Mineral Resources and based in Perth. I'll now hand over to our first speaker, Dr Lesley Wyborn.
All right. So what I'm starting out with is the first part, which is about identifying samples. It's something that's dear to my heart because, as some of you know, I used to be a field geologist and I have collected thousands of samples in my career.
So how we are going to organise this is: I'll introduce the IGSN, the identifier for samples, outlining the application and availability for researchers, and then I'll hand over to Jens for the global picture and an update on the symposium that was held last week. So what we want to cover is: what is it, how do you use it, and how do you get an IGSN? The science use case is fairly typical. Samples are a first-class output of scientific research, in my opinion. A lot of our data ties back to samples, and hence so do our publications. So why do we need unique identifiers for samples? Part one: you can see on this map from Kerstin Lehnert that in the EarthChem database, run by the NSF, there are samples labelled M1 from everywhere, and M probably stands for "me", number one. So what we find when we start to do aggregations of samples is that this is a very common problem. A second problem that emerges is, again, I used to do a lot of analytical work, and it was quite common to go and get a sample and from that sample do sample splits. Quite often the sample splits were given different numbers: a split was put on a different machine and given a different number, or the sample was given to somebody else and they went and relabelled it according to the in-house rules of their repository. And so here are all the different names for a highly valuable sample from a cruise in the Pacific. As we are now moving towards data aggregations, we really need to be able to uniquely identify samples, and the analytical data and publications derived from these samples. This was the driver behind Kerstin Lehnert in the US and a few others getting involved in trying to set up a unique identifier system for samples. So what the IGSN does is provide persistent identifiers that are guaranteed to be unique through a hierarchical system.
It facilitates internet-based discovery of and access to physical samples, provides web applications and programmatic access to sample metadata catalogues, and helps network sample repositories and data centres. It ensures preservation of and access to sample data. It aids in the identification of samples in the literature, and there is one here that you can actually click on if you've got the time. So what can it be used for? IGSN stands for International Geo Sample Number, but increasingly we are finding it being used for water, biological materials and all sorts of things. It can be used for collections, groupings of samples, or for a sampling feature such as a borehole or an outcrop. And samples can be linked to each other through the related identifier metadata element. So that green thing down the bottom is actually a rock or a mineral called olivine, and above it you can see that we often take mineral separates out of those rocks, or we can create solutions. So you bring back that one thing from the field, which probably cost you a fortune to collect, and through IGSN we can link the parent to all the derived samples that come from it. It enables you to track the sample life cycle, so in CSIRO it's used for tracking samples and to support sample logistics. Starting out in the field, we say the IGSN is really the birth certificate: you have unambiguous identification and metadata capture with the mobile app, and the sample is given its IGSN in the field. Then as we take that sample into the lab we can identify all the derivative processes and analytical techniques we apply to that sample, and tie the data to it. And then finally the sample goes into the repository, and we can trace it through collections and samples in storage, catalogue it, and maintain sample logistics, for example when a sample is sent out to another museum or another institute.
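The parent-to-derived-sample linking described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Sample` class, the `parent` field (standing in for the related identifier metadata element) and the `XXDEMO...` identifiers are all hypothetical, not registered IGSNs or part of any IGSN software.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    igsn: str                      # hypothetical identifier, not a registered IGSN
    description: str
    parent: Optional[str] = None   # stands in for the related identifier element


def lineage(igsn: str, catalog: Dict[str, Sample]) -> List[str]:
    """Walk from a sample back to the original field sample."""
    chain: List[str] = []
    current = catalog.get(igsn)
    # Stop at the root, at an unknown parent, or if a cycle is detected.
    while current is not None and current.igsn not in chain:
        chain.append(current.igsn)
        current = catalog.get(current.parent) if current.parent else None
    return chain


def derived_from(igsn: str, catalog: Dict[str, Sample]) -> List[str]:
    """List every sample that descends from the given one, directly or not."""
    return sorted(
        s.igsn for s in catalog.values()
        if s.igsn != igsn and igsn in lineage(s.igsn, catalog)
    )


catalog = {s.igsn: s for s in [
    Sample("XXDEMO001", "olivine-bearing rock collected in the field"),
    Sample("XXDEMO002", "olivine mineral separate", parent="XXDEMO001"),
    Sample("XXDEMO003", "solution prepared from the separate", parent="XXDEMO002"),
]}

print(lineage("XXDEMO003", catalog))       # solution -> separate -> rock
print(derived_from("XXDEMO001", catalog))  # everything that came from the rock
```

Storing only the link to the parent on each derived sample is enough to answer both questions: where did this split come from, and what else was made from the same field sample.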
As I say, it's like when you're born you get a birth certificate. What we're enabling now is for samples, as they are collected, to be given a birth certificate that goes with them through life. As Jens will explain, it's based on DataCite DOIs, and so what we can do now is, once we've got the specimen IGSN, link it to the spectral results, and finally link it through to the publication. The IGSN has attracted a fair bit of attention. It is endorsed by the Coalition for Publishing Data in the Earth and Space Sciences, and in Elsevier and Copernicus earth science journals you are encouraged to put the IGSN on samples that you cite in the literature. So again, in the digital age you can get interested in a particular specimen and trace through its history and anything else that has been done on that specimen. So as a system overview: you register a sample with what we call an allocating agent. The allocating agent then registers the sample with IGSN e.V., which is the international implementation organisation, as you can see here. There are currently three allocating agents in Australia: CSIRO, Geoscience Australia and Curtin University. What I'll do next is take you through how these agencies are using it in different ways.
CSIRO became a member of IGSN in 2013, and they currently use it for the repository of their cross-faceted research group over in Perth, which takes minerals, rocks and synthetic materials. In the Capricorn Distal Footprints project they're using it for water, vegetation, soil, rock and regolith, and CSIRO is now looking to use it for their soils collection in Canberra and their insect collection. That's why we kind of refer to it as the IGSN now and not the International Geo Sample Number, because it certainly is being used for more.
If we go to Geoscience Australia, they've got the second largest collection of registered samples in the world: 1.6 million samples covering mineral separates, rocks, thin sections (that's microscope slides of rocks) and fossils. Geoscience Australia is also about to be, if they're not already, the registration agent for the geological surveys of the states and territories.
Curtin University has a different use case. There they're using it more for, as we mentioned earlier, tracking samples and sample splits through the laboratory. I'd like to acknowledge ANDS because they sponsored the development of this project, and the Curtin University Library, CSIRO and the Geological Survey of Western Australia are actually working together on this. So as I said, we've got the three agents. Curtin is only operating for Curtin University, and we're hoping to expand that so it can become more available to the rest of the research community.
We were also able to get some more funding from NCRIS through the Research Data Services project, which made it possible to develop a demonstrator for a common geosample portal, which you can see here. Metadata from the three agents is harvested into a common metadata portal so that you can discover samples registered by any Australian IGSN member. The Australian members have agreed to a common metadata schema, even though you've got quite a diversity of samples, and so we are hoping that, as this grows, if you want to find any information about a physical sample this will be the place you go. I now want to hand over to Jens, who will take it into a global perspective and also discuss some of the technical issues and the results from the symposium that we held last week, which was about trying to extend this into the environmental areas, that is, beyond its original intent within the solid earth sciences.
Thanks, Lesley. I'd like to start with saying: may all your problems be technical. Usually technical problems can be overcome. But there's also a whole social network behind technical solutions, and this is where the global perspective comes in.
So the IGSN Implementation Organization is the body that we created to carry this on the global stage. It's a charitable organization incorporated under German law and registered in Potsdam in Germany. At present it has 19 members on four continents. The governance model, as Lesley already mentioned, is a so-called hierarchical model, with hierarchical delegation. You can think of it as similar to the way IP numbers are assigned on the internet. The IGSN identifiers themselves are registered through the IGSN agents. To make sure that there's no overlap in numbers, each IGSN agent is given a so-called namespace for its registrations. As an example, all IGSNs registered by Geoscience Australia start with AU, and after that it's up to Geoscience Australia to make sure that these identifiers are unique. CSIRO starts its identifiers with CS. We delegated some of that to the Capricorn Distal Footprints project and gave them the namespace CSCAP. After that, it's their responsibility to make sure the names are unique and don't interfere with any other projects or infrastructures in CSIRO that are using IGSNs. Technically, IGSN builds on an existing technical base and community, the DataCite model. We basically cloned DataCite to use its technical base, which is ultimately based on the Handle System for persistent identifiers. A lot of the governance and how this is run is also based on the example of DataCite, and we work with them very closely to align our technical architectures and make collaboration and interlinking as easy as possible. There are two links here to our technical documentation and to our code repositories on GitHub. So the status of IGSN on the global scale is that it's still work in progress.
But we have active registration agents in Australia (CSIRO, Curtin University and Geoscience Australia), and also GFZ Potsdam, the German Research Centre for Geosciences, the Data Centre for the Earth and Environmental Sciences at Columbia University, and the Data Centre for Marine and Environmental Sciences at the University of Bremen. Sometimes it's difficult for government institutions to join a foreign organization. The German Geological Survey, BGR, and the US Geological Survey have some technical or legal issues with joining this organization, so they register IGSNs by proxy through other allocating agents. The interesting story now is that we're not only identifying samples within one institution, but we are also moving samples between institutions. This is where the real value of IGSN becomes visible. Lesley already mentioned the case of the John de Laeter Centre at Curtin University. They have adopted IGSN, so if a sample arrives with an IGSN, it is carried through the process, and any data that come from the analytical processes are linked with this already existing identifier. If the sample is not yet identified by an IGSN, the John de Laeter Centre assigns it one. The other case is subsampling. Lesley mentioned already that sometimes you take subsamples, and here it becomes a bit more complicated. It depends on where this is done, by whom, and where the subsample then resides. I won't go into the details now because that is something that needs to be discussed for the particular use case. The important point here is that any subsample should be identified by its own IGSN, to make it uniquely identifiable as well, and then linked to its parent sample.
So what's happening next? What we saw at the symposium last week is that we have already made good progress building a developer community around IGSN, but that needs to carry on further.
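The namespace delegation and the laboratory hand-over rules described above (keep an existing IGSN, assign one if it is missing, give each subsample its own IGSN linked to its parent) could be sketched roughly as follows. The `Agent` class, the six-digit numeric suffix, the `XXLAB` namespace and the incoming identifier are assumptions made for illustration; only the prefixes AU, CS and CSCAP come from the talk, and this is not how any allocating agent's software is actually written.

```python
from typing import Dict, Optional


class Agent:
    """Toy allocating agent that mints identifiers inside its own namespace."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace
        self._next = 1
        self.parents: Dict[str, Optional[str]] = {}   # identifier -> parent

    def delegate(self, suffix: str) -> "Agent":
        """Hand a sub-namespace to a project, e.g. CS -> CSCAP."""
        return Agent(self.namespace + suffix)

    def mint(self, parent: Optional[str] = None) -> str:
        """Create a new identifier; the six-digit suffix is an assumption."""
        igsn = f"{self.namespace}{self._next:06d}"
        self._next += 1
        self.parents[igsn] = parent
        return igsn

    def receive(self, existing: Optional[str] = None) -> str:
        """Keep an identifier the sample already has, otherwise assign one."""
        return existing if existing else self.mint()


csiro = Agent("CS")
capricorn = csiro.delegate("CAP")     # namespace CSCAP, as in the talk
lab = Agent("XXLAB")                  # hypothetical laboratory namespace

kept = lab.receive("AU0001234")       # made-up identifier arriving with a sample
new = lab.receive()                   # sample arrived without an identifier
split = lab.mint(parent=new)          # subsample gets its own, linked to parent

print(capricorn.namespace, kept, new, split, lab.parents[split])
```

Because each agent only ever mints inside its own prefix, uniqueness within the namespace is a local responsibility, which is the point of the hierarchical delegation.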
We document best practices to show how it can be used, and also build reference implementations of services that others can test their services against. The next step, which we are already taking, is expanding to identifying and linking objects in other domains, not only in the geosciences. Ultimately what we want to see is that other domains start reusing the IGSN technology. So maybe not IGSN in the strict sense through the existing organization, but just as we copied DataCite, other domains might copy IGSN as a technology and governance model for persistent identifiers in their specific domain.
So that's what I want to say about IGSN from my side, and I want to give you a brief report back on the symposium we had last week, called Linking Environmental Data and Samples. This was a Cutting Edge Science Symposium, which got its seed funding from the CSIRO Research Plus office. The goals were to bring international research leaders to Australia and also to provide a forum for early career researchers to engage with each other and with the international experts. We have a web page, and it's probably easiest to note the short link, the Google short link. Besides the seed funding from CSIRO, we also received sponsorship from other organizations: the Australian Bureau of Meteorology, Geoscience Australia, the US National Science Foundation, the Earth Science Information Partners in the US, NCI, and also the Atlas of Living Australia, AuScope and TERN. Those organizations mainly funded the travel for international experts.
What we discussed at this symposium was the science drivers: why are we interested in linking anything with something through semantic web technologies? That's because we have a rich resource of samples that support scientific investigation, and we wanted to discuss, and we did discuss, how we link these to the data sets that were derived from the samples, and then how we can link samples and data to the literature where the samples are interpreted and put into context. And last but not least, how can we include machines as users? Why do we want to do that? Because our body of data, information and knowledge is growing at a much faster pace than any of our minds can comprehend, and machines can be very helpful in trying to find things in these very rich holdings. To me, it was also important not only to discuss the theory and the future perspective of linked data but also to look at the solutions. Can we get it to work? So we discussed the role of infrastructure in building the linked data federation and how we can support the evolution of linked data. What we saw is that heterogeneity is inherent and we have to have mediation mechanisms. We cannot build one thing for all. This raises one question: how precise do terms need to be? The commonly held wisdom is that computers don't understand ambiguity, so you have to be 100% precise, but that we cannot achieve. So we have to suffer some degree of imprecision, but vocabularies that are useful will be adopted. That is something that we can already see. To distill the essence: we have a fabric of science, where we ask which elements produce output; we have a process of science, how are these outputs produced; and we have a language of science, how do we describe these elements. For these we have to find solutions at different levels in the linked data framework.
As I mentioned at the start of my part of this presentation, there are social and community factors in getting things like this working, and Paul Box from CSIRO Land and Water said that the greatest organizational effectiveness is achieved when technology systems fit social systems. So it's not that you build it and they will come; it has to support the processes that already exist, and that will make it more likely to succeed. Certainly there are questions of who bears the cost and what the incentive to contribute would be. We've been building things in this domain for a while, so it was also very useful to discuss the fail patterns, and we identified two major ones. One is the anti-Life of Brian pattern: "I am different", so that's why I have to do things differently, and that can lead to failure. The other is the too-big-to-fail pattern, where something should have ended long ago but we've invested too many resources and so everybody is embarrassed to pull the plug. It would have been better to allow things to fail quickly and start with a fresh view.
Thanks very much, Jens. Thanks Jens and Lesley. Just while people are thinking about questions, perhaps you could give us an idea of how many IGSNs have been assigned and what types of samples those have been assigned to.
So in the global total we're approaching 6 million IGSNs. Most of those are geological materials, but we also have an increasing number of water samples, plant materials and soils, and also places. A borehole, for example, is not an object, it's something else, but since the material that's coming from a borehole is very tightly coupled to that feature, we also identify that feature.
Okay. Researchers are generally a bit more familiar with DOIs. Can you put forward a few arguments that you would put to researchers as to why they would select an IGSN over a DOI?
There's one historical reason: when we decided to go for a global system a couple of years back, that was before DataCite existed, and TIB Hannover was running the show. They pushed us back and said it is a really great idea but it's out of scope, so we went our own way. In doing that we discovered that there are specific governance issues in how we create these identifiers, the resolving mechanism and what metadata we use, which are quite specific and not well covered by the more bibliographic world of DOI. But DOI and DataCite are changing: they're changing the business model, they're changing the way things are run, so we are in the conversation, and let's see how things develop in future. The systems are technically compatible, so maybe we will converge one day.
Okay, so watch this space. There's a question from Josh Brown: other than the international legal issues, what barriers are there to IGSNs being adopted?
The legal issues are actually a very specific problem for government institutions that cannot easily join foreign organizations. The main problem for adoption is that it needs to be introduced into workflows, and that changes how people work. That is, in my view, the main barrier to adoption: people have to make changes to what they do, and usually they're busy enough, so they don't want any extra work. So we have to make this simple or provide other good reasons to make it worthwhile.
Another issue too is that if you've got an organization that's fairly well set up and has an internal system that guarantees unique identifiers, then registering is a reasonably simple process for your organization.
What we have noticed is that organizations that are full of M1, M2, M3, you know, people using repeated numbers, and that don't have an internally consistent unique identifier, or number, not identifier but number, do struggle a bit, and introducing this is much more complex. That's why the surveys were fairly good at this, because they had unique systems.
That's a very good point, because in the case of Geoscience Australia we just had to put AU in front of their numbers and it was done.
Okay, we're coming up to time. There's one other question there: how does Curtin University Library support IGSN?
Um, his name's Matias, oh, I've forgotten the rest, somebody can help me. Matias, yeah, and I would suggest you go and talk to him, but the Curtin University Library has been very supportive of this whole project.
Yes, so Josh, sorry, John Brown has made a good point that Matias is now at UWA, so maybe there's someone else at Curtin University. Perhaps when I do the follow-up email, if there are some contact details for each of the allocating agencies, it would be useful if I could share those in the email for people.
Brent McInnes from Curtin University would be the best then, because he kind of runs the project in collaboration with ANDS and the Library.
Okay, and John Brown is at Curtin University. John Brown's on the call now, so I think he's saying that you could talk to him if you wanted some information, so I'll check in with you, John, after the webinar. And sorry, one other question: are there competing IDs in this space, or does it look like this will be the gold standard?
Not in the geoscience space, but there certainly are others in other areas. While we're being sort of open about IGSN, it has been one of the more successful ones, and we often wonder why. We think it's because, one, it has a very good governance structure and, secondly, its compatibility with DataCite and DOIs. Jens, would you like to add anything to that?
It was an interesting discussion during the symposium last week, where we had quite a number of people from the biodiversity world who had tried to introduce the Life Science Identifier over the past 10 years. That system was quietly buried recently because adoption was just too hard: it was technically immature and it did not have a governance structure that made it easy to apply. So the biodiversity world is now discussing how to proceed.
And there may be more information about IGSNs published shortly too, we hope, Jens?
Yes, we're working on an overview paper to describe how the system is set up from an organizational perspective and its science use, and then there will be a separate paper outlining the technical implementation.
A question from Will: can this system be used for samples that are not able to appear, I think it's supposed to be, in a public catalogue?
Yes, it can. You can think of this in the same way as DOIs are used. When you resolve a DOI, it doesn't always get you to the object itself, like a paper: most papers are not publicly accessible, you have to have a subscription. There are good reasons why you might not want to disclose the details of a sample to the public; it can be a rare species or a sensitive site. So the only rule is that it has to resolve to something, but what you want to disclose is up to your discretion. Lesley, do you have anything to add to that?
It's just that, as I said, we have a lot of fossils in this, and I can assure you the locations of many of those fossils in certain organizations are not publicly available. You do know that there has been a fossil collected from Alice Springs, but only certain people who are qualified will get what that specific location is. The system definitely does have safeguards around that. Okay, oh, the other thing I wanted to add, which may be of interest to people listening, is that it's not just for land-based specimens. In the US it is widely used for the ocean drilling program and for marine samples as well, and once we get into marine areas a lot of it is bio samples. So it's just kind of organic, the way it's starting to grow into other areas, because, as Jens said, the Life Science Identifier system collapsed, and people just see the need for having unique identifiers for their samples, and this is what's happening.
So related to that, the next question is: are you aware of anything similar for pathology specimens?
I'm not. Jens, what about you?
I have read about identifiers for cell cultures, but I'm not aware that they have this kind of resilient resolving mechanism.
And just to pass on, that's the basis of what we were doing last week, and why groups like GBIF and TDWG are getting interested in what we've got. It's that core kernel that applies to the registration of a sample with the group in Germany, and then you can go into a next layer out with metadata that's more in tune with a rock sample, or a plant sample, or something else. You know, the communities develop their own additional metadata, but it's that core component that is the bit that can be cloned for other groups if they so want.
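The idea of a small core registration with a community-specific layer around it, and of only disclosing what you choose to, can be pictured with a short sketch. Everything named here is hypothetical: the `CoreRecord` and `CommunityRecord` classes, the field names, the `example.org` address, the identifier and the coordinates are invented for illustration and do not reflect the actual IGSN metadata schemas.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class CoreRecord:
    igsn: str           # hypothetical identifier, not a registered IGSN
    resolves_to: str    # the identifier must resolve to something


@dataclass
class CommunityRecord:
    core: CoreRecord
    community: str                                   # e.g. rocks, plants, fossils
    extra: Dict[str, Any] = field(default_factory=dict)


def public_view(record: CommunityRecord,
                withheld: Tuple[str, ...] = ("latitude", "longitude")) -> Dict[str, Any]:
    """Return what is shown publicly, leaving out fields the owner withholds."""
    visible = {k: v for k, v in record.extra.items() if k not in withheld}
    return {
        "igsn": record.core.igsn,
        "resolves_to": record.core.resolves_to,
        "community": record.community,
        **visible,
    }


fossil = CommunityRecord(
    core=CoreRecord("XXDEMO010", "https://example.org/samples/XXDEMO010"),
    community="palaeontology",
    extra={"material": "fossil", "latitude": -20.0, "longitude": 130.0},  # made up
)

print(public_view(fossil))   # the precise location is not included
```

The core part stays the same for every community, while the extra layer and the list of withheld fields are up to whoever holds the sample.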
Okay, well, that brings us to the end of the questions. Thanks everyone for attending today's webinar, and thanks very much to Jens and Lesley for their time in sharing that. There was a lot of good discussion and a lot of interest around IGSNs, which I think we'll have to follow up through the email after the webinar. As I mentioned earlier, this is actually a series on persistent identifiers, and the third one is on linking publications and data. You can find out about the webinar series through the ANDS website or by subscribing to ANDS news. So thanks for coming, everybody, and bye.
Thank you very much for having us. Thank you.