I'd like to welcome Dr Jens Klump from CSIRO, who's going to be taking us through both DOIs, Digital Object Identifiers, and IGSNs, International Geo Sample Numbers.

A few words to introduce myself. My background is in geology. I did my undergrad at the University of Cape Town in South Africa in geology and oceanography, and then carried on to do marine geology back in Germany. I then did a loop through the IT industry and joined the German Research Center for Geosciences in Potsdam in 2001, where I stayed until the beginning of this year, and in March 2014 I joined CSIRO as the OCE (which means Office of the Chief Executive) Science Leader, Earth Science Informatics. My previous work was to support something that could be called the digital value chain, or data value chain: to understand how researchers work in the geosciences and develop projects with them, then integrate the data that come in from heterogeneous sources, help people adopt new technologies, and keep an eye on which of these new technologies can be used in the geosciences.

So the main topic of today will be the DOIs and the IGSNs. I'll start with the more well-known topic, digital object identifiers for data publication and citation, and my focus will be on how this was intended to make data part of the record of science. That will be the guiding principle for the discussion from my perspective.

When the internet was still young, this was one of the errors that popped up very early on, and it has been with us ever since. In terms of the record of science, and having science on the web, this meant that things were broken; the term for this was link rot. This problem of link rot was recognized very early on, and it gave rise to the development of the Handle System, which was introduced in 1995. Based on the Handle System, the publishing houses introduced the digital object identifier.
It was proposed in 1997 and went into production in 1998, and this was also the first time somebody suggested that these new digital object identifiers could also be useful for identifying data. That started the project in Germany, with the first DOI for data being minted in 2004, so we are at the 10-year anniversary now. This was in the context of a project funded by the German Research Foundation, but if you want to run this as a sustained service, you have to find a business model that expands the use of these digital object identifiers to an international scale. The international scale also meant that the original service provider, the German National Library of Science and Technology, was a bit of a problem, in that the French and the Swiss and other national libraries were uncomfortable with using a German national library as their service provider. So something else had to be found, and that was DataCite.

DataCite was founded in 2009 as an organization to govern the system of digital object identifiers for data. It has 31 members today. Last I looked, there were 3.6 million datasets registered, of which 1.2 million were registered in the last 12 months, so there has been a huge upsurge in use. But if you compare this to the 1.8 million articles published in various fields of science in 2012, the number of datasets that have been registered is actually pretty small. If you google "DataCite statistics" you can see the detailed statistics, and you will see that many of these datasets are very fine-grained. So even though 1.2 million in 12 months sounds impressive, it doesn't really reflect the output of papers published; it's still lagging behind, but it's catching up fast.

So the original idea was: we have data traditionally published in papers, and on the left-hand side, that's a typical journal page.
You have some tabulated data, but the really interesting part is this illustration in the right-hand column, which some people call a buckshot scattergram, with dots and disks. It nicely serves the purpose of illustrating the main idea of this paper, in this case calculated chlorophyll versus measured chlorophyll in Lake Baikal in Siberia, but that's about where it stops. You cannot use this any further; it just illustrates the thought.

Fortunately, in this case, if you take the DOI at the bottom of the page, it does resolve to something. We put the data from this paper onto the Scientific Drilling Database, which is a database for data from scientific drilling projects. This repository gives you a description of the data, it gives you a way to cite the data, and it points to related materials, like the paper where these data are interpreted. You can download the data, and you can also download metadata in various formats: ISO 19115 for georeferenced data; NASA's Directory Interchange Format (DIF), which goes back a bit further in time; the DataCite metadata; and, for the particular system that underlies the Scientific Drilling Database, also the eSciDoc metadata. The point with metadata is that there are many ways to describe things, and this particular system can cater to any description that you see fit or necessary to describe your dataset.
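As a rough illustration of what "taking the DOI and resolving it" means in practice, here is a minimal sketch that turns a DOI name into the URL of the public `https://doi.org/` proxy, which sits on top of the Handle System. The function name `doi_to_url` and the example DOI are made up for illustration; the sketch only builds the resolver URL and does not contact any server.

```python
from urllib.parse import quote

# Public DOI proxy; following this URL redirects to the registered landing page.
DOI_PROXY = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI name (hypothetical helper, not an official API)."""
    doi = doi.strip()
    # Accept a few common ways of writing a DOI and reduce them to the bare name.
    for prefix in ("doi:", "https://doi.org/", "http://dx.doi.org/"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI name starts with "10." and has a prefix/suffix separated by "/".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are not safe in a URL path, keeping the "/".
    return DOI_PROXY + quote(doi, safe="/")


if __name__ == "__main__":
    # "10.1234/example-dataset" is an invented DOI used only for this example.
    print(doi_to_url("doi:10.1234/example-dataset"))
```

The landing page that the repository returns (description, citation, related materials, metadata downloads) is what the resolver URL ultimately redirects to; the name itself stays the same even if that location changes.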
So, DOI for data. The resolution question is solved: there's a way to go from a digital object identifier to the uniform resource locator, and this resolving service is provided by the Handle service. Fine, so we can find things that we can name, but what do we name? That's the question of granularity: what is the smallest identifiable object that we actually want to identify? What exactly is identified by a particular DOI? How do we go about versions, if we update things, if we do corrections, if we have errata? And in anything that has to do with the environment, time series are particularly important: how do we deal with continuing time series, for instance from environmental monitoring?

Those questions arose very early on, already in the precursor project to the DOI for data, and I think the principles haven't really changed. Some use cases have become a bit more refined, but the underlying questions are still granularity, identity, versioning, and time series. And this goes back very far: the earliest source that I found was from the year 75, Plutarch describing the Ship of Theseus paradox. In the Ship of Theseus paradox, Theseus comes to port once a year with the ship and changes a plank or two, and does the same thing the next year, and so on, and in year n, n planks of the ship have been changed. So is Theseus' ship still the original ship, even though many of its components have changed? Things become even more complicated when you collect the planks and build a second ship out of the old planks of Theseus' ship: which of the ships is the original? This is a question that's quite vexing. There have been many discussions, but one of the solutions that I found is actually not that old, asking the question: can anything be identical with another object, or are we looking for an equivalent object? Which, in the case of digital object identifiers, is a good question: what are we actually
looking for, what is represented by the identifier?

As an aside to the Ship of Theseus paradox: formally, this can be approached by introducing the concept of perdurantism. The perdurantists say that an individual has distinct temporal parts throughout its existence, so the Ship of Theseus in any year has its distinct temporal existence, and over time it is identical with itself. This is the antipode to endurantism, where the view is that an individual is wholly present at every moment of its existence. I think with many of the things we're dealing with, the endurantist view produces some problems and the perdurantist view is more pragmatic, but that's something I will leave to the philosophers.

I want to go through some of the use cases of digital object identifiers. The first and easiest use case is the single item, produced at time t0. We go back at t1 and t2 and it hasn't changed. That's very easy: we give it one name, we can always refer to it by that name, and we have a resolving service that will point us to its location.

Now, if we have a time series, it starts at time t0. We can go back to it at t0; we can go back to it at t1, and then something has been appended to this time series; and when we go at t2, more has been appended. But the past record hasn't changed, and that's an important point: the past record hasn't changed, so we can go back to it. We introduced this use case to the project early on, when we were looking at time series coming from satellites, where the past record didn't change; it was only appended to. In the old business model we would have had to pay for each DOI, and we weren't prepared to pay millions of dollars for DOIs. We wanted to pay for just one DOI per dataset, arguing that you could always go back to the old record; it just became longer. It's like in the days of library index cards, where you had one library index card saying "Nature, 18-something, dash", and you wouldn't
have an index card for every issue of Nature; you would have only one for the series, which was an ongoing series.

Now, if you update an item, things are somewhat different. We start at t0 with some item, then at t1 we update the item, and at t2 we update it again, indicated by the different colour bars in this box. There are some use cases where I want to go back to a very specific version of that dataset, and each of these versions is identified by a different identifier: we have the name DOI 1, the name DOI 2, and DOI 3. But sometimes people will only be interested in the most current version, and that is something you can approach by creating what we call a parent object. The name DOI A in this case is the parent to DOI 1, 2 and 3, and when you refer to DOI A it will take you to the most current version. If you want a specific version, you can address that very specific version by its own name.

The snapshot is a different variant of the same theme. Here it's a mix of updated item and time series, and it follows the same principle: you can go back to a very specific version or you can go to the current version, depending on what your use case is and where you want to go.

Also very useful is the concept of a collection, where you have several data objects that are compiled into a collection by a parent object. This can, for instance, solve the question of how to cite many hundreds of datasets in a publication. I don't want a huge citation list; you can create a collection and then cite that collection, which in turn cites all the child datasets. This is an example of a collection, and you can also resolve that DOI. It is a series of maps of the Lake Baikal region, and it's also supplementary material to an article, so we didn't actually put all the maps into the citation list at the end of the article; we refer to this collection. There are other examples of that you can find as well, and I think this is a very useful and elegant solution to bundling many
datasets into one collection and then referring to that as one and not as five hundred.

One of the things that then appears is this: we have so far talked about collections and repositories, but we're now approaching an age where we're not dealing with handcrafted data in all cases anymore. We're starting to use services, and this is certainly a question that will be discussed, I think, also later today in a workshop: how do we deal with this when we look at it from a services perspective? If it's file-based, it's easy in the sense that it's easy to identify what we're referring to. It's a pretty generic approach, it's close to the original record of science, because we're working with something that's very close to the original materials, and it's easy to make this compliant with the Open Archival Information System (OAIS) reference model. But when we want to use this in the context of user agents, of machines, this often requires manual intervention: downloading data and then transforming it into something that machines can use.

On the other hand, if we approach this from a services perspective, that's machine-friendly and the use of the data can be automated, but the storage, if we store things as a service, is not OAIS compliant; it does not fit with the Open Archival Information System model. The PANGAEA database, for instance, has run into that problem, so now they have to keep a double record: they have to keep the original materials that they receive, and use that as a stage to create the services that they then disseminate. That's certainly a way you can go, to start file-based and then transform things into services, but there's an ongoing debate and there are things that need to be solved.

Now, the International Geo Sample Number is something that we developed based on the idea of the digital object identifiers, with a slightly different use case, and here we're touching on something that is now also called the Internet of Things. The
Internet of Things, when you look at the Wikipedia definition, refers to uniquely identifiable objects, called things, and their virtual representations in an internet-like structure. In geology, specimens are one of the basic units of geoscience observations. They are basic units for data reporting, because measurements are being done on them, and they are also basic units for data discovery, access and analysis, because people refer to them. So creating access to this information about the samples or specimens is essential for the evaluation and interpretation of specimen-based data, and it would certainly be desirable to have access to the physical specimens, to allow us to build more comprehensive datasets and reuse these resources and the data resources. But until recently there was no standard way to access information about specimens. There are few online repository catalogues, there are very few disciplinary catalogues, and the metadata found in publications is incomplete, if anything is reported at all.

Just to illustrate the case: these are the locations of rock specimens in the EarthChem database called M1. You can see that M1 is globally distributed, there's a certain fondness for M1 in Japan, and in terms of rock type, M1 is anything. So M1 clearly is not a useful name to use; if you find it in the literature, it doesn't get you anywhere.

But even when the names are unique, it doesn't help much if they're not linked to anything. This is a case that I stumbled across when I looked at marine drilling data. I wanted to run a model for a sensitivity study, and my colleagues told me, "You know, the numbers you are using in this study don't seem realistic. Where did you get them from?" I said I got them from the literature. "Yeah, but maybe they got them wrong." So I meditated over this paper and this map and thought: should I write to China now to ask for the numbers, or is there any other way that I could get hold of the numbers? So
I recognized SO 95-5 and SO 95-20 and thought maybe this was a site survey cruise for an ocean drilling program campaign. I checked the PANGAEA database to see whether the cruise SO 95 existed, and it did. That gave me a lead to go to the SEDIS database of the International Ocean Drilling Program, or Integrated Ocean Drilling Program at the time, to see whether the drill holes 1146, 1147 and so on existed, and yes, they were there. I did find the data I was looking for, and I could verify my claim about these numbers. So, having this in-depth knowledge of how ocean drilling works, I could trace the numbers, but it could have been so much easier if the materials had been identified by International Geo Sample Numbers and the data by DOIs, at least, so that there would have been a way of looking those up if they had been reported. But that's the world, so there's room for improvement.

That improvement could look like this. You start searching for something using your favourite search engine. You find the paper, which has a DOI to the datasets that are interpreted in it, and then maybe these data will point you to other papers that are based on them or offer different interpretations. Maybe there's a more detailed data publication in a journal like Earth System Science Data, and that points us to the materials that are the basis of these measurements. That's how I think things will work in the future, and we're getting there. We're not the only ones talking to the publishers; several initiatives are trying to make things more interconnected.

Why didn't we use digital object identifiers for specimens? "Digital object identifier" means it's a digital identifier for an object; it's not simply an identifier for a digital object. So we could have used them for specimens, but, just historically, the German National Library of Science and Technology, TIB
Hanover, didn't want to do that because it wasn't part of their scope. This was a really formal decision to go a different way, and they made that decision before DataCite was founded, so we went separate ways. It could be discussed whether to merge the systems again. At the moment we're keeping them separate, because the use case of dealing with physical specimens called for a different set of rules, even though the structures that we put in place are very similar to DataCite.

So the governance of these systems is a very important issue. The technicalities are fairly simple: if you base them on the Handle service, the infrastructure is basically there, and then you build services that are technically not too demanding. But to govern the system in a way that the names are unique, that the names are persistent, and that the links are also as persistent as possible, that is a different matter. Since IGSN and DOI are both based on the Handle System, it would be easy to merge IGSN with DataCite in the future if things go that way. We'll see; maybe things will go in a different direction.
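To make the use cases from this talk a little more concrete, here is a toy sketch of a name registry. It enforces unique names, lets a specific version be addressed by its own name, and lets a parent name resolve to the most current version. Everything in it (the `Registry` class, its method names, the identifier strings and URLs) is hypothetical and for illustration only; it is not how the Handle System, DataCite or IGSN are actually implemented.

```python
class Registry:
    """Toy in-memory registry: persistent names, versions, and parent objects."""

    def __init__(self):
        self._targets = {}   # name -> location (URL) of a single item or version
        self._children = {}  # parent name -> version names, oldest first

    def _check_unused(self, name):
        # Governance rule from the talk: a name must be unique, so never reuse one.
        if name in self._targets or name in self._children:
            raise ValueError(f"name already registered: {name}")

    def register(self, name, url):
        """Register a single item that keeps one name for its whole life."""
        self._check_unused(name)
        self._targets[name] = url

    def register_parent(self, name):
        """Register a parent object that will point to versions of an item."""
        self._check_unused(name)
        self._children[name] = []

    def add_version(self, parent, name, url):
        """Register a new version under a parent; older versions stay resolvable."""
        if parent not in self._children:
            raise KeyError(f"unknown parent: {parent}")
        self.register(name, url)
        self._children[parent].append(name)

    def resolve(self, name):
        """Return the location for a name; a parent resolves to its latest version."""
        if name in self._children:
            versions = self._children[name]
            if not versions:
                raise LookupError(f"parent has no versions yet: {name}")
            return self._targets[versions[-1]]
        return self._targets[name]


if __name__ == "__main__":
    reg = Registry()
    reg.register_parent("doi:A")
    reg.add_version("doi:A", "doi:1", "https://example.org/data/v1")
    reg.add_version("doi:A", "doi:2", "https://example.org/data/v2")
    print(reg.resolve("doi:1"))  # a specific version
    print(reg.resolve("doi:A"))  # the most current version
```

A collection could be modelled in the same spirit, with the parent listing its child datasets instead of pointing only to the newest one. The hard part, as noted above, is not this bookkeeping but the governance that keeps names unique and links persistent over time.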