I'd like to welcome Dr Jens Klump from CSIRO, who is going to take us through both DOIs, Digital Object Identifiers, and IGSNs, International Geo Sample Numbers, as identifiers for geological samples.

A few words to introduce myself. My background is in geology. I did my undergraduate degree at the University of Cape Town in South Africa in geology and oceanography, and then carried on to do marine geology back in Germany. I did a loop through the IT industry and then joined the German Research Centre for Geosciences in Potsdam in 2001, where I stayed until the beginning of this year. In March 2014 I joined CSIRO in the OCE, the Office of the Chief Executive, as Science Leader for Earth Science Informatics. My previous work was to support what could be called the digital value chain, or data value chain: to understand how researchers in the geosciences work, develop projects with them, integrate the data that come in from heterogeneous sources, help people adopt new technologies, and keep an eye on which of these new technologies can be used in the geosciences.

The main topics of today will be DOIs and IGSNs, and I will start with the better-known topic of Digital Object Identifiers for data publication and citation. My focus will be on how this was intended to make data part of the record of science; that will be the guiding principle for the discussion from my perspective.

When the internet was still young, the 'page not found' error popped up very early on and has been with us ever since. In terms of the record of science and having science on the web, this meant that things were broken, and one of the terms for this was 'link rot'. The problem of link rot was recognized very early on, and it gave rise to the development of the handle system, which was introduced in 1995. Based on the handle system, the publishing houses introduced the Digital Object Identifier: it was proposed in 1997 and went into production in 1998. That was also the first time somebody suggested that these new Digital Object Identifiers might be useful for identifying data. This started a project in Germany, with the first DOI for data minted in 2004, so we are at the ten-year anniversary now. That happened in the context of a project funded by the German Research Foundation, but if you want to run this as a sustained service, you have to find a business model that expands the use of these Digital Object Identifiers to an international scale. International scale also meant that the original service provider, the German National Library of Science and Technology, was a bit of a problem, in that the French and the Swiss and other national libraries were uncomfortable with a German national library as the service provider. So something else had to be found, and that was DataCite. DataCite was founded in 2009 as an organization to govern the system of Digital Object Identifiers for data, and it has 31 members today. Last I looked, there were 3.6 million datasets registered, of which 1.2 million were registered in the last 12 months, so there has been a huge upsurge in use. But if you compare this to the 1.8 million articles published across the various fields of science in 2012, the number of datasets registered is actually pretty small. And if you Google the DataCite statistics, you will see that many of these datasets are very fine-grained.
So even though 1.2 million datasets in 12 months sounds impressive, it doesn't really reflect the output of papers published. It is still lagging behind, but it is catching up fast.

The original idea was this: we have data traditionally published in papers, and on the left-hand side is a typical journal page. You have some tabulated data, and the really interesting part is the illustration in the right-hand column, which some people call a buckshot scattergram, with dots and disks. It nicely serves the purpose of illustrating the main idea of this paper, in this case calculated chlorophyll versus measured chlorophyll in Lake Baikal in Siberia. But that's about where it stops: you cannot use it any further, it just illustrates the thought. Fortunately, in this case, if you take the DOI at the bottom of the page, it does resolve to something. We put the data from this paper into the Scientific Drilling Database, which is a database for data from scientific drilling projects. This repository gives you a description of the data, it gives you a way to cite the data, and it points to related materials, like the paper where these data are interpreted. You can download the data, and you can also download metadata in various formats: ISO 19115 for georeferenced data, the NASA Directory Interchange Format, which goes back a bit further in time, the DataCite metadata, and, for the particular system that underlies the Scientific Drilling Database, also the eSciDoc metadata. The point about metadata here is that there are many ways to describe things, and this particular system can cater for any description that you see fit or necessary for your dataset.

So for DOIs for data, the resolution question is solved: there is a way to go from a Digital Object Identifier to a Uniform Resource Locator, and this resolving service is provided by the handle service. Fine, so we can find things that we can name. But what do we name? That is the question of granularity: what is the smallest identifiable object that we actually want to identify? What exactly is identified by a particular DOI? How do we go about versions, if we update things, if we make corrections, if we have errata? And in the environmental sciences in particular, time series are very important: how do we deal with continuing time series, for instance from environmental monitoring? Those questions arose very early on, already in the precursor project to the DOI for data, and I think the principles haven't really changed. Some use cases have become a bit more refined, but the underlying questions are still granularity, identity, versioning, and time series.
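To make the resolution step concrete: as a minimal sketch, assuming only the public doi.org proxy and Python's requests library, a DOI can be resolved to its currently registered URL like this (the example DOI is the DOI Handbook's own, used purely for illustration).

```python
import requests

def resolve_doi(doi: str) -> str:
    """Resolve a DOI to its currently registered URL via the doi.org proxy."""
    # The proxy answers with an HTTP redirect; the Location header
    # carries the URL that the handle record currently points to.
    response = requests.get(f"https://doi.org/{doi}",
                            allow_redirects=False, timeout=10)
    response.raise_for_status()
    return response.headers["Location"]

# The DOI of the DOI Handbook, used here purely as an illustration.
print(resolve_doi("10.1000/182"))
```

This one-name-to-one-location lookup is exactly what the granularity and versioning questions complicate: the name stays fixed, but what it should point to may change.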
And this goes back very far. The earliest source that I found was from the year 75, Plutarch describing the Ship of Theseus paradox. In the Ship of Theseus paradox, the ship comes to port once a year and a plank or two are changed; the same happens the next year, and so on, so that in year n, n planks of the ship have been replaced. Is the Ship of Theseus still the original ship, even though many of its components have changed? Things become even further complicated when you collect the old planks and build a second ship out of them: which of the ships is the original?

This is a question that's quite vexing, and while many discussions have been started, one of the solutions that I found is actually not that old. It asks: can anything be identical with another object, or are we looking for an equivalent object? In the case of Digital Object Identifiers, this is a good question: what are we actually looking for? What is represented by the identifier? As an aside to the Ship of Theseus paradox, formally this can be approached by introducing the concept of perdurantism. The perdurantists say that an individual has distinct temporal parts throughout its existence. So the Ship of Theseus in any given year has its distinct temporal existence, and over time it is identical with itself. This is the antipode to endurantism, where the view is that an individual is wholly present at every moment of its existence. I think that with many of the things we are dealing with, the endurantist view produces some problems; the perdurantist view is more pragmatic. But that is something I will leave to the philosophers.

I want to go through some of the use cases of Digital Object Identifiers. The first and easiest use case is a single item produced at time t0, and when we go back to it at t1 and t2, it hasn't changed. That is very easy: we give it one name, we can always refer to it by that name, and a resolving service will point us to its location.

Now if we have a time series, it starts at time t0; when we go back to it at t1, something has been appended to the time series, and when we go back at t2, more has been appended, but the past record hasn't changed. That is an important point: the past record hasn't changed, so we can always go back to it. We introduced this use case to the project early on, when we were looking at time series coming from satellites, where the past record didn't change but was only appended to. In the old business model we would have had to pay per DOI, and we weren't prepared to pay millions of dollars for DOIs. We wanted to pay for just one DOI per dataset, arguing that you could always go back to the old record; it just became longer, like in the days of library index cards, where you would have one index card saying 'Nature, 18-something, dash'. You wouldn't have an index card for every issue of Nature, only one for the series, which was an ongoing series.

Now, if you update an item, things are somewhat different. We start at t0 with some item, then at t1 we update the item, and at t2 we update it again, indicated by the different colored bars in this box. There are some use cases where I want to go back to a very specific version of that dataset, and each of these versions is identified by a different identifier: DOI 1, DOI 2, DOI 3. But sometimes people will only be interested in the most current version. That is something you can approach by creating what we call a parent object, and DOI A in this case is the parent of DOIs 1, 2 and 3. When you refer to DOI A, it will take you to the most current version; if you want a specific version, you can address that very specific version by its own name.
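As a toy sketch of the parent-object pattern just described, with all DOI names and URLs invented for illustration:

```python
# Toy registry illustrating the parent-object pattern: every version
# keeps its own DOI, and the parent DOI always follows the most
# current one. All names and URLs here are invented.
versions = {
    "10.1234/data.v1": "https://repo.example.org/data/v1",
    "10.1234/data.v2": "https://repo.example.org/data/v2",
    "10.1234/data.v3": "https://repo.example.org/data/v3",
}

# The parent entry is repointed whenever a new version is registered.
parent = {"10.1234/data": "10.1234/data.v3"}

def resolve(name: str) -> str:
    """Resolve a child DOI directly, or a parent DOI to its newest version."""
    return versions[parent.get(name, name)]

print(resolve("10.1234/data"))     # always the most current version
print(resolve("10.1234/data.v1"))  # a specific frozen version
```

Only the parent entry changes when a new version appears; the children stay frozen, which is what makes going back to a specific version possible.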
The snapshot is a different variant of the same theme; it is a mixture of the updated item and the time series. It follows the same principle: you can go back to a very specific version, or you can go to the current version, depending on what your use case is and where you want to go.

Also very useful is the concept of a collection, where you have several data objects that are compiled into a collection by a parent object. This, for instance, can solve the question of how to cite many hundreds of datasets in a publication. If you don't want a huge citation list, you can create a collection and then cite that collection, which in turn cites all the child datasets. Here is an example of a collection; you can also resolve that DOI. It is a series of maps of the Lake Baikal region, and it is also supplementary material to an article. So we didn't actually put all the maps into the citation list at the end of the article; we refer to this collection instead. There are other examples of this that you can find. I think this is a very useful and elegant solution for bundling many datasets into one collection and then referring to them as one, and not as five hundred.
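As a sketch of how such a collection can be expressed: the DataCite metadata schema lets a record point at related identifiers, and the structure below approximates that idea in Python. The field names loosely follow the DataCite schema, and all DOIs are invented for illustration.

```python
# Approximation of a DataCite-style record for a collection: the
# parent record lists its children as related identifiers.
collection = {
    "identifier": "10.1234/baikal.maps",
    "title": "Maps of the Lake Baikal region (collection)",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": f"10.1234/baikal.map.{i}",
            "relatedIdentifierType": "DOI",
            "relationType": "HasPart",
        }
        for i in range(1, 6)
    ],
}

# Citing the one collection DOI stands in for citing every part;
# each part remains individually resolvable through its own DOI.
for child in collection["relatedIdentifiers"]:
    print(child["relatedIdentifier"])
```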
One of the things that then appears: so far we have talked about collections and repositories, but we are approaching an age where we are no longer dealing with handcrafted data in all cases; we are starting to use services. And this is certainly a question that will be discussed later today in a workshop: how do we deal with this when we look at it from a services perspective? If it's file-based, it's easy in the sense that it's easy to identify what we are referring to. It is a pretty generic approach, and it is close to the original record of science, because we are working with something that is very close to the original materials, and it is easy to make this compliant with the Open Archival Information System (OAIS) reference model. But when we want to use this in the context of user agents, of machines, it often requires manual intervention: downloading data and then transforming it into something that machines can use. On the other hand, if you approach this from a services perspective, it is machine-friendly and the use of the data can be automated, but storing things as a service is not OAIS-compliant; it does not fit with the Open Archival Information System model. The PANGAEA database, for instance, ran into that problem, so now they keep a double record: the original materials they receive, plus using those as a stage to create the services they then disseminate. That is certainly a way you can go, starting file-based and then transforming things into services, but there is certainly an ongoing debate and things that still need to be solved.

Now, the International Geo Sample Number is something we developed based on the idea of the Digital Object Identifiers, with a slightly different use case, and here we are touching on something that is now also called the Internet of Things. The Internet of Things, if you look at the Wikipedia definition, refers to uniquely identifiable objects, called things, and their virtual representations in an internet-like structure. In geology, specimens are one of the basic units of geoscience observations. They are basic units for data reporting, because measurements are made on them, and they are also basic units for data discovery, access and analysis, because people refer to them.

So creating access to this information about samples and specimens is essential for the evaluation and interpretation of specimen-based data, and it would certainly be desirable to have access to the physical specimens, to allow us to build more comprehensive datasets and reuse these resources and the data resources. But until recently there was no standard way to access information about specimens. There are few online repository catalogs, there are very few disciplinary catalogs, and the metadata found in publications is incomplete, if anything is reported at all.

Just to illustrate the case: this is the locations of rock specimens in the EarthChem database called M1. You can see that M1 is globally distributed, there is a certain fondness for M1 in Japan, and in terms of rock type, M1 is anything. So M1 is clearly not a useful name; if you find it in the literature, it doesn't get you anywhere.

But even when names are unique, it doesn't help much if they are not linked to anything. This is a case I stumbled across when I looked at marine drilling data. I wanted to run a model for a sensitivity study, and my colleagues told me: you know, the numbers you are using in this study don't seem realistic. Where did you get them from? I said, I got them from the literature. Yes, but maybe they got them wrong. So I went back over this paper and this map and thought: should I write to China now to ask for the numbers, or is there another way I could get hold of them? I recognized SO95-5 and SO95-20 and thought maybe this was a site-survey cruise for an international ocean drilling campaign. So I checked the PANGAEA database to see whether the cruise SO95 existed, and it did. That gave me a lead to go to the SEDIS database of the Integrated Ocean Drilling Program, as it was called at the time, to see whether the drill holes 1146, 1147, etc. existed, and yes, they were there. I did find the data I was looking for, and I could verify my claim about these numbers. Having this in-depth knowledge of how ocean drilling works, I could trace the numbers, but it could have been so much easier if the materials had been identified by International Geo Sample Numbers. The data were at least identified by DOIs, so there would have been a way of looking those up if they had been reported.

So there is room for improvement, which could look like this: you start searching for something using your favorite search engine; you find a paper that carries a DOI link to the datasets interpreted in that paper; maybe these data point you to other papers that are based on different interpretations; maybe there is a more detailed data publication in a journal like Earth System Science Data; and the data point us to the materials that are the basis of these measurements. That is how I think things will work in the future, and we are getting there. We are not the only ones talking to the publishers; there are several initiatives trying to make things more interconnected.
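As a toy sketch of that chain of linked identifiers, with every identifier invented purely for illustration:

```python
# A toy link graph for the chain described above: a paper points to
# its data via DOIs, and the data point to the physical samples via
# IGSNs. Every identifier here is invented.
links = {
    "doi:10.1234/paper": {"cites-data": ["doi:10.1234/data.1"]},
    "doi:10.1234/data.1": {"measured-on": ["igsn:EXAMPLE01"]},
    "igsn:EXAMPLE01": {},
}

def follow(identifier: str, depth: int = 0) -> None:
    """Print an identifier, then recurse into everything it links to."""
    print("  " * depth + identifier)
    for targets in links.get(identifier, {}).values():
        for target in targets:
            follow(target, depth + 1)

follow("doi:10.1234/paper")
```

The point is not the data structure but the typed links: once papers, data and samples all carry resolvable identifiers, this walk can be done by a machine across repositories.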
Why didn't we use Digital Object Identifiers for specimens? The name suggests an identifier for a digital object, but a DOI can just as well be read as a digital identifier for an object, so we could have used them for specimens. Historically, though, the German National Library of Science and Technology, TIB Hannover, didn't want to do that, because it wasn't part of their scope. This was a formal decision to go a different way, made before DataCite was founded, so we went our separate ways. Merging the systems again could be discussed; at the moment we are keeping them separate, because the use case of dealing with physical specimens called for a different set of rules, even though the structures we put in place are very similar to DataCite. The governance of these systems is a very important issue. The technicalities are fairly simple: if you base them on the handle service, the infrastructure is basically there, and you then build services that are technically not too demanding. But governing the system in a way that keeps the names unique, keeps the names persistent, and keeps the links as persistent as possible, that is a different matter. Since IGSN and DOI are both based on the handle system, it would be easy to merge IGSN with DataCite in the future, if things go that way. We'll see; maybe things will go in a different direction.
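Since both identifier systems sit on top of the handle infrastructure, an IGSN lookup can be sketched much like a DOI lookup. This assumes the handle REST API at hdl.handle.net and the 10273 handle prefix used for IGSNs; the sample number below is invented.

```python
import requests

def resolve_igsn(igsn: str) -> str:
    """Look up an IGSN's landing page through the handle REST API."""
    # IGSNs are registered under the 10273 handle prefix; the URL
    # entry in the handle record points at the sample's landing page.
    response = requests.get(
        f"https://hdl.handle.net/api/handles/10273/{igsn}", timeout=10
    )
    response.raise_for_status()
    for value in response.json()["values"]:
        if value["type"] == "URL":
            return value["data"]["value"]
    raise LookupError(f"no URL registered for IGSN {igsn}")

print(resolve_igsn("EXAMPLE01"))  # invented sample number
```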
One of the use cases where all of this came into production was TERENO, the Terrestrial Environmental Observatories, and I will give you a brief introduction of what it is about. It obviously has to do with environmental monitoring: it is an infrastructure initiative by the Helmholtz Association in Germany to provide environmental monitoring infrastructure for the scientific community, an infrastructure that people can then group projects around. Construction started in 2008, and operations are planned to run for 25 years. It is subdivided into four regional observatories, and I was involved with the TERENO Northeast observatory, which has eight study sites, 32 platforms and, until earlier this year, 35 million data entries from various sensors, with more platforms being added. The other three regional observatories are of similar scale. The idea behind this is that with climate change, some areas in Germany will be more affected than others, and affected in different ways. These areas of vulnerability have been identified, and the regional observatories cover them: they are located in the Alps and pre-Alps, in the western mountains looking at a river catchment, in the central mountains and lowland, and in the north-eastern lowlands. The idea is to look at the interactions and feedbacks between the different compartments of the ecosystem, the atmosphere, the terrestrial biosphere, and the terrestrial hydrosphere and pedosphere, so the water and the soils, but also to look at different scales, basically from the cubic-centimetre scale all the way up to whole river catchments, and to bridge the gaps between these different scales.

In the northeast this is fairly spacious, spread out over the north-eastern lowlands, which is an interesting area from an ecosystem-development perspective, because it used to be heavily farmed in the Middle Ages. You can see this in the cross-section of a soil horizon in the lower right-hand corner, where there is a medieval soil covered by wind-blown material from later, younger times, because since the Middle Ages the area has been progressively depopulated, and very quickly depopulated in the past 20 years. So it is very interesting to see how things changed, from intensive agricultural use to almost national-park-like conditions today. A particular trait of the northeast observatory is the use of geo-archives: using lakes and trees as long-term archives of the past, to look at processes that happened decades or even centuries ago.

This means there is a lot of data coming together from four different observatories, collected into a common catalog, while the data are held in the four local systems. The catalog then has to point back to the local databases, and the central portal should help not only to discover data but also with visualization and access, and allow you to download data.

For the northeast observatory, this is the more detailed system architecture; the type is a bit too small to read on the screen, but basically it has two parallel branches: a branch on the left-hand side, which is file-based, to keep a record of science, and a branch on the right-hand side for the services. The data come in at the top from the sensors in the field, by FTP, basically over mobile phone networks, and are collected on an FTP server. When the data import tools recognize that something new has arrived, they start a workflow to import the data. To start with the left-hand side: the data are imported into the data storage infrastructure, which can also incorporate external datasets, and this data storage infrastructure has a metadata editor front-end. Metadata are mostly added as part of the import process, because at the time the data arrive we know what they are, so they can be annotated automatically; but sometimes the metadata need some editing, and that is what can be done at that stage. Then all the different metadata records are harvested over the OAI-PMH protocol, the Open Archives Initiative Protocol for Metadata Harvesting, into the GeoNetwork portal software. In this system, GeoNetwork serves only one purpose: the translation from OAI-PMH to CSW, the Catalogue Service for the Web, so that the catalog entries can be served to other metadata portals based on OGC standards, like the central TERENO portal, the German federal data infrastructure, or any other metadata portals.
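To give a flavor of that harvesting step, here is a minimal OAI-PMH request in Python; the endpoint URL is invented, and a real harvester would also follow resumptionToken paging.

```python
import requests
import xml.etree.ElementTree as ET

# Invented endpoint; a real harvester would use the repository's
# own OAI-PMH base URL.
BASE_URL = "https://repo.example.org/oai"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_identifiers(metadata_prefix: str = "oai_dc"):
    """Yield the record identifiers that an OAI-PMH endpoint offers."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    for header in root.iter(OAI_NS + "header"):
        yield header.find(OAI_NS + "identifier").text

for identifier in list_identifiers():
    print(identifier)
```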
On the right-hand side are the services, and there are some other processes going on there, for instance format transformations: transforming things from the original formats they are delivered in into things that can be used in the services. There are also some initial quality checks that trigger email alerts to the scientists responsible for certain time series, telling them they should have a look, because maybe a sensor is broken or something else went wrong. That is then used as a stage to feed data into a PostgreSQL database. The data model we used here is the CUAHSI data model, the US hydrological data model. We used it because TERENO is very much about hydrology, and CUAHSI is a very active community working with these data that has developed a lot of tools and standards to deal with these kinds of data. But one of the requirements in setting up the system was that we had to provide a Sensor Observation Service to serve data and allow users to query the datasets. So we use the 52°North SOS data server, which has a different data model than the CUAHSI data model, and to get from CUAHSI to 52°North, we created views onto the CUAHSI model that conform to the 52°North data model. That is the Sensor Observation Service that then serves the TERENO data portal and all other OGC clients.
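The idea of exposing one data model through views shaped like another can be sketched like this; the table and column names are simplified inventions, not the actual CUAHSI or 52°North schemas.

```python
# A minimal sketch of the view-mapping idea, using an in-memory
# SQLite database so it runs anywhere: a CUAHSI-style table is
# exposed through a view shaped the way an SOS layer would expect.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datavalues (            -- simplified CUAHSI-style table
        value_id        INTEGER PRIMARY KEY,
        site_code       TEXT,
        variable_code   TEXT,
        local_date_time TEXT,
        data_value      REAL
    );

    -- The view renames and reshapes columns so the service layer can
    -- read observations without touching the underlying data model.
    CREATE VIEW observation AS
    SELECT value_id        AS observation_id,
           site_code       AS feature_of_interest,
           variable_code   AS observed_property,
           local_date_time AS phenomenon_time,
           data_value      AS result
    FROM datavalues;
""")
```

The attraction of this approach is that the data are stored once, in the community data model, while every service sees the shape it expects.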
The screenshot looks like this: you can search things in their geographical context, then look at the time series, download them, filter them, whatever. This gives a good first overview, but it is certainly only the starting point; from there you hook up your OGC-compliant client and start working with the data.

Where do I think this is all heading? Reproducible research is certainly one of the buzzwords around, and it is now hitting the geological sciences with some delay, because in the geological sciences getting your hands on data is quite difficult. One of the things that follows on from DOIs, I think, will be identifiers for software. It is starting already: CSIRO is already assigning DOIs to software, and this is something I think is very necessary, because, similar to data and specimens, software should also be identifiable in this persistent way. That would create the now-missing link between papers and data, because then we could understand how the data were processed and interpreted. It would also make software recognizable as a scientific achievement, which is a gap at the moment: it is not always recognized that creating software is a contribution to science. And it would make science more transparent and reproducible. So assigning DOIs to software is a good start, but it might not be enough; we would also have to think about the other questions again, the questions of identity, versioning, or location and repository, to identify what we are referring to when we say we have an identifier for software.

Then, sensor networks are becoming more important in the geological sciences than they were a few years ago, and these sensors can be manifold: they can be drilling rigs, they can be satellites, they can be measurements in the field, they can be drones, or they can be instruments in the lab. At the moment these different subsystems are not well integrated, and the ability to create metadata as the data are being created is not used to its full potential.

With more sensors around, we also have more data, so we have to find ways of working with very large datasets that are too large to be inspected in detail, or even to be loaded into desktop programs. A lot of the datasets you can easily download from the web are already too big to be handled in your standard desktop software. And with time series, the question can be: how do you inspect three years of meteorological radar for anomalies? You cannot sit down and watch three years of rain radar. Also, data mining today mainly means numerical and text data, but maybe we want to work more with images and quite different other materials, not only numbers and characters. These challenges mean that for large datasets, processing will have to move from the desktop to the cloud, and you know this, but I think this is something that still needs some research on how we make it operational.

And then there is linked data, which has been around as a buzzword for some time. Tim Berners-Lee formulated four principles of how he thinks linked data should work: you use Uniform Resource Identifiers, URIs, to denote things; you use HTTP URIs, so that these things can be referred to and looked up, or dereferenced as they call it, by people and by machines; you provide useful information about things, using standards like RDF, the Resource Description Framework, or SPARQL, which is a query language; and you include links to other related things when you publish data on the web. This is basically what I showed earlier: the idea of starting with a paper and then going on to find data and other publications, and so on. The question is how DOIs fit into this picture of what the linked data community calls cool URIs. DOIs, being resolvable through HTTP services, have a resemblance to that and could be used in the same way, but I think we still need to do some thinking about how to bring these two worlds together; a small sketch of the linked data idea follows at the end.

So, in summary: persistent identifiers now allow us to publish, cite and identify data, specimens and software, and as we see from the numbers, data publication is becoming more common. The principles of data identification can also be used with other materials and with software, and we encounter the same problems there. The future publication, I think, will consist of elements linked by identifiers; the paper will be only the interpretation, but it will also provide access to the data, to the materials that were used, and to the software and workflows. More and more repositories are now offering application programming interfaces based on linked data principles. Not all of them do yet, but I think that is the way they have to go, because it will make them more useful, and it also fits with the idea of pushing processing into the cloud rather than downloading and processing on your desktop PC. And future data publication, whatever 'publication' may then mean, will cater both for people as consumers of these publications and for user agents, machines, making use of them.
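As a small sketch of those linked data principles, assuming the rdflib library, with all identifiers invented: a couple of RDF statements about a dataset, queried with SPARQL.

```python
from rdflib import Graph

# Two statements about an invented dataset DOI, written in Turtle:
# the dataset has a title, and it is referenced by a paper.
turtle = """
@prefix dct: <http://purl.org/dc/terms/> .

<https://doi.org/10.1234/data.1>
    dct:title "Example time series" ;
    dct:isReferencedBy <https://doi.org/10.1234/paper> .
"""

graph = Graph()
graph.parse(data=turtle, format="turtle")

# SPARQL query: which paper references this dataset?
query = """
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?paper WHERE {
    <https://doi.org/10.1234/data.1> dct:isReferencedBy ?paper .
}
"""
for row in graph.query(query):
    print(row.paper)
```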