Right, here we go. Okay, thank you very much for the introduction. Welcome everybody and thanks for joining. What I'll try to do in the next half an hour is to go quickly through the things that we've been doing in the Data Citation Working Group of the RDA. So what I'm planning to do is initially very briefly introduce the challenges that we're trying to meet within this working group and the solution that we have been elaborating up until pretty much the last plenary in Dublin. This is basically for those who haven't been following the activities of our working group. The second half will be devoted to the things we discussed last week in Amsterdam, basically reporting back on a number of pilots where people have tried to implement the solutions that we've been working on, and then a look at the next steps.

Well, the start is pretty trivial, so I can probably go through it pretty quickly. Citing data should basically be easy, ranging from, well, the old-fashioned way of simply providing a URL to a data set, via providing a reference in the bibliography section, to, you know, professionally minting a persistent identifier and depositing the data in a repository, and we are seeing those kinds of citations happening all over the place. There are two issues that we want to address in the working group. The first is that currently the data sets that we can cite basically have to be static ones. So a data set is generated, it is uploaded into the repository, and then you can cite that data set as it is. As soon as you want to correct errors, or as soon as new data is being added, as soon as you've got dynamic data where data keeps coming in, these different states of the data cannot be easily identified and cited.

Ways around that that we've seen in different use cases are that people either basically cite the entire data set and provide an accessed-at date, saying, you know, that's the status of the data as we saw it, and that's the one that we used in the study. Or, another approach that we see frequently, a kind of artificial versioning, where data is collected continuously but then released in batches on a yearly basis, basically delaying the time until the data is made available in order to fall back to a conventional versioning scheme. What we would like to support with the work in this working group is to allow data to be cited at the time it becomes available, either when new data is being uploaded into a database, or when existing data is being corrected, correcting errors or recalibrating values in a database, without delaying the release of that data, but still allowing researchers to cite precisely the data that they have been using in a study at a specific point in time. So this is the core focus: being able to identify dynamic data.

The second aspect that we're taking a closer look at, beyond the dynamics, is the granularity of data citation. When we have huge data collections, for example where sensor data is being collected continuously, it is very rare that researchers use the entire data that's represented in the database for a specific study. Rather, researchers usually use a subset of the data, be it a certain time span of data, or a specific set of measurement attributes that is fed into a study.
And what we want to support is a way of citing that specific subset of data as it was used in the study. Again, we started off by taking a look at current approaches to how this is being done, and the ones that we came across simply don't scale up. One thing that you find quite frequently is that people deposit a copy of the very specific subset of data that they were using in the study together with the article. That works fine for small amounts of data, but if you always have to deposit whatever terabytes of data with each paper that you're submitting, that's not going to scale. Another approach that is quite frequently taken is that you cite the entire database and then provide a textual description of the subset that you're using, saying I'm using, whatever, measurement values 5, 7 and 10, and the time span of 1st of January until 27th of June. But that's usually quite ambiguous, because it's hardly ever specified precisely enough whether the from and to dates are both included or excluded, whether it's an open or closed interval, which outliers were removed, and things like that. So what we want to have is a way to precisely identify even the very specific subset of data, so an arbitrary combination of rows and columns if we talk SQL databases, that is being used in the study.

It's those two questions that we really want to address within our working group. What we basically started off from is that the existing approaches of assigning persistent identifiers either to entire datasets, or to static datasets released at fixed time intervals, don't really work well with the continuous data streams that we see in many eScience domains. So what we want to have is a precise way of assigning identifiers, and in such a way that it's actually machine processable: not verbatim textual descriptions meant for human readers, but precise definitions of the citations that allow machines to go back to the very data being used.

So the goals that we set for this working group were to come up with a solution that allows you to cite data in situations where the data is dynamic, so where data is being corrected for errors, recalibrated, quality assurance measures being applied, or new data being added to a database; one that allows researchers to cite arbitrary subsets of data, any combination of rows and columns or time spans in the data, basically from a single number to the entire dataset. The approach should be stable across technology changes, so replacing a database, migrating the data to a new database system or schema, should still allow you to go back to the very specific data citation. And it should be machine actionable, so not just machine readable, because almost everything is machine readable, and definitely not just human readable: we want a precise definition that a system can parse in order to access the data. And as a kind of frame setting, it should be scalable to very large and potentially highly dynamic datasets with very frequent updates. So that's the setting that we started off with in this working group as part of the Research Data Alliance.
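Just to make the machine-actionability point concrete, here is a small sketch of the difference between a textual subset description and a machine-actionable one. The table and column names are invented for illustration; they are not from any of the pilots:

```python
# The textual description "measurement values 5, 7 and 10, 1st of January
# until 27th of June" is ambiguous about interval boundaries and removed
# outliers. A machine-actionable selection spells everything out:
subset_query = """
    SELECT *
    FROM measurements                   -- hypothetical table name
    WHERE attribute_id IN (5, 7, 10)
      AND measured_at >= '2014-01-01'   -- lower bound explicitly included
      AND measured_at <  '2014-06-28'   -- half-open interval: 27th of June
                                        -- is the last day included
"""
# A system can parse and re-execute this; with the prose description,
# a human has to guess the exact boundaries.
```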
The working group was officially endorsed just before the previous plenary in Dublin in March this year. Just to recap some of the discussions that we had prior to launching the working group: we decided to keep it very, very focused on the technical aspects of identifying those subsets. We do not really discuss what kind of metadata to assign to such a citation, and we do not really discuss bibliometrics, how to give credit and how to pass on that credit. We are basically trying to find a solution that allows us to identify that data set, and then to pass that solution on to the other working groups that deal with the metadata assigned to citations and with credit giving and accreditation of data. So that's the starting point for this working group.

The solution that we came up with, up until the last plenary in Dublin, was actually a very simple one. As a starting point, we have data represented in some form of data repository, be it a SQL database, be it comma-separated value files, be it an XML database, be it a triple store, be it a NoSQL database, so any kind of data representation, and we do have some means of accessing that data, some form of issuing a kind of select statement against it. In order to enable the kind of citations that we want to support, we need to ensure two things. The first is that the database itself is timestamped and versioned, meaning basically that we keep a history of all the changes that happen to the database. So whenever a new value is added to the database, it's added with a timestamp of when it was added. And a value that is deleted or changed is not overwritten, but is basically marked as deleted and the new value is added, again with a timestamp of when this happened. We actually found that in many research databases this is already the case; many research databases are already timestamped and versioned.

What we then do, when a researcher creates a working set or selects a subset of the data via some kind of workbench interface, is that we assign a persistent identifier not to the resulting data, but to the query that was used to select the data. And that query, again, is timestamped. What we can then do later on is to re-execute the timestamped query against the timestamped database, and that allows us to retrieve exactly the same subset of data as it existed at that point in time. I'll go into a little bit more detail in a minute on what we will be getting back as options for that, and on what we do in terms of query rewriting for normalization, the unique sort, and hashing. But that's basically the core principle: having timestamped and versioned data, and having a persistent identifier assigned to a timestamped query.
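To make that core principle a bit more tangible, here is a minimal sketch, assuming SQL-style data; the table and column names are invented, and a real deployment would choose between this "integrated" schema and a separate history table, a distinction that comes up again with the pilots later on:

```python
# Sketch of a timestamped and versioned table: rows are never overwritten.
# Every row carries an inserted_at timestamp and a deleted_at timestamp
# that stays NULL while the value is still current.
VERSIONED_SCHEMA = """
CREATE TABLE measurements (
    id          INTEGER,
    value       REAL,
    inserted_at TIMESTAMP NOT NULL,
    deleted_at  TIMESTAMP            -- NULL means 'still valid'
);
"""

def as_of(table: str, ts: str) -> str:
    """Build a subquery that shows `table` as it looked at time `ts`.

    A query rewriter would substitute this for every table reference in
    the researcher's select statement before re-execution.
    """
    return (
        f"(SELECT * FROM {table} "
        f" WHERE inserted_at <= '{ts}' "
        f"   AND (deleted_at IS NULL OR deleted_at > '{ts}')) AS {table}"
    )

# Example: "SELECT value FROM measurements WHERE id = 7", executed with the
# timestamp '2014-03-01 12:00:00', becomes:
#   SELECT value
#   FROM (SELECT * FROM measurements
#         WHERE inserted_at <= '2014-03-01 12:00:00'
#           AND (deleted_at IS NULL
#                OR deleted_at > '2014-03-01 12:00:00')) AS measurements
#   WHERE id = 7
```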
The first thing that we discussed was when to assign a new or an existing persistent identifier to a query. Basically, when a researcher puts together a query via the workbench, the system in the background analyzes whether an identical query has already been issued before, so whether the semantics of the query is identical. The system processes the query, takes a look at the query string and at the subset, the result set, that is produced, and if both are identical, then we assign the existing persistent identifier. If the query has new semantics, or if it is an existing query but the data in the database has changed in the background, then a new persistent identifier is generated, and on the system side that persistent identifier and the query are stored. It's important to note that checking only the result set for identity is not sufficient to determine whether a query is identical: it can happen that two different queries return the same result set. In this case we still want to capture them as different queries, because what we want to identify is the query semantics. We want to later on support the query being executed against the current version of the database as well, and I'll get to that a little later. So basically, what we want to make sure is that an existing persistent identifier is only assigned to a query if the semantics is identical, so if it's basically the same query. I'll show you an example of that later on when we come to the pilots.

In order to determine whether two queries are identical, we need to normalize the way the query is written. So what we do in the first stage, whenever such a query comes in via the workbench that the researcher is using to select the subset of data, is some standardization and normalization of the way the query is written, depending on which workbench is being used. We then need to adapt the query to the versioning approach that is chosen in the database system, whether a history table is being used or whether the versioning is happening in the operational database. We then add a timestamp to the select statements (I'm now using SQL terminology), so basically to the query processing statements, and then we potentially need to identify the last change to the data touched upon by the result set, to find out whether it constitutes a new version of the data or not. Something that I'll also go into in a bit more detail on one of the following slides is the unique sort, to make sure that we get the same result set back when we come to the hashing, on the slide after next.

So what we do in the normalization is usually normalizing the spelling of the query; we rearrange the sorting of the filtering criteria, and we compute a hash key over the query string to identify whether that query has been issued before. If an identical query has been found, we rerun it and check whether it still returns the same result set, and if that is the case, then we assign the existing persistent identifier; otherwise we mint a new one.

There are two things that we need to consider when we want to establish the identity of the result set. To be on the safe side, when we re-execute a query in the future, we want to be able to compare the result set, to check whether it's really identical to the one retrieved by the original query. So we compute a hash key over the result set, and there are different ways how we can do that hashing. We can compute the hash key over the entire result set, which is comprehensive, since it considers all the rows and columns in the data as it is being returned, but it's potentially quite expensive if you have to compute the hash key over, whatever, a very, very large data set being returned. Another approach that we are currently discussing is to compute the hash key only over the row identifiers and column headers (I'm talking SQL-style data again), so basically the identifiers of the data being returned, which should be sufficient to determine whether the same data is being returned. It doesn't, of course, safeguard against unmonitored changes to the attribute values, but that's something that should actually be captured on the database side already.
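A rough sketch of what this decision logic could look like. The normalization here is deliberately simplistic (a real implementation parses the query and sorts the filter predicates, as described above), and the dictionary just stands in for the persistent query store:

```python
import hashlib

def normalize(query: str) -> str:
    # Toy normalization: collapse whitespace and lowercase. The real thing
    # would parse the query and, e.g., sort commutative WHERE predicates
    # alphabetically so that equivalent spellings hash identically.
    return " ".join(query.lower().split())

def query_fingerprint(query: str) -> str:
    return hashlib.sha256(normalize(query).encode()).hexdigest()

def assign_pid(query: str, result_hash: str, store: dict, mint_pid) -> str:
    """Reuse a PID only if both query semantics AND result set are unchanged."""
    key = query_fingerprint(query)
    entry = store.get(key)
    if entry is not None and entry["result_hash"] == result_hash:
        return entry["pid"]          # same query, data unchanged: reuse PID
    pid = mint_pid()                 # new semantics, or the data has changed
    # (a real query store would of course keep the older entries as well)
    store[key] = {"pid": pid, "result_hash": result_hash, "query": query}
    return pid
```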
Another thing that I've got at the top of this slide, which we came across when we were working through a few use case scenarios, is the unique sorting. Something we found out, well, something that's actually obvious, is that databases are usually set-based: you issue a query, and you get the same set of records back as a response, but not necessarily in the same order. If the data is migrated to a different database, or if databases are highly distributed, you might get the data back in a different sort order, unless the user has defined a unique sort in the query. Now, this is not a problem from a database perspective, but in settings where we want to reuse the data in order to verify an analysis process, we need to consider the fact that most analysis processes are sequence-based. So what we want to ensure is not only that the same set of data is returned from the database, but that the tuples also come back in the same order. We do this by applying a unique sort on any table touched upon, prior to applying any user-defined sort. By doing that, we can basically guarantee that when any joins are executed at a later point in the query processing, the tuples are returned in the same order, and thus any processing that happens afterwards and is sequence-dependent should get the data in the same order and compute the same results.
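A simplified sketch of the idea. Note that this toy version appends the primary key as a tie-breaker after the user's sort, which is one easy way to obtain a total order; the rewriting described above applies the unique sort per table before joins, which this sketch doesn't capture:

```python
def with_unique_sort(query: str, user_order_by: str, primary_key: str) -> str:
    """Make the tuple order deterministic across re-executions.

    `primary_key` is assumed to uniquely identify each record; appending it
    after the user-defined sort criteria breaks any remaining ties, so two
    runs on different systems return the tuples in the same sequence.
    """
    criteria = f"{user_order_by}, {primary_key}" if user_order_by else primary_key
    return f"{query} ORDER BY {criteria}"

# with_unique_sort("SELECT * FROM measurements", "value DESC", "id")
#   -> "SELECT * FROM measurements ORDER BY value DESC, id"
```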
Another thing that we started discussing just before this summer was which timestamp to assign to a new query. Remember, we said that the query must be timestamped in order to be executed against the timestamped database, and the initial choice was to use the timestamp of the query execution: if I pose a query now, the query gets that timestamp, and it gets re-executed with that timestamp. We then later started discussing some other timestamps that could be used, and we haven't come to a clear conclusion yet on which one to recommend; all three options lead to the same result. What we could also do is apply the timestamp of the last change to the database: whenever the last update happened to the database, that would be, in inverted commas, the "version" of the database as it was used for this query. And the third option that we started discussing is a bit more complex: it would assign the timestamp of the last change to any tuple that appears in the result set, including markings as deleted. Of course, this last version is the most complex in terms of the query rewriting, but it would most closely resemble what people would consider a version of a data set in the traditional sense, basically the timestamp of the latest change that we would find among the tuples in our result set. Again, we haven't come to any final conclusion on which of the three options we would like to recommend. In terms of fulfilling the requirements, all three are identical; all three would produce the same result set upon re-execution. But there are differences in the beauty of the solution, you know, the conceptual beauty of the three different timestamps that we could assign. So that's still an open issue, and I'd love to hear any thoughts and comments on that.

Just to wrap up the solution that we're applying, the building blocks are: we need to have uniquely identifiable data records in order to apply the unique sort; we need to have versioned data, where changes are marked as deletions and insertions rather than overwriting or updating a value; we need to have timestamps of when those insertions and deletions happened; and there must be some form of query language that allows us to construct subsets if we want to support citing them. The modules that need to be added to any such solution are then: a persistent query store, where we store the queries, the timestamps, the hash keys, and any metadata that describes the query; a query rewriting module, if we want to identify identical queries and adapt them to the versioning approach in the database; a module that assigns the PID to the query and stores it in the query store; and of course some module that creates a landing page for somebody who wants to re-execute that query.

Now, if you deploy the whole thing, from the researcher's perspective it is basically completely transparent. A researcher can use the standard workbench to identify the subset of data that they want to use for a study. When that selection has happened, and the researcher presses whatever download button, basically any button that feeds the data into their subsequent analysis process, that data is made available via whatever API, a persistent identifier is created, and the query is timestamped, transformed, and written into the query store. A hash value is computed over the data set and stored in the query store as well, and usually a recommended citation text, whatever BibTeX snippet, is generated for the user, to encourage them to actually put that citation directly into the study.

When somebody then resolves that persistent identifier, so executes the citation, it takes them to a landing page where more detailed metadata on the data subset is provided: a description of the subset, a link to the parent set, so the database where the data came from, and obviously a download link to the data subset. And this is actually one very neat characteristic of the solution: from the landing page, we can re-execute the query with the original timestamp against the timestamped database and get the original data, or we can re-execute the query against the current version of the database, getting basically the same query semantics but with all corrections to the data that have happened since. And of course, given those two things, we can also get the changes that have happened, by basically providing the diff between the current version and the original version. That's actually a neat difference compared to other approaches, such as storing just the identifiers of the result set or keeping a dump of the original data, because those would make it much harder to trace any changes, updates and corrections that have happened to the database. We actually think that this is a pretty neat side effect of applying this dynamic, query-based way of identifying a subset. And again, once that PID is resolved, the query is executed against the timestamped database, and the results can be returned and fed into the analysis process again. So far for the solution.
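Pulling those building blocks together, here is a sketch of what one record in the query store could hold, and of the three landing-page actions just described. The field and function names are mine, for illustration, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class QueryStoreEntry:
    pid: str               # persistent identifier assigned to the query
    original_query: str    # the query as the researcher issued it
    rewritten_query: str   # normalized, unique-sorted, versioning-aware form
    query_hash: str        # hash over the normalized query string
    execution_ts: str      # timestamp the query is (re-)executed against
    result_hash: str       # hash over the result set (or its row/column ids)
    creator: str           # who created the subset
    description: str       # optional free-text description of the subset
    parent_pid: str        # PID of the parent data set as a whole

def resolve(entry: QueryStoreEntry, run_query):
    """The three landing-page actions: original data, current data, diff.

    `run_query(sql, ts)` is assumed to execute the rewritten query against
    the timestamped database as of time `ts` and return a list of tuples.
    """
    original = run_query(entry.rewritten_query, ts=entry.execution_ts)
    now = datetime.now(timezone.utc).isoformat()
    current = run_query(entry.rewritten_query, ts=now)
    # the diff exposes every correction or addition since the citation
    added = [row for row in current if row not in original]
    removed = [row for row in original if row not in current]
    return original, current, (added, removed)
```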
This was pretty much the status that we had at the Dublin meeting, aside from the discussions about the timestamps, which only emerged during the summer. What we did at this meeting was basically to take a look at some of the pilots that have been implemented since then, either conceptually or, for some of the pilots, already in a technical implementation stage, and to see whether the principles work. So we have a few examples now on SQL-based data, because that was the easiest one that we started working on, and I'll show you briefly the SQL-based solution, then some of the examples on comma-separated value files and XML data, and I'll report briefly on two workshops that have happened with respect to this data citation work.

For the SQL databases there were two scenarios that we started working on. One is LNEC, a laboratory of civil engineering in Portugal that we are collaborating with in one of the European Union research projects. They have a huge database of sensor data monitoring dams and bridges, basically for the stability of those constructions, and they have periodic reports being generated over that data, which goes into, you know, calculating a few models and preparing a report automatically out of them. The other SQL prototype is from a different domain, from information retrieval: the Million Song Dataset, which happens to be the largest benchmark collection in music retrieval settings. The Echo Nest some time ago provided a set of features, not the audio files, because in music IR people are not allowed to share the actual music due to copyright reasons, so it's only the feature sets that are being shared. What happened is that people downloaded the audio from public domain sources and computed additional features from it, and there are lots and lots of corrections happening to that data, either because higher-quality downloads become available and people then rerun the feature extraction on the higher-quality recordings, or because new versions of the feature extraction algorithms are released, correcting for errors. So that's the kind of dynamics that happens there.
The subset selection works in such a way that researchers don't usually perform their analysis on the entire Million Song Dataset, but either focus on a subset of genres, or only select songs that have certain characteristics, such as a minimal length, or certain encoding schemas and frequency resolutions.

What we started working on is different ways of doing the time stamping and the versioning. None of this is really new; timestamped and versioned SQL databases have been around for quite a long time, and the solutions are pretty well known. Either you do an integrated solution, where the timestamps and the versioning are basically added to the schema, which has the advantage of being the most storage-efficient way of doing it; the downside is that the APIs of existing tools that access the data need to be changed. This was possible to do for the Million Song Dataset; we could not do that in the LNEC case, because they have a lot of external programs accessing the database, and changing the database schema would have led to a lot of changes in a lot of tools and APIs that access that database. So for the MSD we went for the integrated solution, and for the LNEC case we went for a separated solution, where the operational data is kept in the operational database as it is, and all the history values are written into a separate history table together with the inserts. That is basically a trade-off: paying for the simplicity of not having to change the APIs with a higher storage demand for the data.

We then added a query store that contains the PID of the query, the original query string, the rewritten query and the query string hash, the timestamp, the hash of the result set, and some metadata: basically the creator, so who created that subset, possibly a description of what the subset was created for, and the persistent identifier of the parent data set, in that case the Million Song Dataset as a whole. One neat thing we learned when we took a look at some of the queries generated during a few internal test runs was that storing the query also gave us a lot of provenance information about the data set. One example we came across was a data set where researchers selected songs of certain genres with, whatever, a minimum length of 30 seconds up to a minute, so removing all the short ones, but then also manually removed a few song IDs, basically having a list of, whatever, seven to ten songs that were manually removed from the result set. Now, that doesn't explain why those songs were manually removed, but it at least documents that a few songs were manually deselected from the result set, which later on allows us to go back and question why these manual changes were applied to the data set, rather than just having a statement that we selected a subset of songs from the database.

Just for those interested, some examples of the query rewritings that we did. We sort the WHERE clause: whether the loudness comes first or the duration comes first doesn't make a difference, so we sort the filter criteria alphabetically. But mind that if the select statement has a different order of the columns, so track ID, artist and release versus artist, release and track ID, then it's not the same query, because the columns in the result set come back in a different order, so we need to treat them as non-identical queries even if they carry the same semantics.
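As a concrete illustration of that rule (the column names follow the Million Song Dataset attributes mentioned above, but the exact queries and values are invented):

```python
# The WHERE clause is order-insensitive, so after sorting the filter
# criteria alphabetically these two hash to the same normalized query:
q1 = "SELECT track_id, artist, release FROM songs WHERE loudness < -10 AND duration > 30"
q2 = "SELECT track_id, artist, release FROM songs WHERE duration > 30 AND loudness < -10"

# The SELECT list, however, is order-sensitive for our purposes: a different
# column order yields a differently laid-out result set, so this one must
# get its own persistent identifier even though the semantics is the same:
q3 = "SELECT artist, release, track_id FROM songs WHERE duration > 30 AND loudness < -10"
```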
And then there was a little bit of rewriting in order to accommodate the history table and the timestamp. None of that is really terribly exciting, just standard technology, but it shows that we did it.

Then the second pilot, on comma-separated value data. This one was actually very surprising, because we hadn't envisaged dealing with comma-separated value data at all when we started this working group, focusing on dynamic data, large-scale data and so on. But during the preparatory meetings for this working group, it turned out that comma-separated value data is so widespread, and people were so interested, that we simply had to deal with it. So we came up with a reference implementation for comma-separated value data. There were basically two potential solutions that we discussed. The simplest one was leaving the data as comma-separated value files and applying a versioning system in the background, something like Subversion or Git, and basically just using that versioning system for the storage, which then does, you know, the diffs, storing them with a timestamp and allowing people to download the file at specific versions. That's one option that I'd still be interested in seeing implemented, because it's actually very trivial to do. The approach that we chose for the reference implementation is a bit more complex, because we also wanted to support subset generation. What we're doing in the background is allowing people to upload comma-separated value files, converting them in the background into a relational database, a PostgreSQL database, transparently to the user, so the user doesn't, you know, take notice of that migration, and providing an access interface that allows subset creation, just to support a more detailed interaction and a few more features of interaction with comma-separated value files.

What we've got within this prototype solution is that people can upload data and can update existing data if data is being corrected. We had that use case with one statistics department: in spite of the statistics being computed very thoroughly, they get updates to the data at later points in time, with errors being corrected, so they have to recompute their statistics, which means they have versions in that data as well. So you can basically upload a new CSV file; it then generates a schema in the background; you identify one column that is the primary key, and then it migrates that into the database. And on the other side, once you've got that data in the background, you can create a subset by filtering according to some of the attributes, and identify a working set satisfying those selection criteria that you can then assign a persistent identifier to.
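A sketch of what that background migration could look like. I'm using SQLite from the Python standard library purely so the example is self-contained, whereas the reference implementation described above uses PostgreSQL; all table and column names are illustrative:

```python
import csv
import sqlite3

def ingest_csv(path: str, table: str, primary_key: str, conn: sqlite3.Connection):
    """Load a CSV file into a versioned relational table.

    Each row gets an inserted_at timestamp; re-uploading a corrected file
    would add new versions of the rows rather than overwriting them, so
    (primary_key, inserted_at) together identify one version of a record.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        cols = reader.fieldnames
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            f"({', '.join(cols)}, inserted_at TEXT, deleted_at TEXT, "
            f"PRIMARY KEY ({primary_key}, inserted_at))"
        )
        for row in reader:
            conn.execute(
                f"INSERT INTO {table} ({', '.join(cols)}, inserted_at) "
                f"VALUES ({', '.join('?' * len(cols))}, datetime('now'))",
                [row[c] for c in cols],
            )
    conn.commit()

# usage sketch: ingest_csv("survey.csv", "survey", "respondent_id",
#                          sqlite3.connect("repository.db"))
```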
We've got two case studies that work with XML data. One has just started very recently and is not completely implemented yet, but it is basically working on XML trees that are used in different settings. What we are currently taking a look at is two different approaches to doing the versioning: either, whenever a value changes, to copy the entire branch, which is the simplest solution but, you know, for very complex XML files it costs a lot of storage space; or a more complex solution, which is to introduce explicit parent-child relationships and basically only copy those nodes that change, which is very, very similar in principle to the solution that we have for SQL databases. For the citation framework, we would again mark those inserts, updates and deletes with timestamps and apply the same principles for the query processing. We're using a query engine and rewriting the queries in that one. The database that we started working with is BaseX, and we adapt the query parser, or we plan to, it's not finished yet, to rewrite the insert and delete statements into replace and rename statements. We then hope to be able to use that query parser as a front end to basically any other kind of XML database, such as eXist-db. So that's still work in progress, and I can't say much more than what is on those slides so far, but at least conceptually it seems to be working.

We have another example from the field of linguistics, from field linguistics, that Dieter Van Uytvanck presented during the breakout session. What they have is an archive of transcripts of recordings, in XML files, with lots and lots of annotations. Updates are very rare, but it happens every now and then that researchers correct the transcripts when they encounter an error. So they usually have, you know, sometimes only a single version of a transcript, with a few corrections happening every now and then. What happens rather frequently, though, is that people want to cite fragments of a transcript instead of the entire XML file, so it's not a citation of a list of files, but really of various branches in the XML tree. What they have implemented so far is that they added timestamps whenever a new version is created. The current system that is online, if you follow that link, only provides the timestamp of the last version; that is the version that is accessible, while the previous versions, assuming there are corrections, are by default hidden from external users and only visible to the original author. That's the way the current permission system is set up, but that's basically a technical limitation that they need to discuss, in order to allow the previous, non-corrected versions to be made available as well if they have been used in a study. But basically, the data is there and can be provided. And together with the handle, they now also store a checksum of the result set and the timestamp of when the handle was executed, so basically the time the data was collected. That seems to be working, and they have a beta version that should be released this October.

I guess I have a screenshot here, yeah. So basically what people can do is create working sets, or what they call virtual collections: a subset of the entire collection that is used for a study, created by filtering for certain criteria. You basically define your query that way, and say this is my working set that I want to do my study on; then this one is stored and can be re-executed and recreated. And again, the nice thing here, although that's not implemented in the current version of the prototype, is that once you have got this timestamped query, you can go back to the original working set as it was generated, but you can also get the working set with the updated data in the database, if any new data has been added, by executing it against the current version.
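For the XML case, a very rough sketch of how a timestamped view over a versioned tree could be reconstructed. This assumes elements carry "inserted" and optional "deleted" timestamp attributes, which is my own convention for illustration, not how BaseX or the linguistics archive actually store versions:

```python
import xml.etree.ElementTree as ET

def view_as_of(el: ET.Element, ts: str) -> ET.Element:
    """Rebuild the subtree rooted at `el` as it looked at time `ts`.

    Assumes ISO-8601 timestamps (so string comparison equals time
    comparison) in 'inserted'/'deleted' attributes on every element.
    """
    clone = ET.Element(el.tag, {k: v for k, v in el.attrib.items()
                                if k not in ("inserted", "deleted")})
    clone.text = el.text
    for child in el:
        inserted = child.get("inserted", "")
        deleted = child.get("deleted")
        # keep a child only if it existed at `ts` and was not yet deleted
        if inserted <= ts and (deleted is None or deleted > ts):
            clone.append(view_as_of(child, ts))
    return clone
```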
Then, one workshop that I learned about only a week before the meeting in Amsterdam was run by VAMDC, the Virtual Atomic and Molecular Data Centre. I only got a report of the workshop, so I can only provide information on a fairly superficial level. Basically, what the data centre is, is an infrastructure of 41 distributed databases. They basically store data with the same semantics, but they use completely different database technologies underneath, in a distributed setting. What they have is a standardized XML format for exchanging data, and you can access the data either by accessing the various nodes independently, or via a central portal where users can create a query; that query is then forwarded to the various nodes, which retrieve the data and send it back to the central node, which then compiles the results and returns them to the user. What they did is they ran a workshop in order to see how far their system supports citation of data, and they found out that basically all nodes could independently modify, add and delete data, and thus there was no way any previous study could be reproduced, and no way to identify the result of a query at an earlier point in time.

So in a first stage they wanted to apply the solution that we have developed in the working group, and here comes a very interesting challenge: this is the first setting that we have where it's actually a distributed database that needs to be dealt with. They came up with questions like, you know, would they need to have a separate query store on each node, or would it be sufficient to store the queries at the central store; so how do we apply this approach to a distributed setting? This is something that we only had a chance to discuss very briefly at the meeting in Amsterdam, so I can only give a preliminary answer. But the current feeling is that we would recommend having each node process the query, as it does currently, completely independently, with its own time stamping and its own versioning system, because it would be very hard to ensure synchronized clocks across such a heterogeneous architecture. So each node can have its independent time stamping, its independent clock, and its independent versioning system; but then at the central node, the result sets would be aggregated together with the various timestamps of the nodes and collected there, and on re-execution the query would be distributed again to the distributed nodes with the respective timestamps that the individual systems had assigned. This is as far as we've got with the solution so far. What they did for now is they started implementing the versioning approach and the local time stamping, and they adapted, or they will start to adapt, the XML format for communicating between the various nodes to support that kind of time stamping, and we need to discuss the best way to deal with that in this federated architecture.
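To sketch that preliminary recommendation in code: the structure below is my own illustration of the idea, not anything VAMDC has implemented:

```python
from dataclasses import dataclass, field

@dataclass
class FederatedCitation:
    """One citable query in a federation: each node keeps its own clock and
    versioning, and the central portal only records which node-local
    timestamp belongs to which node, so no clock synchronization is needed.
    """
    pid: str
    query: str                 # the query as posed at the central portal
    node_timestamps: dict = field(default_factory=dict)  # node id -> local ts

    def record(self, node_id: str, local_ts: str):
        # called when a node returns its result set with its local timestamp
        self.node_timestamps[node_id] = local_ts

    def redistribute(self):
        # on re-execution, each node receives its own original timestamp
        # and re-runs the query against its own timestamped, versioned store
        return [(node, self.query, ts)
                for node, ts in self.node_timestamps.items()]
```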
Last but not least, a workshop that we ran with the NERC data centres in London. At this disciplinary meeting, we basically got together to identify and work on four data sets; I've got them on the next slide. For four different NERC data sets, the butterfly monitoring scheme, the ocean buoy network, a sociological archive, and the national hydrological archive, we tried to figure out how the kind of data citation that we are proposing in the working group could be applied. And we took a look at it from four different perspectives: from the perspectives of the data user, the data depositor, the data centre, and the data publishers, journals basically, who would then need to carry the citations. We basically tried to identify the strengths and weaknesses of the approach, and found issues such as, you know, that PIDs were very welcome, but some of the repository owners were worried about having to manage too many different PIDs, thousands of PIDs, one for each potential query. We talked a little bit about attribution: how can both the author of a subset get attribution, and also the data centre that hosted the data, and things like that. So we found out, for example, that we needed two different kinds of PIDs, one for the data set as a whole and more fine-grained PIDs for the actual subsets, and we discussed how these could be connected. Overall we came up with a number of recommendations for the various pilots in NERC. The Argo buoy network will now move forward to actually implement a prototype solution, and other NERC centres will investigate in more detail the effort required to deploy it. And we'll have a similar workshop with ESIP in January in Washington, because we found out that this really is the best way of testing the approach: picking a few concrete data sets and working through the scenario, to test the effort and to see how we can apply it.

So that's basically it. We have some pilots in the making, and we have some new workshops planned. What we've learned so far is that it's really very helpful to work with concrete data sets: to run through it first conceptually, to see how it can be done and how it meets the requirements, but then to also try to implement it, at least in an initial couple of examples, to test the viability of the approaches and see how we can optimally implement them. So that's the current stage. Here are a few links to, well, the website of the working group, the mailing list, the web conferences that we have at regular intervals, and the list of the pilots that we have collected initially. And that's basically it. Thank you.