Welcome, everyone, to this webinar, Toward a Common Approach to Data Versioning. It's really an information session about the progress of the RDA Data Versioning Working Group, and a chance to get your feedback on the initial draft. This is how we've structured the webinar today. Firstly, we'll give an introduction to the RDA Data Versioning Working Group, its timeline and why we started this working group. Then we'll talk about the work coming out of this working group, which includes versioning practice, the patterns and the recommendations. Then we'll present the draft report and how it is contributing to other working groups, such as the W3C and other RDA groups. Lastly, we'll finish with a plan for the next six months. So now over to Lesley to talk about the RDA Data Versioning Working Group. Okay. There are probably a few of you who don't even know what the Research Data Alliance is. It's an international group, and I have the link there. It provides a neutral space where members can come together to develop and adopt infrastructure that promotes data sharing, data reuse and data-driven research. It currently has just over 8,000 members from 137 countries. So it is quite a big group, and hence when you write something, it is going to reach quite a few people. It currently has 101 working groups and interest groups, and the difference between them is that a working group has targeted outputs in 18 months. From the word go, you have 18 months to produce something. The interest groups just keep going on and on as long as there are people interested in the topic. You can see the diversity of groups if you go to the home page. The particular one I work with is the Data Versioning Working Group, and one of the things you can do is click on that link, join the page and keep getting updates on what's going on.
It's led by Jens, myself, Ari Asmi, who is a big leader in the ENVRI-FAIR project in Finland, and Bob Downs of Columbia Uni, who has a lot to do with the NASA DAACs. The other thing that the RDA does is hold plenaries twice a year. We in Australia were starting to have problems with data versioning, so Jens and I ran a session in September, what we call a BoF, and we got enough interest that at the next plenary we said, well, okay, can we form an interest group? We carried that through to the next one, in 2017 in Montreal, where it was decided that, you know something, we actually can create a white paper on the story of data versioning. So we formulated and started that group at the Botswana plenary, and we have literally until Helsinki in September to get this final report and recommendations up. If the Technical Advisory Board adopts it, it becomes an output and a reference on data versioning for all those members I showed earlier in my first slide. So why are we doing this? There have been attempts to do versioning in the past, and there was another group in RDA who did it, but that tended to be more about: I've got a database, and it's so small I can actually snapshot it and create a version. But within Australia, and again with ANDS, as they then were, now the Australian Research Data Commons, we were getting much more complex patterns than that, particularly from the big data people in, say, Geoscience Australia with Earth observation, and the climate community. We're now starting to see people producing these data products that make data easier to access. Certainly in my world at NCI, we were moving data to the compute, and we were getting these massive data sets and parallelising them. We were creating them by aggregating data from multiple sources, and people were adding to them all the time, and then, because it was so easy to create a new version of them, they were creating versions of them.
When we did that, a lot of controversies started to come out: who contributed what to what, who acquired it, who's the owner, who's the custodian? And above all, another reason was provenance. There was a lot of work going on in provenance at the time, and technically people had worked out how to link data, but nobody knew how to version it and define exactly what a version is. So really it's more complicated. I've borrowed this slide, and thank you to the GA Earth Observation people who pointed this out to us: in actual fact we need to start talking about what I call a data pathway as well as data life cycles. We get the raw data, then we calibrate it, and then we go into these data products and analysis-ready data. The questions that are really starting to come at us are: hang on, what is a new version, when is it a new product and, more importantly, with the world of identifiers expanding, when do we put a data identifier on it? This led to this body of work, and ANDS at the time said, yeah, look, let's see what we can do. And all you wonderful people in Australia, you were probably the most active, contributed a lot of use cases and gave your time to Jerry. So what we want to do here is firstly acknowledge what you did, then take you through what we've done with your contributions and how we're starting to work them into this white paper. We welcome you one and all to see if you agree with what we've done with your use cases and then, more importantly, if you want, to join through that web page I gave you for the working group, help us contribute to that white paper and make sure your use cases are represented. So now I'll hand over to Jerry, I think, who is next and is going to talk about how she took all our use cases and what she did with them. Yes, thanks Lesley.
So, as Lesley mentioned, a key driver to take a closer look at data versioning came from the work of another RDA working group, on data citation, which recognised the need for systematic data versioning practices. But the scope of their recommendations was limited to databases where versioning and timestamping of queries would allow for the retrieval of specific subsets, and what we knew was that there were many use cases that would fall outside those sorts of parameters. So we set about the task of compiling use cases that really didn't fit that model, and we ended up collecting 16 use cases from 14 different organisations. The use cases came from across astronomy, the geosciences, biosciences, social sciences, oceanography and climate science. We tried to be as inclusive of domains as we could and to represent use cases broadly. We also sought to include different data types, so we have use cases that describe remote sensing data, satellite data, sensor data, imagery data, sequence data and software, again trying to be as representative as possible across domains. I won't read out all of these on the screen, but I wanted to indicate the breadth of the use cases that we collected. In some cases the process started with documentation that was publicly available or made available to us, where organisations do have procedures around versioning of their data. In other cases the use cases were collected via a discussion; we then wrote up the use case and asked people to review it and ensure it was accurate. You can see that we have a really strong representation from Australia. That's obviously because we know the community in Australia, and we were really delighted at the willingness of these organisations to contribute their use cases to us. On the previous screen you will have seen the link to the use case document, and we would be delighted if you would go in,
have a look at the use cases and make sure that you're still happy with them, and if you have others that you wish to contribute, please let us know. In particular we'd like to thank the people listed on the screen now, who kindly gave their time to provide use cases to us. What we did then was synthesise all of those into a single document, which allowed us to work through them to identify patterns and differences between the use cases, and then start to formulate a picture of current approaches to versioning and of where there were similarities and differences between organisations and data types. Ming will talk us through some of those now. So we analysed the 16 use cases that were collected or contributed and found inconsistencies across them. For example, there is inconsistency in how the magnitude and significance of a change are judged, and inconsistency in naming a version: across the use cases, and even within similar organisations, there are different ways of naming it. There is inconsistency in documenting the changes, where some just put a new version there without any documentation. There is also inconsistency in the links between versions and in linking versions to the same data product, and inconsistency in data citation. Those are the things the working group is picking up on, working on and making recommendations about. The recommendations build on the analysis of the use cases and on two solid bodies of existing work: one from software versioning and another from library people, in records management. On the software side, there is established software versioning practice in software development. Apart from the version control systems, platforms and so on,
most relevant to us here is semantic versioning. In software development, a version number has major, minor and patch fields, and each has its own meaning. For example, a patch is usually a bug fix, for developers to track their changes; a minor version indicates a functional change that is backward compatible; and a major version indicates a change that is not backward compatible. So these three fields, expressed as numbers, have semantic meaning behind them. The other work we build on is the Functional Requirements for Bibliographic Records (FRBR) model, developed in library science. There are four entities related to each other, and this is how they relate to versioning. A work is a distinct intellectual or artistic creation, so for data I would think of it as the research project that prompts the collection of data. An expression is a specific intellectual or artistic form of a work, so in our area it is the actual data collection, and all its versions can be called expressions of the data or research project. A manifestation is a physical embodiment of an expression, so each data expression, or each version of the data, may exist in different forms: it may be in CSV, in figures, in papers, in a relational database, and each of them is a manifestation of the data. And at the bottom level, an item is a single exemplar of a manifestation, so for us each copy of a data set would be counted as an item. Apart from entities, this model also gives relationships between entities. There are three classes of relationship here. One is the derivative relationship, which exists between a work and a modification based on that work; in the data world we would say a version of a data set, where one version of a data set is derived from its precursor or source. The next group is the descriptive relationship, which exists between an entity and its description, criticism, evaluation, etc. One example is that all versions of a data set result from this data or research project, and also that a publication is supported by this data set, for
example. And the third one is the equivalence relationship, which exists between exact copies of the same form of a work. Say we feed a data set to different repositories; we are then able to say that the data set from repository A is equivalent to the one from repository B, for example. So we are able to draw on these two bodies of work, the FRBR model and software versioning practice, and on that basis the working group makes recommendations around these patterns. The first three patterns are about identification: identification of each change, identification of the different forms of data, and identification of single objects and collections. The fourth pattern is about provenance: how to keep the relationships from each manifestation to its expression and between versions of data products. The last one is communicating the significance of the change, and I think that's probably also part of provenance, so provenance is not only about relationships but also about the actual changes. So that last slide that Ming gave you is, at the broad level, what we felt the use cases you gave us could fit into. The next stage is that we need to get it into this group report, so I guess what we're re-emphasising is: make sure you feel that the work you gave is reflected in those use cases and that you're happy with where you've ended up being slotted in. We've got the draft report, and I can tell you it's a draft report; it's very drafty, you'll get double pneumonia if you stand near it. It's mainly there in the hope that at Philadelphia in a couple of weeks' time Jens, Ari, Bob and I can walk people through it. The key things in that report are what changes constitute a new data version, how data versions are identified, how to plan for new data versions and releases, documenting changes between data versions, and what the challenges are for data versioning, and we could write War and Peace on that. In the next slide you can see that the two key concepts are that we care about which data set is to be identified as a version and
communicate the significance of this change to the designated user community of this data set. If you go to the next slide, I just thought I'd show you this figure, which is another one I picked up on FRBR. You can see how it goes through equivalent, derivative and descriptive, and then, down the bottom, those triangles where you're trying to work out whether it's the original, an equivalent, or the same work as a derivative work, moving over to the phase where at some stage you will say this is a new work: it's not that compatible, it's something I have to identify as something new. So again you can sort of see it, and this is also some feedback that Simon Cox gave me about the W3C. Partly why they're not going ahead with that in the current revision they're doing is because they got so many facets of versioning that more work needed to be done. So it comes back to your organisation, or wherever you're working, having clear policies on how you are going to version your work, and being able to communicate what the differences are between your versions. So now I'll hand back to Ming, who's going to talk about the concepts that I just elaborated on and what topics we want to cover. And again I'll go back to you all to say, please help contribute: if you want something more, write to us or go and type it into the report directly. The link's been handed out. Thank you. Ming? Okay, I will present six recommendations, just to get your feedback. The first one is version control and revisions. Anything that is different from its precursors is a revision. In any case, each revision should be version controlled and should be recorded, but should each revision have a persistent identifier? The recommendation here is that each change, each revision, should be identified. There is a recommendation from the data citation working group: because they deal with dynamic data sets, they recommend timestamping each change. But how that is required of each data repository will just depend on your policy. Next, the
second recommendation is identifiers for data set revisions. If a revised data set produces a new entity, and what we mean here, if we think about the FRBR model, is that the new entity becomes a new expression of the work, which means there is enough intellectual work put into that change and it also has significant influence on the research it underpins, then it should have a new identifier. Again, whether this new identifier is local to your repository, a persistent identifier or a global persistent identifier will depend on the policy of the data repository. But in any case the revision, the change made to the data set, should be communicated to users through the catalogue. The third recommendation is to identify releases of a data set. You have revisions of a data set, but you may not release all revisions. When you reach the point of releasing one particular version, and we need to formalise the wording, but by release here I mean that you publish and make the version available to the public, then this release should be accompanied by a description of what changes were made and should be identified in its own right. The fourth recommendation is how to deal with data collections and the data objects within a collection. The argument here is that the collection itself should be versioned and have an identifier, and each item or each data set within the collection is itself a work; if it is itself a work, then it should also be identified and have a revision history. So at both the item level and the collection level there should be a revision history and an identifier. The fifth one is also about identification, or identifiers: identifying manifestations of a data set in different forms. If you compare them, the bytes may be different, but the content is the same, just in a different form. Whether or not to give a PID for each form or manifestation would depend on the use of each form. The last is the provenance of the data set. As I said in the pattern before, we should
record the relationship between each version and its work, the data project or data product. However, the linkage is only one kind of provenance, and we should also retain the information on the provenance of the data set, so if the PID changes with any change, that should be included in the data record as well. The last recommendation is for data citation. Data citation has version as an optional element of a citation. The reason it is optional is that not every resource has a version, but if the data set you released has a version, you should include that version in the data citation. DataCite recommends using semantic versioning, as I talked about before, with a new identifier for a major release. That recommendation is for data providers to use semantic versioning, but the DOI itself may not reflect that semantic versioning, and the citation itself may not show the relationship between the versions. But in the data record that you submit to DataCite you probably should have that relationship described. It may not be in the citation, but when people follow the DOI to the landing page they may be able to see the relationship to other versions. So those are the six recommendations we have come up with so far: four of them are about identification, one is about provenance and one is about citation. We would like your feedback: does this cover your use cases, or do you think there are other recommendations we should provide? Apart from these recommendations, the Data Versioning Working Group is also working with other working groups. The example here is the W3C Dataset Exchange Working Group. This working group is working on DCAT, the Data Catalog Vocabulary, and for the purpose of developing DCAT they also have elements of data versioning. Here are the five use cases they have: being clear about what is subject to versioning, providing a definition of versioning, what identifier should be given to each version, and encoding the release date and the version delta; that's provenance information. So what we
did is we looked at the 16 use cases and identified six of them here, and they are in four categories. These use cases either provide concrete examples or requirements for the existing use cases they already have, or raise new requirements. For example, should we have a vocabulary for change types? We see patterns in the changes they have: add, delete, modify, different categories. Should we have a vocabulary so that we can communicate better to users what changes have been made? And how do we formalise the naming and numbering of versions? So that's a contribution to the W3C working group, and we discussed with them which use cases they will take from here. Okay, so I'll go back again. Some of you may not be all that familiar with the Research Data Alliance, but this is a list of the interest groups or working groups that have approached us because our work is relevant to them as well. As you can see, it hits on provenance patterns and software source code, and the other one is data foundations and terminology. When you're trying to work with an international group, as the previous slide said, we're going to have to start to get clear on the vocabulary. So we've still got a lot ahead of us. But as I said earlier, if you are a working group, they give you 18 months in the Research Data Alliance, so we're heading up to our 12-month mark in Philly in a couple of weeks' time, and this is the draft report we've talked about, which we'll be burning a lot of midnight oil to finish. We'll also be having a video call, if you want to listen in, an international meeting with the team at 10 o'clock, whatever we're on, daylight time, this Monday. And then before Helsinki, which is P14, Plenary 14, in September, we will have to have the final report and recommendations submitted so that at Helsinki we can actually do a proper handover. On the next slide, this is from the chairs of the working group, that's myself, Ari, Jens and Bob. We'd really like to thank you, because the Australian use cases were
outstanding; I'd say at least 50% of them came from Australia. Thanks also to those who joined in discussions at the plenary sessions and along the way. We are very, very grateful to ANDS, as it then was, now ARDC, and to Ming, Jerry and Julia, because without them we would not be in the state we are. When the use cases came in they were a wee bit overwhelming, and we wondered how ever we would get this out, but those three were just wonderful. We've also had a fair bit of support from the RDA Secretariat, particularly Stephanie and Tobias, for us to keep going and produce something out of this work. So thanks again to everyone, and I'll now hand over to Jerry, because she will organise the questions. So that's the end of our formal presentation. If you have questions or comments, please put them in the question pod, or at least just flag that you have a question and then we can open up your microphone and you can speak to the group. To kick us off, Julia has indicated that she likes the use of FRBR: a good choice for humanities data, where FRBR is already familiar. It would be good to see a worked example of how this would work in practice for a live data set. Is the group thinking about putting together a few real-world examples, mocking up a versioning example of one or two of the use cases? Probably a question for you, Lesley? Is that Julia from the National Library who has contributed that?
Oh, thank goodness for that. We've actually spoken to a couple of librarians who were using FRBR, and they look at us and say, oh, that was last century, what are you using it for? But yet in our research it was the only thing that had a conceptual model that expressed what we're doing. I was hoping you would come on this call for this very reason, and to start a revival in the libraries. We'd love to work with you if you want to join us, or can put us onto someone who can work it through with us, because it was actually having those different concepts of manifestations of works that started to get us through the complexity of what's happening with digital data sets. I don't know whether Simon Oliver is on the line, but he was a pain in the neck with the observation data, because we just looked at it and realised how complex it was, but realised we had to do something because snapshotting was not possible with his work. So if we can get in contact afterwards and follow through with you, we'd just love it, because at the moment we don't know anyone else who is using FRBR, and as I'm a geologist I feel like I'm using something from the fossil age. Thanks, Lesley. Sorry, go ahead. Yeah, Jens Klump, who's a co-chair of this working group, is a fan of the model, and he's trying to use this model for many other things. Okay, thank you. So Nicholas has asked whether we considered the Portland Common Data Model. I am a geochemist; I have not heard of it. So can you please send us details? So I think, Nicholas, you have come across something that the group hasn't heard of before, and of course we're open to any input, so if you could pass that on to Ming's email address or my email address, we'd be very happy to receive some further information about that. So, any last words from Lesley?
Yeah, I'd just like to say again that we want you to contribute; the Australian use cases were the main input to this. So thank you again. Okay, well, if there are no further questions, we'll close off. This webinar has been recorded, and the recording will be made available in the next few days; anyone who's registered will receive that recording. Please feel free to share it with your colleagues, and if you want to refer back to it, of course it's a resource there for you as well as for us. So thank you all for your contribution to the use cases and for your participation today. We thank you and wish you a good afternoon. Bye for now.
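
The major, minor and patch convention described in the talk, together with the suggestion that a major release receives a new identifier, can be illustrated with a minimal Python sketch. Nothing here comes from the working group's report: the class name `DataVersion`, the helper `needs_new_identifier` and the rule it encodes are assumptions made for illustration only, and a real repository would apply its own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class DataVersion:
    """Hypothetical three-part version label for a data set (major.minor.patch)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "DataVersion":
        # Expects exactly three dot-separated integers, e.g. "1.2.3".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, change: str) -> "DataVersion":
        # "patch": a fix; "minor": a backward-compatible change;
        # "major": a change that is not backward compatible.
        if change == "major":
            return DataVersion(self.major + 1, 0, 0)
        if change == "minor":
            return DataVersion(self.major, self.minor + 1, 0)
        if change == "patch":
            return DataVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def needs_new_identifier(old: DataVersion, new: DataVersion) -> bool:
    # Assumed policy for this sketch: only a major release gets a new identifier.
    return new.major > old.major


current = DataVersion.parse("1.2.3")
print(current.bump("patch"))  # 1.2.4
print(current.bump("minor"))  # 1.3.0
print(current.bump("major"))  # 2.0.0
print(needs_new_identifier(current, current.bump("major")))  # True
```

Resetting the lower fields on a minor or major bump follows common semantic versioning practice. Whether a minor or patch revision of a data set should also receive its own identifier is exactly the kind of policy question the speakers leave to each repository.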