Good afternoon, everyone. My name is Xiaoping Shen, and I'm from ANDS. Unfortunately, my colleague Natasha is not feeling well today, so I will be your host. My colleague Susanna is co-hosting today's webinar with me. This webinar is part of a series co-sponsored by ANDS and the Council of Australian University Librarians on the theme of research data information integrations. Our previous three webinars covered DMP tools, systems for managing ethics, and data storage. The recordings of those three webinars are available on the ANDS YouTube channel, and today we will talk about data publishing. First of all, we would like to acknowledge our co-sponsor, the Council of Australian University Librarians, and thank them for their support. Secondly, we would like to acknowledge the Commonwealth government for their support of ANDS and the NCRIS program. So, with that, let me introduce our first speaker, Dom Hogan. Dom is from the CSIRO Research Data Service support team, on the Information Management and Technology side. Today, Dom will talk about data publishing at CSIRO. Dom, over to you.

Yeah, good day, everyone. Thanks for having me along. Just to explain the broader context of data in CSIRO, I have a little slide here showing what things looked like about ten years ago, or maybe a little more than ten years ago now. At that time we had a number of different units, called divisions, and each of these pretty much ran its own show. They got their portion of CSIRO's funding, they had their own departments, and they had their own libraries. There was collaboration between divisions, there was collaboration between the libraries, and there was a CSIRO library network, but all in all there were varying standards of information management throughout the organization, simply because the divisions were run separately.
So, around that time, there was a change: the formation of Information Management and Technology, one service for all of the divisions in CSIRO, which at the time included IT, libraries, and records. That allowed us to take a unified approach to things like data storage, networking, and computing infrastructure. Two of the things that came out of this were the Publications Repository, where we were able to merge our legacy publication citations and also have a unified approval system for new publications from 2009 on, and data.csiro.au, which is our data repository. Now, as I said before, among the many organizations we had within CSIRO, some had very high data management standards and others had much lower standards, and not due to the needs of the research. So, bringing in this data repository, there would have been some parts of CSIRO that felt they had things pretty much under control and weren't really in need of a new repository, whereas other parts of the organization had been really crying out for this sort of thing. The goals of this repository were to provide consistent access to data, and also version control, which enables the reproducibility of scientific outputs, because you can get down to individual versions of what you have. The other goal was self-service: we wanted CSIRO researchers to be able to just log on, create their own data collections, and write their own metadata. Now, that can sound a bit scary, asking researchers to write that metadata themselves. So far, what we've found is that the people using it are the people who really want to use it; they want to get their data out there, so they usually put in the effort to write fairly decent, standard metadata.
There's also an approval system in there, so the approvers can go through and suggest reviewers, and a little bit of peer review can go on. Another goal was scalable storage. As we have many terabytes, or petabytes, of data coming towards us in the future, we needed a way to store it. The way things were done before, you would have, say, a file server with as much disk space as you could put in it, with files sorted however they happened to be, and an expectation that those files would be available to me right now. But that's a very expensive way to host data, especially when you're getting into the petabyte scale. So what the DAP, and other parts of CSIRO storage, are going for is a set of storage categories, so that data that needs to be preserved, but doesn't necessarily need to be accessed instantly at any given time, can sit on tape; when it's requested, it's loaded from tape onto disk, where it can then be accessed. I've taken this slide from Ian Corner and Renate Tihouse, who work in the storage area of CSIRO. It's a model of their idea of a scientific workflow, where in our cloud we have the various storage categories. The researchers don't need to think about the individual machines hosting any of this data; it's somewhat abstracted from them. All they have to know is the address, and the storage team takes care of the rest. They don't need to migrate between servers when there's an upgrade to the hardware; they can just keep pointing to the same address. And then you have these different storage categories, like input, verified, work, and static, which are stored on different media optimized for different uses. So the static reference data may be sitting on tape and then reloaded later when we need to reprocess it.
You can also see that the idea is that as a project goes on, the data moves through various quality control processes and eventually to publication. The publication part, I guess, is where the Data Access Portal comes in, but not always; sometimes it can be a domain repository, and various parts of CSIRO have their own ways of managing data. But we're trying to move towards having one unified repository that at least catalogs everything in CSIRO, and we're making progress. So, the Data Access Portal itself. Hopefully no one can read this, because I realize there are a few errors in this diagram, but what I'm trying to represent is that the Data Access Portal is not really just one system; it's actually a few systems that play together nicely. I don't know if anyone can see my mouse cursor here, but we have a CSIRO researcher entering metadata into the user interface and, at the moment, uploading their data to an SFTP staging server, which is all fairly straightforward; most researchers get that done without ever asking for any help. Then, of course, you've got a database that stores the metadata, and also assesses the data files that come in and records some metadata about the files as well. That all gets sent off to this thing called the logical collection manager. This is the thing that takes in the requests for data, or the submissions of data, and decides what to do with them. So if someone is requesting data that happens to be on tape, meaning our research community is asking either the user interface or the web services interface for some data, then the logical collection manager says: OK, that's sitting on tape, so I'm going to need to load that onto disk so that somebody can then download it.
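That tape-versus-disk routing decision can be pictured with a minimal sketch like the following. The class, method names, and return values here are hypothetical, purely for illustration; this is not the actual DAP implementation.

```python
from enum import Enum

class Tier(Enum):
    DISK = "disk"    # online, immediately downloadable
    TAPE = "tape"    # offline, must be recalled to disk first

class LogicalCollectionManager:
    """Hypothetical sketch of the request-routing logic described above."""

    def __init__(self, catalog):
        # catalog maps a collection ID to the tier its files sit on
        self.catalog = catalog
        self.recall_queue = []

    def request(self, collection_id):
        tier = self.catalog[collection_id]
        if tier is Tier.DISK:
            return "ready"    # user can download straight away
        # On tape: queue a recall; the user is notified once staging is done
        self.recall_queue.append(collection_id)
        return "staging"

lcm = LogicalCollectionManager({"coll-1": Tier.DISK, "coll-2": Tier.TAPE})
print(lcm.request("coll-1"))  # ready
print(lcm.request("coll-2"))  # staging
```

The point of the abstraction is that the requester never needs to know which machine or medium holds the files; they only see "ready" or "staging".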
The thing about the tape is that the main delay is not the data transfer; the tapes themselves move data very fast once they're loaded. It's the waiting for the robot to actually load the tape. So even for a collection of, say, hundreds of gigabytes, it typically only takes about 15 or 20 minutes before somebody gets the notification saying: here are the files, you can start downloading. I also wanted to talk about some of the other systems that feed into this, because what we're discovering is that we need to set up what we're referring to as a data ecosystem: various services and utilities that interact with each other to provide a broader network of data capability. Here I've got this big database store labelled SAP; that's our organizational information system, although I believe there's actually more to it than just SAP, so take that with a pinch of salt. Anyway, that's one system, but it doesn't necessarily present its data for easy use by other systems. So one of the teams in IM&T has created a series of web services, really just for developers (I don't think there's much in the way of a user-friendly interface to any of this), that take information from SAP and from other sources in CSIRO, like the Publications Repository, and format it in a way that any other service or application inside CSIRO can use. The Data Access Portal grabs our business unit information and our project information from SAP, but through these web services, so we don't really need to know how SAP works; we just need to know how the web services work. They're a really great piece of infrastructure that sits underneath things. I think a lot of the glory goes to the end applications that wind up doing great things with them, but these basic web services are what support and enable that.
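The pattern described here, where a consuming application only depends on the web-service layer and never on SAP itself, can be sketched as follows. The endpoint behaviour, field names, and example values are all invented for illustration; only the general idea (a thin JSON lookup that hides the source system) comes from the talk.

```python
import json

# A canned response, standing in for what such a service might return.
# The field names and values here are hypothetical.
sample_response = json.dumps({
    "projectId": "PRJ-0042",
    "title": "Ocean wave hindcast",
    "businessUnit": "Oceans and Atmosphere",
    "leader": "J. Citizen",
})

def fetch_project(project_id, transport):
    """Look up project details via the web-service layer.

    `transport` is any callable that returns the service's JSON text
    for a given project ID, so the caller never touches SAP directly.
    """
    return json.loads(transport(project_id))

# In production the transport would perform an HTTP GET; here it is stubbed.
record = fetch_project("PRJ-0042", lambda pid: sample_response)
print(record["businessUnit"])  # Oceans and Atmosphere
```

Because the consumer only knows the service contract, the underlying organizational system can change without breaking the portal.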
So this is just a screenshot of the organizational data coming through to the Data Access Portal. The user is entering the information; they've selected their business unit, and then they select their team. There's other information in there too: you can get project information, who the project leader is, and things like that. And of course we could integrate more of that in the future. For instance, there's an interface to our Publications Repository, so conceptually we would be able to say: what are the related publications? From these, you're listed as an author, that sort of thing. Another thing that happens with the DAP is that various research groups already have metadata in their own systems and databases, and they wanted to get those things into our repository. These formed pilot projects. We had a microscopy group with a very complex database and a lot of information about their microscopy images, and they wanted to transfer that over. So they've got their own interface for doing that, which semi-automates the process: the system grabs the metadata out of their databases and then gets the user to just polish up the record, complete the DAP collection record, and then they have it in the repository. A similar thing goes on with the astronomy collections; the first one we set up was for pulsar observations. That represents a very large volume of data that also gets used quite a bit, including internationally, so it's been quite a success. And because they have very specific information about their radio astronomy data, they have a custom search. A recent addition is another set of radio astronomy data, this one from the Australian Square Kilometre Array Pathfinder (ASKAP) project. I have a slide about that now.
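The semi-automated pilot ingest described above amounts to pulling fields out of a group's existing database or metadata record and pre-filling a draft collection record for the researcher to polish. A minimal sketch, using a heavily simplified stand-in XML document (real records such as ANZLIC profiles are far richer, and the element names and record shape here are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Heavily simplified stand-in for an exported metadata record.
incoming_xml = """
<metadata>
  <title>Soil moisture survey 2014</title>
  <abstract>Field measurements across the Murray-Darling Basin.</abstract>
  <keyword>soil</keyword>
  <keyword>hydrology</keyword>
</metadata>
"""

def draft_record(xml_text):
    """Map an exported XML record onto a draft collection record."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "description": root.findtext("abstract"),
        "keywords": [k.text for k in root.findall("keyword")],
        "status": "draft",  # the researcher polishes it before publication
    }

rec = draft_record(incoming_xml)
print(rec["title"])     # Soil moisture survey 2014
print(rec["keywords"])  # ['soil', 'hydrology']
```

The draft status is the key design point: the machine does the tedious copying, and the human only reviews and completes the record.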
This is just a sort of indication of the workflow. It really ends at one point, in the Data Access Portal, but there's a lot that happens before that. These are antennas, and eventually there will be something like 32 of them, and they transfer a tremendous volume of data to what's called the correlator. In fact, the volume of this data is really too big to store; it would be extremely expensive to set up a supercomputer that could do that. So what the correlator does is compress that data down into something that can be transferred to the Pawsey Centre, where it gets processed into what is actually stored. I think at full capacity it's going to be something like five petabytes a year; I didn't write the number down, but it's a truly scary volume of data they're dealing with. So they crunch it down and store what they're going to store in the data archive. Then they have several different ways of interfacing with it. They have what they call the CASDA application, CSIRO's ASKAP Science Data Archive, and the astronomy community generally uses virtual observatory tools, which are like programming interfaces for accessing and querying the data. Because you can't really download this volume of data (that would be crazy), they use these programming interfaces to query just the data they're interested in. But part of that has also gone into the Data Access Portal, so people can use that user interface to search for data and request just the small portions they actually want to download and use. And then this is another one of the systems the Data Access Portal interacts with: an OPeNDAP, THREDDS-style server. This is just one example of a fairly popular collection we have, which is a hindcast of ocean waves.
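The "virtual observatory tools" mentioned here are standardised query interfaces. For example, an IVOA Simple Cone Search is just an HTTP GET that asks for sources within a radius of a sky position. A sketch of building such a query; the base URL is invented, while the RA, DEC, and SR parameter names come from the IVOA Simple Cone Search standard:

```python
from urllib.parse import urlencode

def cone_search_url(base, ra_deg, dec_deg, radius_deg):
    """Build an IVOA Simple Cone Search query: sources within
    `radius_deg` degrees of the given right ascension and declination."""
    params = {"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg}
    return f"{base}?{urlencode(params)}"

# Hypothetical service endpoint; query a 0.5-degree cone around a position.
url = cone_search_url("http://example.org/casda/scs", 187.25, -45.1, 0.5)
print(url)  # http://example.org/casda/scs?RA=187.25&DEC=-45.1&SR=0.5
```

This is why the approach scales: the archive holds petabytes, but each query ships back only a small, targeted result set.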
What they do is create NetCDF files, and the NetCDF files have embedded metadata: information about what the fields are, and we can see over here that we have information about the layer and what the units are. There's a whole catalog of various datasets stored this way, and the beauty of the OPeNDAP services is that they provide various methods of accessing the data by default. Here I've shown an example of the mapping service. This is just running in my browser, but in theory another spatial portal could link to the data services that run on top of this and access the data in their own custom mapping service, for whatever purpose they have in mind. What we have in the Data Access Portal is a set of metadata for the collection at large, which just points to these OPeNDAP services. At the moment we only have a handful of collections that use this, but they are very large, and the hope is to improve this so that any researcher can point to an arbitrary OPeNDAP server and say: this is where my data is. That could include services at NCI or other institutions. Now, Oceans and Atmosphere, which used to be called Marine and Atmospheric Research, have had their own metadata catalog for quite some time, and they have their own data centre with various databases and ways of accessing things. What came up for them is that the new ship, the RV Investigator, is collecting a lot more data than the previous vessels ever did, because it has a lot more instruments, and higher resolution instruments. So they were looking for a way to store this data securely and also minimize the problems of transferring very large volumes of data over the network.
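Part of OPeNDAP's appeal is that a client fetches only the slices it needs, expressed as a constraint appended to the dataset URL. A minimal sketch of building such a subset request; the server address and the variable name `hs` are invented for illustration, while the bracketed index-range syntax is standard DAP2 constraint notation:

```python
from urllib.parse import quote

def opendap_subset_url(base, variable, *ranges):
    """Build a DAP2 ASCII subset URL of the form
    base.ascii?var[t0:t1][y0:y1][x0:x1] for the given index ranges."""
    constraint = variable + "".join(f"[{lo}:{hi}]" for lo, hi in ranges)
    return f"{base}.ascii?{quote(constraint, safe='[]:')}"

# e.g. a wave-height variable: first 10 time steps over a small grid window
url = opendap_subset_url(
    "http://example.org/thredds/dodsC/waves/hindcast.nc",  # hypothetical
    "hs", (0, 9), (100, 120), (200, 220),
)
print(url)
```

A mapping portal or analysis script issues requests like this instead of downloading the whole multi-terabyte collection.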
So what we do is grab metadata from their system (they can still write it the way they're used to), and they transfer the tapes to our system. We can just plug those tapes straight into our tape library, and it behaves like other DAP collections that have already been prepared. What happens then is that they're able to share the data with their university collaborators for the quality control process on the data that comes off the ship, and then prepare it into data products that go public. Now here's another example: Land and Water, a group within CSIRO that has also had a very high standard of data management over the years. They have a metadata catalog that they use internally through the working life of their projects. They have a file server set up with very strict naming conventions on the folders and files, and as they go, they encourage their researchers to write metadata, just ANZLIC standard metadata, in their system. What we've done for this group, and for the Marine and Atmospheric one, is enable the ability to basically upload the XML; at the moment that's ANZLIC or Marine Community Profile metadata. When that's done (this is just an example of one that's come through), it creates the DAP record; they can move their files in there and make it public, or just share it with individual people. We have different restriction levels we can apply to different collections: we can restrict access to the files but leave the metadata public, or we can restrict access even to the metadata. So there would be some collections in the system that I can't find even as an administrator, because I'm restricted from accessing them. Now, one thing, if I just go back to this previous slide: you'll see that this one gets a DOI.
We've got some policies around what gets a DOI and what does not. For instance, this one has restricted files; we can see a little padlock on the files, and when we click on it we're told we'll have to log on to access that. For whatever reason, maybe commercial sensitivity or the licences on the source data, they aren't able to share this data openly, so what they get is a handle instead. Now, we're currently discussing amongst the team whether to relax this so that such metadata records can get a DOI. There's a little bit of debate going on there, but certainly a lot of researchers would be very keen to use DOIs as a preference for anything they cite. And then this is another example, a licensed dataset. I didn't want to show one that I can't show to anybody because it's restricted, so this is just one that you can get through Geoscience Australia. We've got our own copy in there, available to CSIRO staff so that they don't have to go and download it again, and when we do that, we don't mint a DOI or a handle; it just uses an internal ID, and the collection will only show up to CSIRO staff who log on. Then we've got version control. This is an example of a software collection that's been through a few different versions. Each time a new version gets created, if it's a minor update, say just fixing a typo, then the DOI is maintained: you'll get several versions, but you won't get a new DOI. But in the instance where the data changes, so that this is an actual new version of the software, they get a new DOI. You can see there are also subsequent versions with new contributors to the data, so when those versions come through, if they've changed something significant about the attribution statements, the people, or the title of the collection, they get a new version and a new DOI for each one. So, what's in the future? Oh yeah, one thing I have completely neglected
to talk about is the development we're doing of a data management plan tool. If we think back to that workflow where we had the different categories of storage: while researchers are working on their data, before publication, they're already collecting information about the files, and about what happens to those files as they go. Combined with the data management plan tool, where researchers describe what they're planning to do and how they're planning to store things, we would like to get that metadata feeding into DAP collections, so that things are already written by the time they actually go to create the DAP record. People maintain that metadata as they go, so there are much lower transaction costs, as one of my colleagues says, when they're creating the metadata, rather than remembering everything right at the end. Another thing we're working on: a number of research groups have been very keen on us setting up features that would support linked data. So we've got semantic web features coming up, like a persistent URL service, which is a generally useful thing. What we find is that a lot of researchers want DOIs, but they may not really understand the policies around using and maintaining them; they might think of a DOI as just a persistent URL that points at whatever you want it to point at, which is maybe more like what a PURL would be. So I'm seeing a need, certainly for linked data, for persistent URLs where you can define the policies around them: an institution-wide persistent URL service that the Data Access Portal could use, but that any research group could also use for tracking any object. It could be a person, a data file, or a piece of software. This would lead towards things like provenance tracking, where you can
actually identify each part of the process of the research workflow and record it, and that should improve the transparency and reproducibility of the research itself. We're also looking at vocabulary services, because, if for nothing else, that will improve the way we enter keywords into our collections, but there are numerous applications for vocabularies. The main thing about these semantic web features is that we're not saying we'll implement them all in the Data Access Portal, because there are parts of CSIRO that would like to use these services too; something like the persistent URL service may be its own entity. I think what researchers are really looking for is the persistence and reliability, that it will still be there in 10 or 15 years' time. They might be working on short-term projects and may not be able to guarantee that sort of support themselves, but they're hoping the organization can support it and have that commitment to it. The other thing we're working on, with the web services interface, is programmatic creation of collections, particularly for data collection that is fairly routine. We have, say, some geologists who are taking a lot of samples and scanning them, and they would like to feed those scans, and information about those scans, straight into a data collection that they can then reuse later. Rather than manually creating a collection each time, which is really infeasible, they're looking for a program that can do it for them, and we're in a testing phase of that right now. I can't get through this without acknowledging the support of the Australian National Data Service, who funded quite a lot of the development of the DAP, and noting that I took one of those slides from Ian Corner and Renate Tihouse, from their presentation at eResearch last year. There is a cast of thousands that have worked on this over the
years, and this is by no means all of them; these are just a few people I've been working with lately. So thank you to them, and thank you for listening in. Do we have any questions?

Yes, Dom, we do. There are a couple of questions here. What do you use for project IDs? Is there a national service?

No, we're just using internal identifiers for projects. They're codes that probably wouldn't make much sense outside of CSIRO, which is why I didn't go into a huge amount of detail about them. Certainly, for projects that do have that sort of national scope, there's no reason why something like that couldn't be set up, but at the moment they're really codes specific to how SAP works.

There's also a comment from the same person, who says they're pushing ORCID to supply these. I'm not quite sure that would work, as ORCID is pretty much just a personal identifier, as far as I remember.

Yeah, well, ORCID is definitely on the list of things we're trying to implement. We want to register ORCIDs, and because it's an opt-in thing we can't just say, hey, everyone in CSIRO, here's your ORCID; they have to actually volunteer for that. But we definitely want to link ORCIDs, for both CSIRO researchers and external collaborators listed in records; those are among the identifiers I think would be very useful. And when I was talking about persistent identifier services, that's to fill in gaps, because there are all types of objects that could benefit from this that might not be covered by an internationally recognized service.

The next question is: what is the policy you mentioned around DOIs? Is it publicly available? I'm not quite sure whether they're asking if the policy is publicly available or whether the data is publicly available.
Okay, so on the policy for DOIs, I'm probably going to have to pass the buck, but I believe we can share it. I don't know if I could point to one website that hosts it at the moment, so if I can get you to take a name down for that, I'll be happy to get back to you.

She's come back to say it's the policies that she wants to have shared.

Yeah, I believe that's actually part of our... if Sue Cook is in here, you could pass control over to her. I'll give her the voice; I think she's got some comments to make about this.

She has made a comment that says the CSIRO DOI business rules are on the ANDS website.

Oh, there you go. Okay.

Others have asked for those too, so they are on the ANDS website. Okay, another question: are any of the OPeNDAP components available for other research institutions to implement or use?

Yeah, I don't think we have a custom implementation of OPeNDAP. That's actually managed by a data services team in CSIRO linked to the high-performance computing team, and I'm going to guess that none of them are in this webinar, but as I understand it, it's open software that they've implemented, so I don't think there's anything special about what we've done. Certainly I know NCI have some THREDDS servers going, and I'm sure the Bureau of Meteorology do too, so I don't think there's anything particularly novel there. Maybe there are a few implementation details, but I can definitely put you in touch with Gareth Williams, who would be more than happy to talk about that.

Gerry Ryder, who's from ANDS, has made a comment: with regards to PURLs for projects, ANDS has a service for ARC and NHMRC grants.

Okay, and I think for the moment that's all of the questions that have come in. Thank you very much.

All right, thanks, Dom.