Welcome to everybody attending. My name is Keith Russell. I'm the host for today. Thank you to Suzanne Sabine and Natasha for co-hosting the webinar. This webinar is part of a series co-sponsored by ANDS and CAUL on the theme of research data information integration. So from our perspective, research is changing globally and one change is the greater focus on research data as a research output. So in Australia, the Australian National Data Service, ANDS, collaborates with the research sector to promote better management of research data and to encourage a change in research culture: to recognize research data as a first-class research output, to better manage, describe and publish research data, and to enable more future reuse of data, for example through efficient discovery of research data. So this webinar is part of a larger series of webinars and face-to-face activities and also a virtual exchange around research data information integration, and we've set up a web page listing those various activities. There you can also find recordings of previous webinars in this series. You can also find links off to other bits of information and the virtual community space that's been set up for this purpose. So the work around research data information integration we've set up according to the phases in the research data management life cycle. This life cycle approach allows us to look at the business activities and the underlying IT systems that support these activities as a coherent set of purposeful activities and systems to achieve organizational outcomes and best practices. So in this webinar series, we want to examine different systems that are used by different institutions at different stages in the life cycle to manage research data. We've heard a few already, and today we will be hearing about the element of archiving and preserving research data. 
So in previous webinars, we've been looking at data management planning, ethics for research data, storage allocation and publishing research data. But as we were going around this life cycle circle, we noticed that we were also really interested in examples of where institutions or organizations had been looking at archiving and preserving research data, the systems in use for that purpose, and also where those systems might be connected in with other systems. We are very grateful to Jenny and Chris for being available to talk about their experiences with this. I'll introduce them in a second. But before we get there, I just wanted to acknowledge first of all our co-sponsor, the Council of Australian University Librarians, which has helped us with this webinar series and has been advertising this too. And of course, I would like to acknowledge and thank the National Research Infrastructure for Australia, the NCRIS program, that funds and makes this all possible. I would like to introduce our two panel speakers for today: Jenny Mitcham, who is a Digital Archivist at the University of York, and Chris Awre, Head of Information Services at the University of Hull. Both speakers have been working on a project called Filling the Digital Preservation Gap. This was a Jisc-funded project in the UK, which involved investigating using Archivematica as a tool for preserving research data at their respective universities. I would like to welcome Jenny and Chris, and thank you for making your time available at this late hour for you to share your experiences with us. So I'd like to hand over now to Jenny. So I'll get started. Keith's already introduced us, so thanks for that, Keith. I was really excited when I was putting this presentation together that I could actually put two dates on the slide. So it's actually the 7th of November here, although it's already the 8th for you, but we're still in Monday. 
So yeah, we're going to talk a little bit about our Filling the Digital Preservation Gap project, which is a project we've been working on over the last year and a half now. So I'm going to start, and then I'll hand over to Chris, and we'll keep on passing it backwards and forwards until we get to the end. So we'll start off with an introduction, a bit about who we are and why we're doing this. So just to give you an idea of what the Universities of York and Hull look like, this is just a quick overview. We're both working at universities based in the north of England, and at both universities there's a fair bit of research activity going on, and obviously along with that we get a whole lot of research data that we need to manage in one way or another. Why do we need digital preservation for research data? Well, this is a question we ask each other a lot. We feel that we can't ignore digital preservation, though it's tempting to do so because it's actually quite difficult. But the particular reason why we've really been looking at this over the last couple of years is because of funder mandates, and in particular in the UK, the one that we talk about an awful lot is the EPSRC, which is the Engineering and Physical Sciences Research Council, and they've set these moving targets around data retention so that you have to keep your research data for ten years from the date that the data was last accessed. That, I think, means that we really need to take this stuff seriously, because if a data set is accessed nine years into that time scale, then it restarts the clock and we have to keep it for another ten years, and obviously if it's accessed again and again, this does mean that we might need to keep this stuff for quite a long time. We often get quite obsessed by the EPSRC mandate in the UK, but there are also other funder mandates that we need to look at and think about, a lot of which also talk about long-term retention of data. 
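The "clock restart" in that retention rule is easy to reason about in code. Here is a minimal sketch of the calculation (the function names are our own illustration, not part of any policy tool):

```python
from datetime import date

def retention_end(last_accessed: date, years: int = 10) -> date:
    """Earliest date the data may be disposed of, counting
    `years` (EPSRC-style: ten) from the date of last access."""
    try:
        return last_accessed.replace(year=last_accessed.year + years)
    except ValueError:
        # last_accessed was 29 February; fall back to 28 February
        return last_accessed.replace(year=last_accessed.year + years, day=28)

def updated_retention_end(current_end: date, new_access: date) -> date:
    """Each access restarts the clock: keep the later end date."""
    return max(current_end, retention_end(new_access))
```

For example, a dataset last accessed on 7 November 2016 must be kept until at least 7 November 2026, and an access nine years into that window pushes the end date out another ten years from that access.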
So the NERC, the STFC, the Wellcome Trust, they all kind of say that things might need to be kept for longer than ten years. And in my mind, once you have to keep something for longer than ten years, then digital preservation really needs to become a part of that in order to ensure that data is usable and reusable over time. This feeling that we had is also backed up by a survey that we carried out in York where we actually talked to our researchers. This was a couple of years ago. We asked them lots of questions about data management, what they did with their data, what they intended to do with it at the end of the project. And one of the questions at the end that we asked was: which data management issues have you come across in your research over the last five years? And one of the options under that question was an inability to read files in old software formats, on old media, or because of expired software licenses. And I was actually quite surprised to see that pretty much a quarter of the researchers who answered the question stated that that was a problem for them already. So that kind of shows that digital preservation is quite necessary and will become more necessary as time goes on. So we were quite excited to see a funding call from Jisc. It was their Research Data Spring initiative. I think actually Chris gets the credit for this because it was Chris that initially put an idea into this funding call. So Jisc were looking for ideas for technical tools and solutions to help with any aspect of research data management. Chris put an idea forward about using Archivematica to help preserve research data. I noticed his idea and I thought, well, that's a good idea, I'd like to jump on the back of that. So I offered to collaborate on this project with him. We then had to go through a process with Jisc of kind of pitching our idea and bidding for funding as part of this initiative. And luckily we were successful. So the ultimate aim of our project was this. 
It was to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for research data management. This is just a picture of a very quick poster that I drew up to try and advertise our idea at one of our sandpit events. I don't think Jisc gave us funding purely on the basis of my poster. I hope not anyway. But yeah, it just gives an idea of what we were thinking at that point. So the project ended up being in three phases. For each phase we had to bid for further funding in a kind of Dragons' Den-style scenario. I don't know if you have Dragons' Den over there, but it's quite daunting bidding or pitching to a panel of judges and finding out whether you're going to get funded or not. So the phases worked as follows. In phase one, we had three months and we were really focusing on just exploring our ideas. It was a bit of a feasibility study: testing Archivematica, researching Archivematica and thinking about how we might use it. Then we moved on to phase two. We were lucky enough to get funding for that. And the focus of phase two was on development. We wanted to make Archivematica better, specifically better for research data. And we also wanted to plan our own implementations. We had four months to do that. Then in phase three, which we were again very lucky to receive funding for, we had six months this time, a luxury of six months, to do some implementation work. So this is where all the talking had to actually turn into proper action. So in phase three, we were setting up proofs of concept of Archivematica for research data at York and Hull. And we did a bit more investigation into what I'll call the file format problem. I'm going to talk about that a bit more later on. So that's the rough outline of the project. And this is the team that was working on the project. So Chris and Richard and Simon at Hull. 
And there's myself and Julie from the University of York, who were also part of the project team. So now I'm going to hand over to Chris, who's going to talk about our work in phase one of the project. So what were we trying to achieve? Phase one was initially going to be a feasibility study, looking at what research data looks like and how Archivematica was going to be able to allow us to preserve that data. We were also interested in understanding how Archivematica might integrate with other systems that we were using for research data management. We never anticipated that Archivematica could do everything for us. It was more a case of how it could add to our overall RDM workflow. We also wanted to find out a bit more about what our institutional requirements were for digital preservation of research data and whether or not Archivematica would meet those requirements. And we kept going into this with an open mind, to the extent of being open to whether Archivematica could meet our requirements or whether it would fall short in meeting those requirements. So just quickly to say that that feasibility study was written up and published as a report, which goes into a bit more detail about the investigations that we carried out within that first phase. So to start off with, what is Archivematica? For those of you who are not familiar with it, it is free and open source digital preservation software, available under the GPL license on the slide. And it's intended to maintain standards-based long-term access to digital objects. It uses the OAIS functional model as a way of helping to process those objects, to make sure that you know what you're putting into the system, you know how it's been processed, and you know what you're getting out of the system at the other end. 
And essentially it does this by carrying out a number of tasks, what I call preservation tasks or preservation actions, through a series of microservices. So for example, carrying out format normalization, preserving the originals, enabling emulation and migration further down the line by capturing relevant information around the object. So it's not just being stored in isolation, it's being stored with relevant information associated with it. It does this, as I mentioned, by bundling together a series of these microservices. Now an awful lot of these microservices are actually available independently as web services and could actually be tied into a number of different systems. And there have been, in our experience, people looking to see how they could embed some of these microservices as part of the repository workflow. We tried ourselves a few years ago to integrate the DROID and PRONOM services directly into our repository ingest workflow. But ultimately, given the number of services that are available and the number of actions you might like to carry out on these files, it actually felt appropriate to think, well, is there a way in which you could take those and combine them as a separate service, rather than try to maintain all of these things within the repository itself? And that is what Archivematica offers, and it therefore gives you the flexibility to develop pipelines consisting of a series of these preservation microservices, but also to develop different pipelines carrying out different microservices and different actions depending on the type of material that you're looking to preserve. 
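The idea of chaining microservices into a configurable pipeline can be sketched in a few lines. This is a hypothetical illustration of the pattern only, not Archivematica's actual implementation: the service functions here are naive stand-ins (a real pipeline would call tools like DROID or Siegfried for identification), and all names are our own.

```python
import hashlib
from pathlib import Path

# Stand-ins for preservation microservices: each takes a file path
# and returns a dict of metadata to record alongside the object.
def verify_checksum(path: Path) -> dict:
    return {"sha256": hashlib.sha256(path.read_bytes()).hexdigest()}

def identify_format(path: Path) -> dict:
    # A real microservice would do signature-based identification;
    # this illustrative stand-in just records the file extension.
    return {"format": path.suffix.lower() or "unknown"}

def run_pipeline(path: Path, services) -> dict:
    """Run each microservice in order, accumulating the metadata
    that travels with the object (cf. PREMIS events)."""
    record = {"file": str(path)}
    for service in services:
        record.update(service(path))
    return record

# The order and selection of services is what makes a "pipeline";
# different material types could get different lists.
pipeline = [verify_checksum, identify_format]
```

The point of the pattern is that swapping or reordering services means editing the list, not the repository code, which is the flexibility described above.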
So, for example, in working with research data we might actually want to treat it in a particular way, whereas if we were dealing with traditional born-digital archive material that University Archives might receive from, say, an external depositor, we might like to carry out a series of different preservation microservices because we need to hold or manage the content in a slightly different way. Ultimately, what Archivematica is designed to produce is a high-quality, standards-compliant archival information package, referencing the OAIS there again, which it generates as a BagIt bag, and within that bag it holds packaged information within a METS container and also makes good strong use of the PREMIS data dictionary to capture that associated information around the content that, as I mentioned, these microservices generate. I mentioned Archivematica is open source and free software. It is software that was generated originally by a company called Artefactual. The logo is in the center there. But their model is to make the software openly available to anyone who wishes to make use of it, on an open source basis under that license, so if people wish to make their own adaptations or add to it then they're free and able to do so. Having said that, what they've also sought to encourage, and I think what most of the users like to see about Archivematica, is that everyone who contributes towards its development feeds back their code, so that the core of Archivematica continues to develop. At Hull we certainly started looking at Archivematica originally back in 2011, and at that time you could see the bare bones of what it has become as a system, but it was still a little raw around the edges and rather simple in the way that it was presented and what it did. 
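To make the packaging concrete: a BagIt bag is just a directory layout defined by the BagIt specification, with a bag declaration, a data/ payload directory, and a checksum manifest. The sketch below lays out a minimal bag with the standard library; real Archivematica AIPs are much richer (the METS and PREMIS metadata live inside the bag), so treat this as an illustration of the container format only.

```python
import hashlib
from pathlib import Path

def make_minimal_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Lay out a minimal BagIt bag: bag declaration, data/ payload
    directory, and a SHA-256 manifest, per the BagIt spec."""
    data = bag_dir / "data"
    data.mkdir(parents=True)
    # Bag declaration: version and tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    # Write each payload file and record its fixity in the manifest.
    manifest_lines = []
    for name, content in payload.items():
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
```

The manifest is what lets a future custodian verify that nothing in the payload has changed since the bag was created.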
When we came to look at it for this project, and indeed the reason why we chose to look at it for this project, was because our experience looking at it again towards the end of 2014 was that it was a lot more mature and was capable of providing us with the preservation services that we wanted to have. I've listed some of the other institutions which have taken an active view and interest in Archivematica and have also contributed towards its ongoing development over the last few years. There are many others as well. These are some of the ones that we've encountered and worked with as part of our interaction with the Archivematica community. The question then arises of why we would recommend Archivematica for research data management. There are a number of reasons here. I'll mention each of them briefly. It is flexible and can be configured in the way that I mentioned. It's a flexible pipeline that allows you to combine different services in the way you need to. It allows many of those tasks to then be carried out in an automated fashion, because you can chain together these services and then set them going automatically. It can be used alongside other existing systems as part of that wider workflow for research data. So it doesn't see itself as a complete solution solely in itself. It offers the ability to carry out digital preservation actions with limited resources. It's free open source software. It doesn't require huge amounts of effort to get it going. And it's not one of the commercial solutions, I suppose. It is an evolving solution. It doesn't pretend that it's complete, necessarily, because we don't think that digital preservation is a solved problem at the moment. But it is being continually driven and enhanced by members of that community in order to improve what the tool can do. 
And it ultimately enables us to say that we have some way of knowing that we will be able to maintain access to this research data, particularly where that research data is in a slightly unusual file format, over the time periods that the funders are asking us to manage it. Having said that, we can't pretend that it's perfect. It isn't a magic bullet. It does require work in order to get it to do what you need it to do. We can't guarantee that the data will be readable in the future. All we are doing is capturing relevant information about the data, the file format, and the technical information that may be appropriate around the content, in the same place, so that when we come to want to access it later on in the future, we have the relevant information to guide us on how to do that. We can't guarantee that the tools will be available at that time either. We're not doing software preservation here at the same time, although that is another area that may need to be considered. It is as good as current digital preservation practice, and I think its development in the last five years has mirrored the ongoing and developing understanding of what it means to carry out digital preservation and make it a regular practice. I mentioned that it doesn't require huge resources and that it's fairly straightforward to get up and going. In principle, yes. It depends on how much configuration you want to do to it. Inevitably, as with lots of other systems, there are issues in getting it installed in quite the way you need it. But if you work through the issues, then they can be sorted out. The graphical interface isn't that intuitive. It's designed primarily to focus on the back-end services, but that has improved over time, and no doubt hopefully it will continue to improve. 
I don't think anyone is pretending that if you run these services you are not going to need staff who understand what those services, those preservation microservices, are doing, so that you understand what you're getting out at the end of it and can effectively manage what is in the BagIt bag that the tool produces at the end. You could use Archivematica in a number of different ways. You could host it in-house and link it to an existing repository. You can use it in-house and just use it as a standalone system, because you can actually associate a store alongside it if you simply want to push data through it and then put it somewhere to store it rather than put it in a repository. There are a number of hosted instances of Archivematica now. For example, DuraSpace is combining Archivematica with DuraCloud storage so that you can push content into that storage through Archivematica. And another example there is with the company Arkivum, which is doing a similar thing. And we did do some investigations that uncovered the fact that associating Archivematica with the storage, or near the storage, is the most pragmatic way of operating it. So it depends to some extent on where you're planning to store your research data as to where it's most logical to locate and then use Archivematica. Okay, so I'm going to talk a bit about our Phase 2 work. So as Chris said, in Phase 1 we did decide that Archivematica was worth exploring further for use with research data. One of the things we did in Phase 1 was to actually look at Archivematica against our own digital preservation requirements. And we found that it fell short in several areas. It had potential, but there were certainly areas where it could improve. So what we wanted to do in Phase 2 was to try and make it better. We wanted to work with Artefactual Systems. Chris has already kind of introduced the model that they have around different organizations sponsoring different developments. 
So we wanted to work with them to make Archivematica more suitable for managing and preserving research data. We did this work between July last year and January this year. We worked in six different areas, which I'll talk you through. We were working with Artefactual, who are based in Vancouver, so we had these weekly Google Hangouts to catch up on progress, and we had to account for the time difference in all of those Hangouts, though it wasn't quite such a big time difference as it is with Australia, so we managed. All the developments that I'm going to talk through now will be available in Archivematica soon. I think one of them has actually already been released in 1.5, but the rest will be released in 1.6 or shortly afterwards. So the first development area was designed to solve a problem that we'd highlighted in our Phase 1 report, which was that research data needs to be kept for reasons of compliance, so that we can reproduce results and such like, but we don't actually know if anyone will ever want it, and it might be huge, and we saw this as a bit of an issue. So certainly at York what we didn't want to do was create a DIP, or dissemination information package, each time we archive a dataset, because essentially that dissemination package might be pretty much identical to the archival information package, so it would basically involve storing the same thing twice, and it might never be used. So we wanted to implement something in Archivematica which meant that we could hold back on creating that dissemination package until such a point as it was actually requested. So if another researcher requests the data, then yes, we're happy to produce the DIP and make it available through the repository, but we didn't want to do that until we knew that there was interest in that dataset. So we did a bit of work to enable that DIP to be generated on request rather than as part of the initial ingest. So that was deliverable one. 
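The shape of that on-request behaviour is a simple lazy-generation pattern: store only the AIP at ingest, and derive (and keep) the DIP the first time someone asks for it. This sketch is purely illustrative, with hypothetical function names, and stands in no relation to Archivematica's actual API; the "DIP" here is just a copy of the payload, where a real one would also contain access derivatives.

```python
import shutil
from pathlib import Path

def generate_dip_from_aip(aip_dir: Path, dip_dir: Path) -> Path:
    """Derive a dissemination copy from the stored AIP. Here we
    simply copy the payload; a real DIP would also create access
    derivatives (e.g. normalised formats)."""
    shutil.copytree(aip_dir / "data", dip_dir)
    return dip_dir

def get_dip(aip_dir: Path, dip_cache: Path, dataset_id: str) -> Path:
    """Return the DIP for a dataset, creating it only on first
    request; repeat requests reuse the cached copy."""
    dip_dir = dip_cache / dataset_id
    if not dip_dir.exists():          # no DIP yet: build it now
        generate_dip_from_aip(aip_dir, dip_dir)
    return dip_dir
```

Until the first request arrives, the only storage cost is the AIP itself, which is the saving described above.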
The second piece of work we looked at was around the problem of wanting to pull that DIP, when we created it, into the repository. At York and Hull we both have repositories that are based on Fedora and Hydra. Now Archivematica has got connectors, or integrations, built in for certain types of access systems, such as DSpace or CONTENTdm or AtoM, but it hasn't got a connector to Fedora. What we didn't want to do as part of this project was try to build some kind of direct connection with Fedora or Hydra, the reason being that Fedora and Hydra are so flexible in themselves that we would do something that would meet our needs but wouldn't meet other Fedora users' needs. So what we wanted to do was to work in a more open way and just do some development which enabled any repository system to pick up a DIP and work with it, or made it easier for that to happen. So the solution we came up with here was a library to help with parsing and creating METS files, and there's a link there to GitHub where some of that code currently sits. Deliverable three: something that we really highlighted in our phase one report and our requirements work was that Archivematica isn't very good at reporting. For reasons of being able to talk to our senior managers and talk to our funders and talk about compliance, we really wanted to be in a situation where we could report on what we have within Archivematica. So just being able to answer basic questions about the number of data sets we've preserved, the number of files in storage, what formats we've got, what date we ingested them. And Archivematica really is very much a back-end system. It's not designed to do reporting or produce fancy graphics to show what you've got. So what we did as part of this deliverable was enable Archivematica to kind of spit out its data, or the relevant bits of its data, about the AIPs within it, so that third-party reporting systems could pick those up and work with them. 
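To give a flavour of what "parsing METS" involves: METS is XML in the http://www.loc.gov/METS/ namespace, and a repository wanting to pick up a DIP mostly needs to walk the fileSec to find where the content lives. The toy fragment below is our own, far simpler than anything Archivematica actually produces, and the parsing uses only the standard library rather than the funded METS library itself.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

def list_files(mets_xml: str) -> list[str]:
    """Return the file locations recorded in a METS fileSec."""
    root = ET.fromstring(mets_xml)
    return [
        flocat.get("{http://www.w3.org/1999/xlink}href")
        for flocat in root.findall(".//mets:file/mets:FLocat", NS)
    ]

# A toy METS fragment (real Archivematica METS is far richer,
# with PREMIS metadata embedded alongside the fileSec).
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1">
        <mets:FLocat xlink:href="objects/results.csv"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""
```

A repository-agnostic library doing this kind of traversal is exactly what lets any Fedora (or other) system consume a DIP without a bespoke connector.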
So we're not actually building that reporting functionality into Archivematica, but we're trying to work in a more flexible way which enables institutions to use whatever system they would like to use for reporting. So that's what we did for our third deliverable. The fourth thing we were looking at was to do with the size of research data sets. We were aware, when we were investigating what research data looks like, that it might be quite large in some instances. And one of the problems that was mentioned when we talked to lots of people about Archivematica was, well, how will it deal with really big data sets? Is that going to cause problems? We wanted to try and reduce the bottlenecks, and one of the ways we thought we might be able to do that was by changing the checksum algorithm within Archivematica. So Archivematica has always had a built-in checksum creator that uses SHA-256, which is one of the more thorough checksum algorithms. It's great to have that option, but what we wanted was to enable other options within Archivematica, so that if you liked you could use a different, perhaps quicker, checksum algorithm like MD5. So that was our deliverable four: enabling the institution to configure Archivematica to use whichever checksum algorithm they wanted, and of course that will be recorded in the PREMIS metadata as well. Number five. This related to another specific problem relating to research data, and that's the file formats problem, which I'll talk about at the end of this presentation. So research data comes in lots of different file formats, many of which we can't identify. Archivematica can't identify them because the tools within Archivematica can't identify them, because they're not in PRONOM, which is the database of file formats. 
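The configurable-algorithm idea is easy to demonstrate with Python's hashlib, which already takes the algorithm name as a parameter. This is an illustration of the concept, not Archivematica's implementation, and, as discussed later, any speed difference between MD5 and SHA-256 only really shows up on large data at scale.

```python
import hashlib
import time

def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Compute a fixity checksum with a configurable algorithm,
    mirroring the idea of making the algorithm a setting."""
    return hashlib.new(algorithm, data).hexdigest()

def time_algorithms(data: bytes, algorithms=("md5", "sha256")) -> dict:
    """Rough wall-clock comparison of checksum algorithms;
    meaningful differences need large inputs."""
    timings = {}
    for algorithm in algorithms:
        start = time.perf_counter()
        checksum(data, algorithm)
        timings[algorithm] = time.perf_counter() - start
    return timings
```

Whichever algorithm is chosen, recording its name alongside the digest (as PREMIS metadata does) is what keeps the fixity value verifiable later.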
We knew that we couldn't solve this problem; it's far too big to try and solve, and it's not just an Archivematica problem, it's broader than that. But what we wanted Archivematica to do was to be able to report on which files it couldn't identify. That would then give us, or the Archivematica operator, the opportunity to take action and say, oh yes, actually in that batch we've got lots of files that aren't identified, and perhaps that could trigger some kind of action around checking out those files themselves, perhaps submitting them to PRONOM so that file signatures can be created. But more about that later. Lastly, and this really wasn't so much a development, it was more just a thing that we wanted to fund, which was some additional documentation around Archivematica, and specifically documentation aimed at developers. We realized that there was a problem in that Archivematica was quite hard to adopt; there were quite a lot of barriers that might stop institutions trying to adopt it, one of which was the fact that the documentation wasn't always that thorough or intuitive. So we just managed to fund one little piece of work under this one, which was to fund a webinar describing Archivematica's automation tools. The automation tools are a way that you can basically automate workflows through Archivematica, so it was something that we thought was really important in the research data preservation sphere, because none of us really have huge numbers of staff or resource to actually tackle this, so if we can configure things to be automated, that's all the better. 
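The unidentified-files report boils down to: scan a transfer, flag everything the identification step couldn't place, and hand that list to an operator. The sketch below is a deliberately naive stand-in: it "identifies" by file extension against a made-up allow-list, where real tools (DROID, Siegfried) match byte signatures from PRONOM. All names are illustrative.

```python
from pathlib import Path

# Naive stand-in for signature-based identification: a real tool
# inspects file contents against PRONOM signatures, not extensions.
KNOWN_EXTENSIONS = {".csv", ".txt", ".pdf", ".tif", ".xml"}

def unidentified_files(transfer_dir: Path) -> list[Path]:
    """Report every file in a transfer whose format could not be
    identified, so an operator can follow up, e.g. by submitting
    a new signature to PRONOM."""
    return sorted(
        p for p in transfer_dir.rglob("*")
        if p.is_file() and p.suffix.lower() not in KNOWN_EXTENSIONS
    )
```

The value is in the reporting, not the matching: once the unidentified files surface in a list, a human can decide whether a batch needs signature work before it disappears into storage.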
So to summarize Phase 2: what worked well was that we did enjoy working with Artefactual staff. They were always very keen and enthusiastic and patient with us. They also have a bigger picture in mind, and they really wanted to understand our use cases but also bring in the use cases of other users, to try and make sure that the route we were going down was going to be sensible and broadly useful for their user community. The work that we carried out built on work that others have done, quite specifically work sponsored by the Zuse Institute Berlin around re-ingesting AIPs, and now it's also great to see that it's being built on by others: there's more work in the pipeline from Simon Fraser University, and some work that's just been completed at the Bentley Historical Library that's building on, or using, the stuff that we've funded within Archivematica, so that's great to see. What didn't work so well? Well, I think perhaps we were quite ambitious. We couldn't decide between the six areas that we wanted to fund, so we decided we'd fund a little bit of all of them, which meant that in some cases they were only partial, so we kind of started to bite off the problem but we didn't solve it entirely. So the checksum deliverable, we did that; it was a complete thing that had a very clear start and end. But some of the things, like the AIP work or the integration with Fedora, there's so much more that we could do on them, and we just had to draw the line somewhere. However, saying that about the checksum work, it was quite hard in the end to prove its impact. Once Artefactual had released the code for us to test, we did a bit of testing and we couldn't actually see any time saved by choosing the MD5 checksum algorithm over SHA-256, which was disappointing, because obviously if you've funded a bit of work you really want to see that it's made an impact. 
Though having done further tests after that, we do believe that if we tried it on large datasets and at scale it really would have an impact; it's just that we were working with quite small files at the time. Also, solving the file identification problem is a huge task and needed more thought, so that was something we wanted to revisit in phase 3. Oh, lastly, in phase 2 the other thing we worked on was our own implementation plans. We wanted to plan how we would actually implement Archivematica ourselves at Hull and York, and this was really key for us, because as we realized, deciding to use a system is really quite easy, but deciding how to use it is much, much harder. There are a lot of decisions to be made about how to configure it, which other systems it talks to, how data flows through the system, and where humans need to dive in and interact as well. So we created separate implementation plans for Hull and York. They were different because we have different RDM systems in place, despite both having Fedora and Hydra, and we had different institutional needs and priorities. Both of those implementation plans are available in our phase 2 report. So I'm going to hand back to Chris now to talk about phase 3. So just to describe both proof-of-concept architectures: I think the basic difference, or the main difference, is where the initial data is coming from in the first place. 
So at the University of York they wanted to provide an easy way of depositing data, a way of monitoring data sets for the RDM staff, and a way of requesting access to data. The way they've done that is that they use the Pure research information system, and all researchers create a metadata record for their data within that Pure system, but the data set itself is not ingested through Pure, because it's not felt to be an appropriate system to push the data files through, partly recognizing the fact that data files come in many different shapes and sizes. Therefore a separate, simple interface was generated in order to allow deposit of the data set itself. The idea then is that once the data has been deposited, it can get pushed through Archivematica and then combined with the metadata to create repository objects once it's been fully processed, in a sense, by the tool. Pushing data through the system generates a report which allows research data management staff to monitor how things have been going, so it gives information back on what has been ingested and what information has been generated from that, and there's also then the generation of a DOI, and links to where that data has been deposited and where it can be requested from, so that they can create the full record for long-term archiving and management. One of the pieces of work associated with the work here at York was to model how to hold data sets once they've been created as a package of data. To do this they used a thing called PCDM, or the Portland Common Data Model. This is an approach to modeling digital content that originated in discussions within the Hydra community but has been picked up more broadly by others in the broader Fedora community and in other sectors as well. Jisc is also interested in using it for some of its research data activity in the UK. The model has been developed. 
We can disseminate a link which goes into this in more detail, so if people wish to reference it and look into it afterwards they're very welcome to do so. But ultimately it's about identifying the dataset and then how you associate the different pieces of information and metadata around it, so that it becomes one coherent object as a whole when it's being managed within the repository after being pushed through Archivematica. So, as the slide says (and I'll take their word for it), the RDM staff love the proof of concept, because it's enabled them to demonstrate to everyone that it is feasible to make this work, and there is enthusiasm and agreement to turn it into a production system over the coming autumn and winter months. It's also provided insight into how to do data modeling, which could be used for other types of data as well. One thing we'll mention again briefly a little later is that Jisc has now initiated a shared service for research data pilot over the next two years, and York is one of the pilot institutions; very much the work of this project has informed elements of how preservation should be delivered as part of that shared service. Going on to Hull, then: we took a slightly different view, because our origin could be a number of different places, and we also wanted to implement an architecture which allowed us to manage not just research data but many other types of data as well. That's why our university archivist was part of the team, to keep us on the straight and narrow in understanding how we could apply this to the type of digital archive files that he will receive through the university archive. To that end, we have implemented a proof of concept based on a preliminary architecture that we disseminated last year. We've implemented this, in this instance with research data, using Box.
Our university has an institutional Box subscription, where people can simply share the data that they have generated. They share the folder with a user called Archivematica, and we then built a tool that checks for anything that's been shared with Archivematica and picks it up, creates an initial bag from it, and then pushes that bag through Archivematica, which takes on board all the outputs of that and creates a separate bag, which is then used as the basis for generating objects within our Hydra repository. So a slightly different approach, but ultimately the same goal and the same end result. We wanted to allow people a number of different options: there's the idea of having multiple data files and one descriptive file; there's the option of having multiple files and a CSV file with multiple lines of description as metadata; and there's the idea of having a top-level folder with a structure built into it, and replicating that hierarchy in what comes out of Archivematica further down the line, which informs how the objects in the repository are structured as well. We've documented these as part of our phase 3 report. The next step for us at Hull, then, is to take the proof-of-concept work and turn it into our own production system over the coming year. While we have an interest in research data, our primary driver for doing that is also to develop a digital archive. In the UK we have the concept of a City of Culture, which is awarded every four years, and Hull is the UK City of Culture in 2017. As such, we know there's going to be a great deal of digital material captured as part of that, which we want to capture in the archive for posterity and legacy from that year. So I'll hand over briefly to Jenny to talk a bit more about the file format problem, which emerged more and more as we went on.
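[Editor's note: the "create an initial bag" step Chris describes can be sketched in a few lines. This is a minimal illustration only, not the project's actual tool: it builds just the two tag files a valid BagIt bag requires under RFC 8493 (`bagit.txt` and a payload manifest), whereas the real workflow also polls the Box API and hands the bag to Archivematica. All names here are hypothetical.]

```python
import hashlib
import shutil
from pathlib import Path

def make_bag(src_dir: str, bag_dir: str) -> Path:
    """Create a minimal BagIt bag from the files in src_dir.

    Sketch of the bag-creation step only; established tools such as
    the bagit-python library add richer metadata and validation.
    """
    bag = Path(bag_dir)
    data = bag / "data"          # the BagIt payload directory
    data.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for f in sorted(Path(src_dir).rglob("*")):
        if f.is_file():
            rel = f.relative_to(src_dir)
            dest = data / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, dest)
            digest = hashlib.sha256(dest.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel.as_posix()}")

    # The two files every valid bag must carry (BagIt v1.0, RFC 8493)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag
```

The manifest checksums are what make the bag useful downstream: Archivematica (or any consumer) can verify that the payload arrived intact before processing it.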
Okay, I'm just going to try and whiz through this because I'm conscious of time. I've alluded to this already, but when we were looking at the problem of how we preserve research data, we realized that it's not all about just throwing a system at the problem and saying, right, there's your digital preservation system, the problem's solved. We needed to engage a bit more with it. So we were looking at the specifics of research data and thinking about what an archive might actually do with it, and we realized that one of the main problems we were going to encounter was that systems can't really identify research data very well, and that means it's hard to preserve, because all the digital preservation theory and models are based around the fact that you need to know what you've got before you can take action. Whether that action is migration, or thinking about emulation in the future, you can't really do much unless you know what you've got. So this slide illustrates, in fact it doesn't illustrate so well what I'm trying to say, that research data software applications, and thus research data formats, are incredibly numerous, so what would be much more useful for you to see is what isn't on this graph. We're showing you the top 20 research data applications in York, and this information was gathered by running a questionnaire with researchers within York. We had 328 responses, and in total there were 260 different software applications that researchers mentioned they use to create or analyze their research data. So what's really interesting isn't the top 20 applications; it's the huge long tail of 240 other applications that just one or two people are using, and that points to a real problem around how we're going to manage the identification of research data and its future preservation.
So I did a little exercise, because we'd been collecting research data at York in our repository for about a year, and I thought it would be useful to run DROID over it to see if we could identify it. DROID, in case you've not come across it, is a little free tool from The National Archives in the UK that can be used to automatically identify file formats. It wasn't a huge sample; we were looking at 3,752 individual files, but I was quite horrified to see that DROID only gave an identification for 37% of those files, so only about a third of them really, and this was with varying degrees of accuracy. I say that because DROID can identify either by file signature, which is very accurate, by container, which is also pretty accurate, or by file extension, which isn't so accurate, because obviously lots of files can share the same file extension. About half of these identifications were by file extension, which perhaps suggests that they weren't that accurate. So what wasn't identified? This graph, I don't know if you can see it, is arranged by file extension, showing how many files of each extension weren't identified: 107 different file extensions not identified. That big column at the start of the graph is blank, so those are files with no file extension at all; I put "help!" because that's a real problem, as we don't have any clues there as to what those files are. The second column is .dat files. Dat files are another big problem, because if we tried to identify those by extension we would fail: there are lots of different types of dat file, and lots of research data applications spit out their own version of a dat file or a data file. So this is a problem, and what we wanted to do in phase 3 was do some work towards addressing it, and try and encourage the community at large to address it too.
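[Editor's note: an exercise like Jenny's can be reproduced by post-processing DROID's CSV export. The sketch below assumes the standard export columns `METHOD`, `EXT` and `PUID`; check the headers of your own DROID export before relying on these names, as profile settings can change them.]

```python
import csv
from collections import Counter
from io import StringIO

def summarise_droid(report_csv: str) -> tuple[Counter, Counter]:
    """Tally identification methods, and the extensions of files
    DROID could not identify, from a DROID CSV export.

    A row with a PUID counts as identified; the METHOD column then
    says whether that was by Signature, Container or Extension.
    """
    methods = Counter()
    unidentified_exts = Counter()
    for row in csv.DictReader(StringIO(report_csv)):
        if row.get("PUID"):
            methods[row.get("METHOD", "")] += 1
        else:
            # Group unidentified files by extension; files with no
            # extension at all are the hardest cases of all.
            unidentified_exts[row.get("EXT") or "(no extension)"] += 1
    return methods, unidentified_exts
```

Sorting `unidentified_exts.most_common()` gives exactly the kind of "long tail" chart described above, with the no-extension and `.dat` columns at the front.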
So one thing we did was work with The National Archives in the UK: we actually funded a little bit of signature development work, specifically around a few research data formats, so that we could increase the number of research data formats in PRONOM. That's just a screenshot of one of their signature releases this summer which included some of our work. We also had a go at creating our own file signatures, and this was quite a fun exercise really. We wanted to work out whether an average digital archivist could create file signatures. I considered myself to be an average digital archivist, so I thought I'd give it a go, and I wrote up everything I did and how it went in a blog post. I won't give you all the information here, but I was successful, which I was very pleased to see. I created a file signature for an OMNIC spectral data file, which was one of the files that wasn't identified in that sample, and I submitted it to TNA, and it's now in PRONOM, so tools like Archivematica should be able to identify those files now. I'm really keen that others in the community also try this, because actually it's quite good fun. So back over to Chris. Just to highlight that we've been actively disseminating the work of the project throughout, and we're very grateful to have received some external comments on the value the work appears to have had in informing people's thinking about their own way of implementing digital preservation as part of their research data management systems. I think that's still a very open debate and discussion, within the UK certainly, but it's very pleasing to see the positive impact the work has had in helping to stimulate such discussions, and one hopes it will have the same impact in your own situation.
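[Editor's note: at its core, the signature-writing exercise Jenny describes is about finding byte sequences shared by sample files of a format. The toy check below shows the idea; the PNG header is used purely as a well-known illustration, and real PRONOM signatures are richer, supporting wildcards, variable offsets and end-of-file sequences.]

```python
def matches_signature(path: str, hex_signature: str, offset: int = 0) -> bool:
    """Check whether a file contains a given byte sequence at a fixed
    offset: a toy version of a PRONOM beginning-of-file signature.
    """
    sig = bytes.fromhex(hex_signature)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(sig)) == sig

# Well-known example: the 8-byte PNG file header. A signature for
# something like an OMNIC spectral file would instead be derived by
# comparing hex dumps of several sample files from that application.
PNG_SIGNATURE = "89504e470d0a1a0a"
```

In practice a new signature is tested against sample files with TNA's signature development utility before being submitted for inclusion in PRONOM.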
Just to highlight some of the challenges the work faced, which have probably had an impact on it: we had short, focused timeframes, which in some ways did focus our minds, but they also had quite short lead-in times, which meant we probably didn't get the most benefit from them. I think we all suspect we could have done more if we'd had more time to prepare and to carry out the work, but there we are; we've been able to do pretty well in the time we had. Access to appropriate skills, mainly technical development, is a general point that impacted on this project but affects other projects as well. We had a limited budget, so we took a parsimonious approach, doing what we felt we could and what needed to be done rather than everything we would have liked to do, and there's a key practical difference in how we carry out preservation on that basis. We also had the challenge of recognizing that the term "digital archiving" tends to get interpreted differently across the preservation, research data management and IT communities, and that can take rather longer than you think to work through; trying to come up with a common understanding and definition of the term at the beginning would be a very useful starting point. And there was lots of dissemination, and balancing that with actual doing, although both were fun. What have we learned? Archivematica can be used to manage preservation of research data, and it can be embedded within similar but different institutional workflows. It's useful to get focused systems to do what they do best, rather than necessarily trying to add all of this workflow into the repository or other systems we may have.
There is a file format recognition issue for research data files, and the more people who engage with the file signature generation process, the simpler it will be to solve going forward: a real crowdsourcing opportunity there. And it has to be said that the real benefit has been working collaboratively, both within the project and more generally with the Archivematica and research data management communities, because it means we've been able to tweak and adjust what we've done as we've gone along to make it as practically useful to as many people as possible. So, a little bit more about the shared service I mentioned earlier. This is a big initiative by Jisc to develop a shared service for research data, I think picking up on the fact that lots of institutions are struggling with this and are not necessarily finding it feasible to use existing systems, and there doesn't appear to be a developing market out there for research data management systems per se. So they've broken it up into a whole set of lots for which different systems can be used, and they're looking to combine those in different combinations across the different pilot institutions to identify exemplars of research data management services as a whole, which could then be delivered on a shared basis. Archivematica is being used for that. The image here highlights the different components, and it's one you can study in more detail at a later time, but preservation, right in the middle there, which is where it needs to be, highlights that they will be looking at using Archivematica for that. They will also be looking at using Preservica, because they're looking at multiple components for each of these different lots, and very much the work we've done within this project has been informing how that gets taken forward.
So we have a website that you can look at which takes you off to the various pieces of information, plus a blog that Jenny in particular has been adding to over the duration of the project, and all the reports are housed within Figshare if you search for the project title there, for future reference. Otherwise, apologies for the slight delay; we will end there, but we're happy to take questions in the time remaining. Thank you Jenny, thank you Chris. That was really interesting and a really good overview of the whole project, and it's fascinating to have done so much in such a short time. I'm just looking to see if there are any questions. Susanna, over to you? Yes, there are questions there. The first one says: it is interesting to see what data applications researchers are using. Is this survey information available? That's a very good question, and it's one I've been asked before. It's not, but I need to make it available, because there's no reason why we can't. There's a report that we created at the University of York which is currently internal, but I really should publish it, and there is a summary of that information in our phase 1 project report, which is just shown on the screen at the moment. Within that I think we reproduce the basic stats from the survey, but that's a reminder for me to go and publish the initial report. Fantastic. There's another one. The next one says: apart from preserving data, do you also preserve or link to contextual provenance information that will help to understand and reuse data, such as the models and software that generated the data? We definitely hope to. What we do at York is collect metadata, and the researchers are allowed to give us as much or as little metadata as they like; they're also allowed to attach or include documentation within the dataset that they deposit with us.
Obviously we hope that they'll include lots of documentation, and enough documentation to enable reuse, but at the moment we're not pushing on that too strongly because we don't want to discourage them from engaging. It's quite a tricky one really, because if we create a big form where they've got to fill out all the fields and submit certain documentation, I think it might put a lot of them off, so we need to get the balance right. But I do understand that it's very important and something we'll need to work on more in the future. I think it's something we need to have more of a dialogue with researchers about, to make them aware of the implications of using software that they, or other people, may or may not have access to in the future. We're quite interested in working with a body in the UK called the Software Sustainability Institute, which also has an interest in this area. It also probably makes sense for people to archive software in a more centralized way (my personal opinion), simply because it's not necessarily unique, unless it has been written in-house, in which case maybe the institution has a responsibility for helping to manage or keep it so that data generated through it can be repurposed more easily. Just to say as well, you'll see on the screen my recent blog post, which was about the PASIG conference last week in New York, which was very nice.
There was a really interesting talk that I mentioned in the blog post about a tool called ReproZip. What ReproZip does is, as you package up your data to give to the repository, it also packages up all the dependencies, including the libraries you've used and the options you've used in your code, because it's actually really hard for the researcher to pin down what all of those dependencies are unless they've got a tool to do it for them. So I thought that looked really interesting and worth exploring further. Fantastic. There's one more question, which is: after data has been preserved for the required number of years, is there any process for deciding which datasets to keep and which not? Yes. At York, with our research data management policy, we've actually mirrored the EPSRC funder policy, which I mentioned at the start: the concept of keeping things for 10 years from the date of last access. At York we'll basically be doing that for everything, so we'll be monitoring access through the repository, and there'll come a point at which something hasn't been accessed for a long time, and that will trigger a notification to the staff to say this hasn't been accessed for 10 years, it's worth reviewing it and deciding whether to delete it or not. Obviously we're not going to reach that point for at least 10 years, so we've got a little bit of time before we really have to get something in place, but for the moment we're just thinking along those lines.
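[Editor's note: the review trigger Jenny describes, flag anything not accessed within the retention period for a human appraisal decision, can be sketched as below. The names and the simple day-count approximation of "10 years" are illustrative, not York's actual implementation.]

```python
from datetime import date, timedelta

# Approximation of the EPSRC-style "10 years from date of last access"
RETENTION = timedelta(days=10 * 365)

def due_for_review(last_access: dict[str, date], today: date) -> list[str]:
    """Return identifiers of datasets whose last access is outside the
    retention period: candidates to flag to RDM staff for an appraisal
    decision (keep or delete), not for automatic deletion.
    """
    return sorted(
        dataset
        for dataset, accessed in last_access.items()
        if today - accessed >= RETENTION
    )
```

The important design point, which the transcript also makes, is that the automation only *flags* candidates; the keep-or-bin decision stays with a person applying appraisal criteria.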
I think that's where the majority of the effort has gone: simply wanting to get hold of the data so that we can keep it for that period of time, which almost forces us into thinking through what we do with it at the end. If we do get it, then it's again part of the dialogue with the researchers about how they see that data being kept over time, and what they say is going to vary. There is some guidance: the Digital Curation Centre has a very useful document that talks about how to apply different criteria to what you may keep and what you might not, which isn't a definitive answer by any means, but it's a set of tips as to what you might like to think about to help work out what to keep and what not to keep. On that matter, there was also a really interesting poster at iPres this year, and I can't remember who it was from (it was an institution in America), but it almost won the poster prize because it was a really good poster. It was about deciding what to keep and what to bin in the context of research data, and it talked about automated processes for flagging up the datasets you might want to get rid of, using particular criteria that you could measure, so you wouldn't have to pass a human eye over all of your files or datasets; it would just highlight those that might be candidates for deletion. It was very interesting, and I'd like to know more when they get further on with that project. Thank you. That's all the questions that we have. A big thank you again to Jenny and Chris. Just a reminder to everybody that this is a webinar series, so there will be further webinars coming up; keep an eye on the ANDS news, where they will be announced, and on the website. Big thank you again, and I hope you have a good night's sleep after this, and good luck with all the follow-up.
Thank you very much. Thanks for having us.