Thank you for the introduction. Good afternoon. Today I'm here to talk to you about the library's recently completed Mining Marsden project. This project set out to create the Marsden Online Archive, which was officially launched on the 6th of November, so fairly recently. This presentation will cover the different pieces of the Mining Marsden puzzle: in order to create the Marsden Online Archive we had to complete four very important steps. [inaudible passage introducing the missionary Samuel Marsden]

Here you can see him delivering the first Christian sermon back on Christmas Day 1814. But Marsden brought more than the Christian message to New Zealand. He also taught Māori the art of agriculture, he imported sheep to New Zealand, and he was one of the first to document the Māori language. Dr Hocken, the founder of the Hocken Collections at the University of Otago Library, came across Marsden's letters and journals when he visited the offices of the Church Missionary Society in London. Realising the importance of these historical documents and their significance for New Zealanders, he managed to secure the material, and it formed part of the founding collections at the Hocken.

December this year will mark 200 years since Marsden's first sermon, and the Marsden Online Archive was created to coincide with these bicentenary celebrations. The archive makes Marsden's letters and journals, as well as those of other missionaries, available to the public. In order to have the collection online by December, we had to limit the initial scope, so we only included documents from the period 1808 to 1823, which coincides with the arrival of Henry Williams. This totalled 599 letters and journals.
The first iteration is just a pilot, but it has created a model that we can use to add additional material later.

This is retired associate professor Gordon Parsonson. He has been a key figure in this project and is one of the Hocken's oldest researchers and donors; he has been visiting the Hocken since 1947. Gordon first read Marsden's papers when he was accidentally locked in the stacks in the early 1950s. We have a webpage on the site dedicated to Gordon, with a video of him talking about how he first found the material. It's a gorgeous wee video and I encourage you to watch it. Over the years, Gordon has transcribed almost all of the documents, not just the Marsden papers but all of the Church Missionary Society records that are held at the Hocken. Gordon started this task by hand and has more recently been using a laptop. He has very generously given us copies of these transcripts to use in the Marsden Online Archive, and he has also agreed to make these copies available to other researchers under a Creative Commons Attribution-NonCommercial-ShareAlike licence. So as long as he's attributed as the author of the transcripts, other researchers are free to use them.

Mining Marsden was a collaborative project between the Centre for Research on Colonial Culture, or CRoCC, and the University Library. This was important, as digitisation projects like this need to be researcher-led. It is because of this that one of the first questions we had to answer was: what do researchers want to be able to do with the site? I was part of the team responsible for determining the research requirements for the interface. We met with several academics and researchers at Otago, as well as researchers from other institutions and some of their postgraduate students. We needed to find out about the type of functionality that researchers would need. Some researchers found this a little difficult, so we showed them pre-existing sites as examples.
However, when we asked, 'Would you like the Marsden Online Archive to be able to do this?', the answer was typically yes. As well as meeting with researchers, we also met with representatives from several other organisations who had recently set up digital archives, including people from the National Library, Victoria University and Canterbury University. They were incredibly helpful and gave us some really good advice that we were able to follow along the way. This entire process resulted in 60 business requirements, ranging from ideas for the URL to advanced search functionality and TEI markup.

For the first iteration, we wanted to give researchers a digital space where they could replicate some of their analogue research techniques. We wanted them to be able to do what they've always done, but in a digital way. We then added value to their process with functionality that they wouldn't typically have had access to, such as keyword searching. Some of the researcher requirements were for specific filters and advanced search options. This meant that a significant amount of effort had to go into generating the metadata. Gordon did an amazing job of transcribing this material, but what he hadn't transcribed was the additional metadata at the top of the manuscripts. Information such as the ship the letter was sent on, and the date it was read at committee, needed to be captured. This is the information that researchers told us they would need to have access to and be able to search on. A spreadsheet was set up to capture all of this information; we call it the master index. It was seeded with all the data from Hākena, the Hocken's archives catalogue, and the team then set about populating the missing information. All of the documents were examined individually and the metadata recorded.
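In simplified form, the master index described here might be sketched as a small data structure like the one below. The field names are illustrative guesses at the kinds of columns mentioned in the talk, not the project's actual spreadsheet headings:

```python
from dataclasses import dataclass, asdict

@dataclass
class MasterIndexRow:
    """One row of the master index. Field names are illustrative,
    not the project's actual spreadsheet columns."""
    archival_number: str
    author: str
    date_sent: str               # date as written on the manuscript
    ship: str                    # vessel the letter travelled on
    date_read_at_committee: str
    height_mm: int               # physical size, measured during indexing
    width_mm: int

def missing_fields(row: MasterIndexRow) -> list[str]:
    """List the columns still left blank, so indexers can see what
    remains to be captured for each document."""
    return [name for name, value in asdict(row).items() if value in ("", 0)]

# A partially populated row: the catalogue supplied some values,
# and the remaining gaps are what the team filled in by hand.
row = MasterIndexRow("MS-0054/001", "Samuel Marsden", "1814-12-25",
                     "", "", 320, 200)
print(missing_fields(row))  # the columns still to be populated
```

The point of the sketch is simply that seeding from the catalogue and then filling gaps by inspection can be tracked column by column.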
We took this opportunity to measure all of the manuscripts so that users would have an idea of the physical size even when they were looking at the digitised material. Once all of this information was captured, we handed it over to a metadata and indexing developer.

A tool was created that we call ASMIC, which stands for automatic content extraction and metadata creation. All of the data was run through this tool before it was uploaded to the site. The tool takes all the files that we gathered: the transcripts, the master index and the images, and creates all of the outputs required. It produces a unique metadata combination: a METS wrapper with MODS, MIX and TEI, and it does this for each individual page as well as for the item as a whole. When the TEI is created, it automatically adds in the markup for corrected words, alternative spellings, underlining and crossing out, dates, people, ships, key terms and places. This is done across the whole letter or journal, and the markup is added to the METS file and is also available as a standalone TEI file. The tool also creates an HTML file for each page of the transcript, as well as a text file. The HTML is used to display the transcript online, and the text file is used for Solr searching and indexing. It also creates a PDF metadata file for people who aren't accustomed to reading XML, as well as a Chicago A-style citation that we display on each item page. It then puts all of this information together in a folder structure that allows items to be bulk uploaded to the site. We created this tool in-house because we couldn't find any pre-existing product that would complete all of these steps for us. Using this tool significantly reduced the amount of human intervention required to make this information discoverable. However, as you can see, in order to create the metadata we had to employ a mixture of both manual and digital approaches.

I mentioned that the TEI markup adds in alternative spelling tags.
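The alternative-spelling idea can be sketched as a hand-built master list mapping historical or phonetic spellings to their modern forms, with a search term expanded so that either form matches. The word pairs below are invented examples for illustration, not entries from the project's real list:

```python
# Historical or phonetic spellings mapped to modern forms.
# These pairs are invented placeholders, not the project's actual list.
HISTORIC_TO_MODERN = {
    "shew": "show",
    "compleat": "complete",
    "connexion": "connection",
}

# Reverse mapping, so a modern-day query also finds the historical form.
MODERN_TO_HISTORIC: dict[str, list[str]] = {}
for old, new in HISTORIC_TO_MODERN.items():
    MODERN_TO_HISTORIC.setdefault(new, []).append(old)

def expand_query(term: str) -> set[str]:
    """Return every known spelling variant of a search term."""
    t = term.lower()
    variants = {t}
    if t in HISTORIC_TO_MODERN:
        variants.add(HISTORIC_TO_MODERN[t])
    variants.update(MODERN_TO_HISTORIC.get(t, []))
    return variants

print(expand_query("show"))  # matches both the modern and historical form
```

Searching on either the modern or the historical spelling then returns the same set of documents, which is the behaviour the researchers asked for.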
Researchers told us that the archive would need to allow for alternative spellings. This is because some words have changed significantly over the past 200 years. This is particularly true of Māori words, as these were historically spelt out phonetically. This was done by creating a master list of words that were spelt differently and then manually adding in the corrected spellings. Although this took a significant amount of effort, it is extremely valuable because it means that the archive can bring back results even if the user has typed in the modern-day spellings.

Digitisation was a significant part of the project. It took five and a half months to capture and edit the 3,728 images required for the site. The digitisation team used a camera because of the fragile nature of the manuscripts. The entire process was designed around making sure that each item was moved as little as possible, as well as making sure that the images were as consistent as possible. The team did a lot of testing to make sure the setup was right. One of the main considerations was getting the lighting set up; once this was done, the positions were marked on the camera stand so that if anything was bumped or moved it could be shifted back. The team also looked at getting the exposure correct and sorting out colour cast issues. Having a specific, well-documented setup allowed for the transition between staff (we had two staff working on digitising for the project), and the process can also be replicated for other similar projects in the future if needed. The team set up a monitor with live view to help make sure the images were in frame and straight, so that they required less cropping. This made the process a lot easier. For all their testing, colour casts still created issues: colours were reflected off surrounding objects in the studio, such as blue room dividers, and even off brightly coloured clothing worn by the person operating the camera.
The room dividers were covered up, and the team decided that nothing too psychedelic would be worn during shooting. Light reflection was another issue during the process: shiny metal light fixtures in the ceiling had to be minimised, and the main room lights had to be off during shooting. As there were 3,728 images, the team batch edited as much as they could. They used Photoshop CS6 Extended for the editing process, as it could handle the raw files created by the camera. It was also useful for setting up colour profiles for batch colour corrections, as well as for tweaking shadows and highlights. Cropping and straightening of the images unfortunately had to be done manually, one image at a time. This couldn't be automated, as the letters varied widely in size, and with the journals, as the pages were turned during digitisation, the spine started to drift. Once all of the editing was completed, the team ran a bulk renaming tool so that all the files were named based on the archival number. The files were then uploaded to a file share for the development team. Needless to say, the editing part of the process took up the most time for the digitisation team.

It was then up to the development team to bring all of this information together and put it online. The Marsden Online Archive platform is made up of Fedora Commons, the back-end repository used to store and manage all of our digital material; Islandora, which brings together the different layers of the platform and is used to manage the different content types; Drupal, which provides access to the site through a front-end access layer; and Solr, which does the searching and indexing. The stack was also chosen so it could be used as a pilot for a digital asset management system. This was a fantastic opportunity to upskill our staff on these technologies, which were still relatively new to us. All of the software is open source. We took this out-of-the-box platform and customised it for our own needs.
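As a rough illustration of the Solr layer, a select request with hit highlighting switched on might look like the sketch below. The core name and the `transcript_text` field are assumptions for the example, not the archive's actual schema:

```python
from urllib.parse import urlencode

def highlight_query(term: str, core: str = "marsden") -> str:
    """Build a Solr select URL with hit highlighting enabled.
    The core name and 'transcript_text' field are assumptions,
    not the archive's actual schema."""
    params = {
        "q": f"transcript_text:{term}",
        "hl": "true",                # turn on highlighting
        "hl.fl": "transcript_text",  # field(s) to highlight
        "hl.simple.pre": "<mark>",   # wrap each hit for the front end
        "hl.simple.post": "</mark>",
        "wt": "json",
    }
    return f"/solr/{core}/select?{urlencode(params)}"

print(highlight_query("agriculture"))
```

Solr returns the matching snippets with each hit wrapped in the configured tags, which a front end can then style to show users where their search terms appear.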
Over 72 customisation and configuration changes were made to the platform, and three staff worked full-time and one part-time on development for nine months. One of the main customisations is the on-screen transcript viewer. Researchers wanted to be able to display the transcript and the manuscript side by side so they could read both simultaneously. This is one of the biggest changes that we had to make to the Islandora and Drupal code. Another example of the extended functionality that had to be created is the highlighting of search terms. This alone took a week of full-time development effort to customise; however, it was well worth the time, as it helps users identify where their search terms are appearing. We wanted researchers to have instant access to all the facsimiles that we created. There are ten file types available, including high-resolution TIFF images. Researchers can also go into a page view and download just a clipping of an image. There's also functionality to email a link to a letter, or even to a particular page of that letter, and because researchers wanted to be able to email the images from the site, we have enabled this too. When we talked to students, they wanted a citation that they could use in their essays and reports. For the Marsden Online Archive, we have generated a Chicago A-style citation, as this is the style used by the History Department.

So where to from here for the site? There are still plenty of possibilities in terms of functionality, and because this is just a first iteration, there is still some work to do. One of the things that we've already identified is account functionality. This will mean that users can log in, save letters to a favourites list, add notes and tags, and then share their commentary with other users. We've also identified that users may wish to bulk download files from multiple letters or journals.
This is particularly true of users who wish to run their own algorithms on the text files. Beyond 2015, we've discussed functionality such as crowdsourcing transcriptions. We were lucky enough to have the transcripts for this project; however, for future projects this may not be the case. There's also the potential for geolocation tagging and spatial storytelling: where possible, we've already added the geocoordinates to the metadata.

So here's the web address. Most of you probably got a business card as I popped round earlier. I encourage you all to have a look at the site. We are incredibly proud of what we've achieved. Does anyone have any questions?

Q: Did you say that the original transcripts were done by hand? So they weren't done into a computer, is that right?

A: He started them by hand, but then he went back and typed them up as well.

Q: So you didn't have to worry about those?

A: No.

Q: That's an excellent project. I wonder, though, when I look at it: you've built it around the original manuscript, and structured it around the transcriptions with links to the images. But wouldn't it be far more useful for people coming to grips with the intellectual content if you focused it on the transcript rather than on the original? If you're looking at the contentious question of how much Māori Marsden actually mastered and understood, and you type it into a keyword search, you get beautiful keyword-in-context display results. But then you go to the letters and click again to look at the transcript, and you've got the transcript matching the original manuscript down the page.

A: Okay, thank you.

Q: I was just interested in your technology stack, and in going forward using it as a DAMS for the library. How much further development and support will you need for that? You said you had three full-time and one part-time for nine months.
So just for business as usual in the library, as a DAMS, will you need much more development and support?

A: There will be another project to create the DAMS. We've created Marsden as a pilot, and we can use the same back-end technologies, but there will obviously be an administrative interface that will have to be created. We're hoping that after that, the business-as-usual upkeep should be minimal. Then again, moving to Fedora 4 is itself going to be quite a significant project, but the team are very excited to take on those challenges, so it should be okay.

Q: Is that open source as well? The tool that you created?

A: It's not, but it could be.

Q: I'd be really interested in it.

A: We're happy to share it, we really are. It will go up on GitHub, I'm sure, but we're just not quite there yet.

Q: I understand. The other question I have is around whether there is a data view of what you've created. I guess the TEI is kind of like that, so you're exposing the XML markup of the document, is that right? There are pages, but can the public get at the data view, or do they just get the beautifully marked-up version?

A: You have all your downloads on the item page. We have also enabled, sorry, the acronym escapes me, being harvested by Digital New Zealand, so all that information can be extracted, and we also have clean text files, which James talked about earlier, that you can download directly from the site. But as yet we haven't enabled that interaction with the back end, though we do hope to.

Q: It's a really awesome project. Along those lines as well: you could push all of those XML files up into a GitHub repository, so that people can bulk download the whole thing.

A: We can definitely do that.
Q: To follow on from probably both John's and Brian's questions: if this is going to be a DAMS, the big thing that seems to be missing, and I don't actually see this in any system that's out there, though I haven't really looked, is an ability to display a hierarchy in a sensible way. You've got a collection here of letters, journals, bits and pieces of other papers, but there's no hierarchy where you can display letters, and this as a leaf of a letter, and these letters in a chronological sequence, or letters to someone, or letters from someone. When you're putting an archive online, I'm assuming your archive is arranged kind of like that, or maybe it's just a big pile, I don't know. Have you had any thoughts about how you can address that hierarchy and display the context and relationships of the items?

A: Yes. Based on the archival structure, there are six folders, I guess, that these items relate to, and we have left that out for now. But we're hoping that with Fedora 4 it will be captured within Fedora; it's certainly a lot better at capturing that than Fedora 3.7 is. So we do have that information captured, it's just not enabled on the site in any way. You can certainly search by all those things: you can search by letters or journals, you can search by the ship and the authors and everything, so you can drill down that way, but you can't browse by that kind of hierarchy yet.

Q: Congratulations, it's a great project. I'm just wondering: with these kinds of one-name archives, the expectation is that other Marsden content in other institutions, or even in private hands, might be able to be discovered there. In New Zealand we've got those sorts of challenges with James K. Baxter material, or Frank Sargeson, or all sorts of named figures who are represented in lots of different institutions. Is that something that you see maybe happening longer term?
A: Most definitely. It was always our intention to reach out and include information where possible, or even link to it, but we just ran out of time. Beyond Stage 1, I'm not sure.

Q: And just on that topic, I think it's a useful thing for us to think about at an NDF, really, because the National Library has just released a digitisation strategy built around collect, connect and co-create. That's the 'connect', but I don't think we're doing that very well yet at the moment. We're creating things individually as institutions and we're not connecting them up very well, so I think that's a good thing for us all to think about.

A: I agree.

Chair: Just the lady at the back.

Q: My question is based on an assumption, and I'm not sure it's correct, but Fedora is not a long-term preservation system, is that right?

A: I guess it depends how you use it.

Q: I guess basically my question is: do you have a long-term preservation plan and technology in place for it? And as an aside to that, have you considered how you'll meet your legal deposit requirements with the project?

A: Gosh, hard questions for a business analyst. I don't know, to be honest. I do know they are storing the preservation-quality images within Fedora, and that is their intention; I'm not sure of the long-term implications. But I can give you contact details for someone who will be able to answer that question more thoroughly.

Q: The connection thing is a really good thing, but I noticed you've chosen a non-commercial licence; was there a reasoning behind that? It kind of limits the connections between organisations.

A: Yeah, I guess it's just to stop people making money off it. We're happy for people to use it, and we're happy for people to share it; it's just 'don't sell it on'. I guess that's what it comes down to.

Q: Cool, that's it. Thank you very much.

A: Thank you.