Hi, my name is Mark Wolfe and I'm the Curator of Digital Collections at the University at Albany. I'm presenting with my colleague Greg Wiedeman, who is the University Archivist here at the University at Albany. We'll both be presenting the Mailbag project: building digital preservation tools around file systems. This presentation comes out of our work on a grant project titled "Mailbag: A Stable Package for Email in Multiple Formats," funded by the University of Illinois' Email Archives: Building Capacity and Community program. First, some project origins. A few years back we wanted to document state politics, specifically elections here in New York State. In 2016 we ran a project collecting fundraising emails from federal-level incumbent candidates. We went to the candidates' websites and signed up to receive political information. A student created a Gmail account and used it to subscribe to email blasts of all sorts relating to the campaigns. At the end of the election season we downloaded an MBOX export of the inbox and added it to our repository workflow. So we had the MBOX file, and our question at the time, and still now, was: what do we do with it? MBOX as an email format is great in that it is well structured. We tried processing all of the politicians' email from 2016, and again later in 2019, using Python scripts to extract the messages. We realized there was a lot of HTML in these email messages, so we used Python scripts to extract that as well, then converted it to PDF with wkhtmltopdf, an open source tool that renders HTML into PDF. So what we learned is that pathways to email processing and preservation do actually exist.
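As a rough illustration of the kind of extraction we did, here is a minimal sketch that pulls HTML bodies out of an MBOX file using Python's standard `mailbox` module. The function name and the take-the-first-HTML-part handling are my own simplifications, not the project's actual scripts.

```python
import mailbox

def extract_html_bodies(mbox_path):
    """Yield (subject, html) pairs for each message that has an HTML part."""
    box = mailbox.mbox(mbox_path)
    for message in box:
        # walk() yields the message itself for single-part messages,
        # and every sub-part for multipart messages.
        for part in message.walk():
            if part.get_content_type() == "text/html":
                payload = part.get_payload(decode=True)  # bytes, base64/qp decoded
                charset = part.get_content_charset() or "utf-8"
                yield (message.get("Subject", ""),
                       payload.decode(charset, errors="replace"))
                break  # simplification: take only the first HTML part
```

Each extracted HTML body could then be written to disk and handed to a renderer such as `wkhtmltopdf message.html message.pdf`.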
But the email accounts we processed lacked crucial information. We didn't do a lot of work, but we weren't happy with the output we got, and the pathways are challenging not only for us but, we imagine, for most archivists in the profession, both in terms of workflow and the tools available to preserve email. Much of this work required customized computer code, which we wrote mostly in Python. At the end of the day, we had email exports that were missing images, missing the CSS that was required to render many of the messages, and full of links to web content that were broken or lost in the process. There were also complications with email marketing software, which is so popular now. Most political email, we found, was being launched from tools like Mailchimp, where the sender works in a web application, rather than sending a blast from, say, an Outlook or Gmail account. So we found that email export files were effectively degrading over time. They did not include externally hosted content, such as images and CSS served from a platform like Mailchimp, so any interactivity within the email was or would be lost. Sometimes we saw significant data loss within just a matter of months, because so much of this content is hosted on the web, and as we know, the web is constantly changing, being updated, deleted, or lost. Links to web content were broken or lost altogether. And much email marketing and campaign software uses its own click-tracking, so the URLs are often not clearly addressable or human readable. Some notes about email preservation: email preservation is currently a challenging endeavor, and no single preservation format for email exists.
Compare this with other challenges the archival profession has faced. For images we have preservation formats like TIFF and raw, and for audio we have WAV. Both are widely used in the archival profession as well as in the private sector. But for email there is no single preservation format. We have MBOX and EML, which preserve structure very well but don't really handle content. PST is a proprietary Outlook format owned by Microsoft. And PDFs, while great for access, good at conserving document look and feel, and easy to print, do not preserve structure: at the end of the day you're merely left with an image, or, if you're lucky, an image over unstructured text. And emails are not static documents. As I said before, email is increasingly composed of HTML content linked to the web, with dynamic content like hover-overs, animations, and other typical HTML-based interactivity. Email marketing platforms like Mailchimp increasingly rely on dynamic web content. And not many open source processing tools are usable for archivists at this point. So the mailbag approach uses multiple formats. The approach in our project is to allow for multiple preservation formats. For example, a mailbag could contain any number or combination of formats, such as PDF, which provides ease of access; WARC, to preserve web content and interactivity; and MBOX and EML, to preserve structure. Any number or combination of those formats can be used in a mailbag. The many incarnations of what a mailbag may hold reflect the rich diversity of content held in creators' email accounts.
On to the mailbag deliverables. The grant project builds on BagIt, a specification developed by the Library of Congress, which is, quote, "a hierarchical file packaging format for storage and transfer of digital content." This is the underlying specification on which mailbag relies. In the BagIt specification, a bag has enough structure to enclose descriptive tags and a payload, but does not require knowledge of the payload's internal semantics, which is perfect for our purposes in creating the mailbag specification. The BagIt specification is widely used, it validates fixity, and it uses the computer file system for structure. The mailbag tool is being written in Python as we speak. The project will create a mailbag tool that can be used via a command line interface or a graphical user interface. A basic GUI will allow archivists who lack command line skills to use the tool, and it goes without saying that a command line utility will also be available, perhaps with advanced functions. The mailbag tool will package all formats of email into a mailbag, and it will enable capture. Now I'll talk about the specification development process. In building an open specification, we wanted it to be independent of the mailbag tool. We wanted the specification to be community driven, so we reached out to community experts for input and advice. We sought feedback for a variety of use cases, and we developed personas to aid in this process. We created a "suggest a user story" Google Form to make it as easy as possible for community members and others to submit feedback. Our user stories range from various types of archivist roles to historians and genealogists, as well as other roles.
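To make the BagIt structure concrete, here is a stdlib-only sketch that wraps a directory into a minimal bag: payload files move under `data/`, and a `bagit.txt` declaration plus a SHA-256 payload manifest are written alongside. A production workflow would more likely use the Library of Congress's `bagit-python` library; this hand-rolled version just shows the shape of the structure.

```python
import hashlib
import os

def make_bag(directory):
    """Turn `directory` into a minimal BagIt-style bag in place."""
    data_dir = os.path.join(directory, "data")
    os.makedirs(data_dir, exist_ok=True)
    # Move every existing file into the data/ payload directory.
    for name in os.listdir(directory):
        if name != "data":
            os.rename(os.path.join(directory, name),
                      os.path.join(data_dir, name))
    # The bag declaration required by the BagIt spec.
    with open(os.path.join(directory, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    # One manifest line per payload file: "<sha256>  data/<path>".
    lines = []
    for root, _, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            rel = os.path.relpath(path, directory).replace(os.sep, "/")
            lines.append(f"{digest}  {rel}")
    with open(os.path.join(directory, "manifest-sha256.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
```

Validating fixity later is then just a matter of recomputing the checksums and comparing them to the manifest.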
We also created a "suggest or prioritize a requirement" form to gather needs for the specification as well as the tool's design and functionality. We wanted to allow for additional implementations, support broader use cases where we can resource them, and promote interoperability. The project created an advisory board to consult on specification development and scoping, gather community feedback, and advise on the software development process. We put funds into the grant to offer honoraria for a mailbag specification working meeting, envisioned as a two-hour session hosted over Zoom. The project team, with feedback from the advisory board and the consultant developer, wrote the first draft of the mailbag specification. Greg and I hired two graduate student software developers from UAlbany to begin coding the tool in Python, and we instituted a collaborative development process that also includes the more experienced consultant developer. At this point I'd like to hand the presentation over to my colleague Greg. So, I'm going to talk about what we came up with for the mailbag specification. This is a sample of what it might look like. It will look very familiar to anyone who knows BagIt: a mailbag is just a BagIt bag with some extra requirements in the payload, or data, directory. The specification defines folders for whatever formats you're using, so if you have email represented as MBOX, PST, PDF, or WARC in your mailbag, those are the folder names. There is also an attachments subdirectory where all the attachments for the emails can optionally be stored. The only other requirement is a mailbag.csv tag file, which I'll talk a little more about in a bit. So, diving in a little more on what that payload directory looks like.
Within these format directories you can also have additional subdirectories that represent the folder structure of the email account, so if you have inbox or other email folders, that gets preserved as well. And notice that the layout is semantic: it is designed to be human readable, so it is structured in a way that both computers can use and humans can navigate, just as an extension of BagIt. The mailbag.csv file is needed to connect the different representations of an email with each other, and we do that via an identifier system. Emails have a Message-ID as a unique identifier, but it's not always included in email export formats, and it often contains special characters that don't write well to the file system as file names. So we unfortunately had to require a separate identifier system in the specification: a Mailbag-Message-ID field in addition to Message-ID. This can be any number that is unique within a mailbag and can be written to a file name, so it can just be a sequential number: 1, 2, 3, 4. The CSV contains both of those identifiers, as well as a message path, so you can find where the message sits within the mailbox directory structure in the payload directory; the original file name, so you know where the original source of the email was; and an integer number of attachments. Optionally, you can also put a lot of the common header information from emails into mailbag.csv, which may or may not be useful depending on your use case.
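The identifier scheme above could be sketched like this: sequential Mailbag-Message-IDs written next to each message's native Message-ID in a CSV. The exact column headers used here are illustrative assumptions, not the headers mandated by the mailbag specification.

```python
import csv

# Illustrative column names, not the spec's exact headers.
FIELDS = ["Mailbag-Message-ID", "Message-ID", "Message-Path",
          "Original-File", "Attachments", "Subject", "From", "Date"]

def write_mailbag_csv(path, messages):
    """Write a mailbag.csv-style file, assigning each message a
    sequential Mailbag-Message-ID (safe for use in file names)
    alongside its native Message-ID header."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for i, msg in enumerate(messages, start=1):
            row = {field: msg.get(field, "") for field in FIELDS}
            row["Mailbag-Message-ID"] = str(i)  # 1, 2, 3, ...
            writer.writerow(row)
```

Because the Mailbag-Message-ID is just a small integer, it can double as the base file name for every representation of that message (e.g. `1.eml`, `1.pdf`), which the CSV then ties back to the original Message-ID.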
So, as Mark gave an overview of, we really wanted this specification to be community driven, so we ran what we hope was a robust community feedback process: we sent out a community call, and then we held our working meeting to get people with hands-on expertise to really pull apart the format. But two questions in particular kept coming up that we couldn't get consensus on. One: can or should a mailbag contain multiple email accounts? And two: can a mailbag be used to package multiple versions of email, such as for weeding and redaction, which are really common in email workflows? We wrote these up as GitHub issues on the mailbag GitHub account, where there is more detail along with community comments, so we could figure out where these pain points were. We were really hesitant when we thought about these questions, because what we really liked about this project is that it had its own niche. It was really simple; it solved one problem: I have an MBOX file, now what do I do with it? It gets that MBOX file into a stable package that won't degrade over time, into good enough preservation formats that we can preserve it, while also enabling some near-capture processing, where we can make PDF files and build some sort of workflow to get access to the email. So we liked that niche for the mailbag, and these questions, we felt, went beyond it. But again, we really value the community feedback, and we went into this process ready to adapt our goals and the project to the community. We thought a lot more about why this wasn't meeting our expectations, and what I came up with is that these are not necessarily the questions that our users, or potential users, were really asking.
When we talk about whether a mailbag can or should contain multiple email accounts, we're really talking about maintaining data structures and relationships; and when we talk about multiple versions, we're talking about versioning. Both of these things are not email specific; they come up throughout digital preservation. Moreover, these are really challenging problems to address, and we don't really have tools to combat them. So we thought this was beyond what we could do, not only within the mailbag project, but it also stimulated some really interesting conversations about what we're asking of our tools. We keep asking our applications to do so much. These workflow tools duplicate effort, because each tool has to perform similar functions in different contexts, and they'll be less effective as secondary requirements, since that's not what the tool was primarily built to do. It's also hard to build workflow-specific tools, like tools specifically for versioning or maintaining data structures, because those structures differ across various tools and domains. The data you work with in digital video and in email is really different, and the tools around them are different, so it's hard to build versioning structures that work with both. But if we look at our needs for digital preservation, we have so many different complex workflows, each new data type adds complexity, and there are so many edge cases. I really like a talk by Elizabeth England about some of these edge cases. Digital preservation just has to be bespoke sometimes; we have to handle these really manual edge cases from time to time. And I can't imagine supporting web applications for all of these needs. Why do we keep trying to do things this way?
We talked about this a lot, and I think this is more of a labor structure problem. Our organizations have a division of labor between information professionals, your archivists and librarians, and technologists, your web developers and sysadmins. They're sorted into these two categories with two different traditional skill sets, and web apps and other large monolithic tools fit this model really well: the developers create and maintain the applications, and the archivists and librarians use the interfaces they build to fulfill their missions. But these custom interfaces are inherently limited; they can't cover all the bespoke digital preservation needs. We're never going to get a drop-down menu that fulfills all of our workflow needs for digital preservation. Digital preservation labor ends up crossing these organizational boundaries, and our practitioners' skill sets cross the boundaries as well. Practitioners really working in digital preservation need skills traditionally assigned to archivists and librarians, like curation and appraisal skills, and also skills traditionally in the domain of technologists: basic coding skills, and even skills that don't necessarily have to do with coding, like understanding file formats, how file systems work, text encoding, and so on. And because these practitioners' roles cross those boundaries, our organizations aren't structured well to support them. There's a really great article, "What's Wrong with Digital Stewardship," that there are links to in the slides, which I think really highlights this. And because we don't really support these positions, access to these roles is not equitable.
I don't have any perfect magical solutions to this, but in thinking about how it affects the mailbag project, I think we do have some steps forward at least. One of them is using the file system as an interface. File systems are great. They're ubiquitous in everyday computing life. They're really well maintained and supported; you don't have to support a file system interface from your library. They're also super flexible: you can do a lot within a directory and file structure. They're familiar for technologists to support; most places are able to support network shares, or cloud storage viewed through a network share. And, most importantly, they're familiar to users from a variety of backgrounds and skill sets: both your technologists and your traditional archivists and librarians can use a file system. We can provide both manual access for those bespoke digital preservation activities and computational access at scale. We can use scripts, and powerful software can act on a file system, but we can also navigate the directory tree by hand and find a file. To use the file system as this interface, we have to rely on semantic specifications. This is what we're thinking about with the BagIt specification and the mailbag specification. These are abstract documents that define what a tool does and how its data is structured in a common way. They are really important communication tools: we can come together and collaborate to define what our data looks like. And they're really cool because they allow participation across those organizational boundaries; both technologists and people with more traditional archivist and librarian skill sets can contribute to them.
And they provide a common data structure that promotes interoperability between tools. If a tool can output to these specifications on the file system, another tool can read the specification and be interoperable with it, and it can also be interoperable with bespoke preservation activities. In this model, the file system acts something like an API, and we can build really complex digital preservation tools around file systems. We can manage checksums and track fixity over time, we can validate file formats, we can do a lot of our digital preservation activities at scale on the files in a file system. But we can also write a one-off script to solve a single problem, and we can also just navigate the directory tree to find things, so nothing gets lost, and we can perform those bespoke, long-tail preservation activities when we have to really dive in and focus on those edge cases. The cool thing about this is that it allows our tools to have limited scope. They don't have to try to do everything; they can just output to the file system in a structured way, and another tool can pick up from there. In this model, we sustain ecosystems of tools around these file systems. Many small, simple tools fit our problems in digital preservation much better than large monolithic systems. We've known this for some time, but we haven't been able to build these small tools to be interoperable with each other. So I think it would be really helpful to rely on the file system as that interface, both between tools and between tools and humans. Semantic specifications are key to this interoperability, so that both humans and complex tools can utilize them in a structured way. And both of these things are cool because they bridge those organizational boundaries that are really complicated to work around.
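As one example of the file-system-as-API idea, here is the kind of one-off fixity script the model enables: it re-checks a bag's payload against its SHA-256 manifest on disk, with no application or database in between. The function name and its return convention are my own sketch, not part of any mailbag tooling.

```python
import hashlib
import os

def verify_manifest(bag_dir):
    """Recompute SHA-256 checksums for a bag's payload and compare them
    to manifest-sha256.txt; return the list of paths that fail fixity."""
    failures = []
    manifest = os.path.join(bag_dir, "manifest-sha256.txt")
    with open(manifest) as f:
        for line in f:
            # Each manifest line is "<sha256>  <relative/path>".
            expected, rel_path = line.strip().split(None, 1)
            full = os.path.join(bag_dir, rel_path)
            try:
                with open(full, "rb") as payload:
                    actual = hashlib.sha256(payload.read()).hexdigest()
            except FileNotFoundError:
                failures.append(rel_path)  # missing file also fails fixity
                continue
            if actual != expected:
                failures.append(rel_path)
    return failures
```

Because the specification fixes where the manifest lives and how its lines are shaped, this script works on any conforming bag, no matter which tool produced it; that is the interoperability the talk is arguing for.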
So this enables both the human-centered bespoke work, those long-tail digital preservation activities, and the computational work that we need to do at scale. This is the world that mailbag envisions. I've put up a bunch of links that might be useful: the mailbag project website, where you can look at all of our design documents, our requirements, our personas, and our user stories. We're currently working on collaborative development for the mailbag project this fall and next spring, so you can track that on our GitHub. There are also two links for citations that we referenced in the presentation. And here is our contact information; if you have questions, feel free to reach out. Thank you very much for attending our session.