Thank you. I'd like to welcome you all to this session. I feel a little bit like I'm preaching a sermon in a European cathedral here, with a number of us in this large room. But nevertheless, I thank you very much for coming out this afternoon. I'm going to be tag-teaming this performance with my colleague Kate Murray from the Library of Congress, so we'll take just a moment to introduce ourselves and then launch right in.

My name is Chris Prom. I am an archivist at the University of Illinois and also a faculty member. For the past 18 months or so, I've been co-chairing a task force on technical approaches to email archives along with Kate. So, Kate, if you would.

I'm Kate Murray. I work at the Library of Congress, and I specialize in digital file formats. I've been working with email and related personal information management formats for a number of years, and I'm happy to be co-chairing this with Chris. So why don't we just start our presentation?

Sure. This is a topic of deep interest to me as an archivist. We have email collections at the University of Illinois that we manage, and in addition to that I wrote a technical report about email archives several years ago. So the purpose of this task force, which is sponsored by the Andrew W.
Mellon Foundation and our partner, the Digital Preservation Coalition in the UK, is to assess current tools, methodologies, approaches, and conceptual models that can be used to acquire born-digital correspondence and to manage it as an archival or curatorial resource in repositories. We would, of course, like to thank both the Mellon Foundation and the DPC for supporting the work.

The charge of the group, as you can see from this slide, is essentially a technical one. We very explicitly ruled out policy development, legal analysis, and things along those lines, both because they're very difficult and because they vary considerably from institution to institution, from case to case, and, more to the point, from continent to continent. As those of you who have been following the news for the past few days know, the European context for information policy and personal information management is quite different from the US perspective. So we were really interested in assessing technical approaches that can meet a wide variety of operational and legal contexts.

We do have quite a broad membership, composed of an executive committee (myself, Kate, and a few other people) and a larger group that helped to author a report, which will be issued relatively soon. We've been very pleased with the group, which has done some excellent work that you'll hear about in just a moment. I would like to note in particular that we've had some representatives from industry engaged in the project, particularly from Google and Microsoft, who have both brought an interesting and complementary perspective on the types of issues that we're dealing with in archives.

In addition to that, the group has had some friends: people who have been given early access to the report and have offered us interesting and useful feedback. These range from state archivist groups, CoSA and NAGARA, through and including institutions like
the NHPRC and other grant-making agencies. One particularly fruitful avenue of feedback for us has been two briefing days that we held in the United Kingdom, where we presented initial versions of our work, first at the UK National Archives and then again at a conference facility in central London back in January.

One of the outcomes of the report is actually quite a bit of supplementary documentation, which we believe has considerable value in its own right. In authoring a report of this nature, obviously, many resources are developed that are useful but just can't fit into the report itself, so we've gone ahead and made those available on our project website. They cover things like email user features, privacy issues, and guides to email standards: things that you can track down, if you want to, by reading IETF standards documents or things along those lines, but that are hard to find in one place. So those are a useful output from the report, and we'll continue to support them.

Just to give a quick update on current status: we submitted a manuscript of the report in late March to CLIR, who will be acting as publisher. It is currently going through editing, and we expect it to be published in early June, probably, by the time everything is said and done. Meanwhile, we are continuing to discuss potential follow-on work from the report, which is really the purpose of our presenting here: to give a summary of the report and, more particularly, to focus on its recommendations, leaving a considerable amount of time for discussion, questions, and commentary, so that whatever work is brought forward from the task force report can meet the needs of the community as well as possible.

So with that brief introduction, I'll turn things over to Kate, who will describe some of the main parts of the report before we turn to the recommendations.

So, as Chris
just mentioned, we're just going to do a high-level overview of the sections of the report. The first section of the report addresses recognizing the value of email. Both Chris and I sat in the executive roundtables on researcher use of email yesterday and today, and these topics came up, certainly in mine and perhaps in Chris's as well: what data lies within an email collection, who might find it useful, and why accessioning is problematic. So we discuss those in the first section.

Just a little bit on this image. It's Rodin's The Call to Arms, and what we really wanted to get across in the report is that email is really technology-bound, right? The technology needs active, focused care and feeding. Without that, the data within will remain inaccessible and unaccessed until these bonds are broken. So the call to arms is: the time to act is now, and the people to do it are in this room, along with lots of others who are not in this room.

So, some of the things that need to change. We need to embrace email as a complex research data object. We need to harness new technologies like NLP (natural language processing) and machine learning. We need to encourage content creators to take an active role in its preservation, even for personal papers. And we need to build towards greater tool interoperability and towards deeper community integration and engagement.

In the second section of the report, we talk about the email life cycle. Email really comes from two perspectives. One is a records management perspective, which is typically well integrated if you're in a large institution.
The other is when email comes in through gifts and personal papers, and those are really two completely separate things. If it's in records management, it's typically part of a formalized records management program. But if it's coming in through a donated-materials context, the email is usually outside of any records management or record-keeping structure, typically without organization or conscious attention to what has been retained or deleted.

This is a representation model that we're using to demonstrate how a life cycle applies to email in general. As email moves through the life cycle, from creation and appraisal to acquisition and processing to preservation, discovery, and finally access, we encounter lots of different players with different needs, and many technological challenges. Decisions made at every stage have strong implications further down the chain, and along this life cycle various players, from the correspondents to the collectors, may decide to keep or weed any of the messages or attachments or any of the associated data that travels with them.

In the third section of the report, we dive deep into what email really is from a technical perspective and how it works. Given its ubiquity, you would think that it's well understood or documented, but in fact it's not as well documented as one might hope. At its heart, email is a transaction, right? In the case of an email, we take the message component. That's the actual email message, but it also might include attachments and, something that came up in my executive roundtable this morning, links to external data, networked data. How does that factor into capturing email? Are we now doing web archiving in addition to email archiving? And there are other features beyond text as well. So it's really kind of difficult sometimes to draw the line around the box of what email is. It's not as obvious as you might think.

I'm going to turn this over to Chris.
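To make the point about the boundary of an email concrete, here is a minimal sketch, not drawn from the report, that uses Python's standard `email` package to list what travels with a single message: its headers, its attachments, and any links to external data in the body. The sample message, the addresses, and the `inventory` helper are all made up for illustration, and the link pattern is deliberately crude.

```python
import re
from email import message_from_string
from email.policy import default

# A made-up message with one text part, one attachment, and one external link.
RAW = """\
From: alice@example.org
To: bob@example.org
Subject: Draft report
Message-ID: <1@example.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

See the attached draft and https://example.org/shared/draft-v2 for the live copy.
--XYZ
Content-Type: application/pdf; name="draft.pdf"
Content-Disposition: attachment; filename="draft.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQK
--XYZ--
"""


def inventory(raw):
    """Hypothetical helper: summarize what one message carries with it."""
    msg = message_from_string(raw, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return {
        # Repeated headers (for example Received) collapse to one entry here.
        "headers": dict(msg.items()),
        # Files that travel inside the message itself.
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
        # Pointers to content that lives outside the message.
        "links": re.findall(r'https?://[^\s<>"]+', text),
    }


if __name__ == "__main__":
    print(inventory(RAW))
```

The attachment is inside the message and can be preserved with it. The link only points at networked data, which is where the question of web archiving comes in.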
Okay. So, after reviewing essentially what email is from a documentary point of view, and to set the stage for the challenges of capturing it and the approaches that can be used, the report includes a brief assessment of current trends, both in industry related to email and in archival repositories.

There's a good section of the report that talks about what's happening in the broader community, because obviously email services are a very, very big industry. There are lots of services sold around authenticating email and trying to prevent spam and phishing attacks and all of those kinds of things, and these abuse-prevention technologies can both assist and complicate the attempt to preserve email or make it accessible over the long run.

Another area of really strong interest to us is tools that focus on business compliance and legal discovery. For various reasons that will be obvious to those of you who have been following the news recently, the industry has developed the means and the desire to look into what's happening in any kind of communication technology, email being a primary one of those. The problem with a lot of these tools is that while they're very useful for legal discovery, they're also very, very expensive. They provide many features that could potentially be very useful to archivists or curators who might want to, for example, redact a collection, or identify or classify correspondence that really is out of scope for public access, or really any kind of access, because of legal restrictions or what have you. But they're simply beyond the reach of most archives. So, much like the lawyer in this joke, they're very, very useful, but they're also potentially very, very expensive. One area for potential work within the community is to develop open-source versions of some of these tools, or
to work on pricing models that would complement what high-priced New York law firms pay for these tools. One of the more interesting moments in the work that we did was hearing from a lawyer who was actually involved in the LIBOR investigations, which led to a very large fine for Barclays and for some other banks. He pointed out that they're essentially in the same type of business as historians are, telling a story by trying to piece together evidence. So these tools have a lot of potential use, but perhaps in a slightly different way.

The report also spends a fair bit of time assessing repository challenges, specifically around capturing email materials. Because they sit in a variety of systems, that's very, very difficult. They come from Outlook, they come from a lot of different services, and they all need to be handled somewhat differently. There are also challenges around ensuring authenticity and tracking processing actions. Kate has already mentioned the difficulty of dealing with attachments and linked content. In the executive roundtable session that I attended yesterday, there was a very interesting discussion around the fact that within email collections you may have very, very interesting documents that are attached, and those might in themselves be preservation objects worthy of attention. As a transaction, when a report is sent from someone to someone else, there's a bit of evidence around that, and the report itself might have value. But how do you preserve it?
Then security and privacy issues, and also processing at scale, are some of the challenges that libraries and archives would face when dealing with email collections.

One key section of the report is about potential solutions and some sample workflows. There's nothing necessarily new here about the various preservation approaches. The first one would be bit-level preservation, which I describe as "you get what you get and you don't get upset." I would say that, for the most part, that's what people are implementing with email collections. They are ingesting the email collections (maybe they bag them, they ingest them), but there's no access. There's typically no processing and no appraisal, because they haven't built up those tools yet.

Format migration is a big part of email archiving, mostly because tools such as ePADD and some others are format-dependent, so there's a lot of format migration that happens. And emulation: for email archiving, I think it's really an opportunity where emulation can shine, right?
If you're looking to explore email in its native environment, I think it's a great opportunity for emulation to do that.

In the report, we did an extensive review of tools, both from within the cultural heritage community and from other communities, including legal discovery. A review of those tools is on our website, but one of our recommendations is going to be to move it to COPTR to help keep it more current. We also have sample workflows for each of the different preservation options, which you can adopt, from organizations such as Stanford, Harvard, the Smithsonian, and others.

The meat of this report, or the meat of our talk today, is really the recommendations and the next steps. We have organized our recommendations into two categories, one around community engagement and advocacy and another around tools and tooling, and then we further subdivided those into what we call low-barrier activities and higher-impact activities.

So, the first section that we're going to talk about. Remember, I said on an earlier slide that the people who should do this work are you and me: the royal "you," which would include myself. We have a number of what we call low-barrier activities that can sort of self-organize, and I should say that members of the task force are hopeful that we'll be able to participate in this work as well.

The first one is about assessing institutional readiness. What we would like to do there is to make a version of the NDSA Levels of Digital Preservation specific to email, so it would help individual institutions understand where they fall on the spectrum. Are they ready to ingest email at all? Are they bit-level-preservation ready, or are they further along?
The second one, and I should also say that we heard a lot of echoes of this in my executive roundtable, and probably Chris did as well, is the need for skilling up existing staff and new staff to deal with email collections. What makes email challenging? What makes email special? What tools are available? What do folks need to know in order to be email-archiving ready?

We have heard the next one quite a bit: that there's an issue of trust with donors donating their email collections to institutions. One of the recommendations from the task force is to develop a sort of template that staff can use to talk with donors about what those issues are, to take that fear away, and to explain what it is possible to redact and what tools can do around sensitivity review, to help demystify the email process for donors.

We didn't mention this earlier, but email is really a domain in which the personal, the private, and the professional intermingle, right? My boss is in the audience, so of course I never use my work email for anything personal, but people often use email for multiple purposes, right?
You might be talking about making a lunch plan, and your next email is about some major event at work. So there's really an intermingling of those, and I think that's partially the fear that donors have in sending their email along.

I mentioned earlier the idea of moving our assessment of email tools into COPTR. And finally, another one is to develop a format comparison matrix for email formats. There are certainly models for this. In my day job I run the FADGI group, the Federal Agencies Digital Guidelines Initiative, and we've done a lot of work in its audiovisual working group, including a lot of work around format comparisons. I think it would be very interesting to lay out a format comparison matrix for popular email formats, both proprietary and non-proprietary.

So, some of the higher-impact activities for community engagement. The first is really around sustaining the email archiving community. We're not yet ready to move to, potentially, consortial activities for some of the open-source email tools, but we really need to think about how we can sustain some of the tools that are already out there.

Another one is around planning specifications for beginning-of-life-cycle tools. In the US, certainly, the state government repository programs have taken the lead in customizing some of the enterprise-level tools that have been effective for capturing, identifying, and managing records. The wider community can learn from their experiences, potentially through a summit or a collaborative knowledge-sharing project.

A big one is developing criteria for what makes an email authentic, right?
If the email has gone through format migration or has been ingested, what are the key components of that email that say it is, at its essence, what it's supposed to be, what it declares it should be?

One of my favorites: my standard joke around these things, because another part of my day job is the Sustainability of Digital Formats website at the Library of Congress, is that I read the RFCs so you don't have to, right? I've read the RFCs for mbox and they are unclear, right? They don't specify all of the different versions of mbox. And there is no RFC for EML, right? It's based off another format. So it would be great to notch up the standards documentation for these email formats.

Yet another one, and I'll talk quickly, is to improve PDF as a viable option for email. We all have "export to PDF" in our email systems, but if you export to PDF you typically lose a lot of your header metadata, you potentially lose some threading, you might not have your attachments, et cetera. The PDF community, especially the PDF Association, is super keen to help us with this problem, if we can define requirements for what is authentic about an email that could be translated into the PDF format. Gosh!
They're all over that, because that's a huge new market for them. Some of the members of the task force, including the Library of Congress and the National Archives, are members of the PDF Association, and they're very keen to work on that.

And finally, to demonstrate email as a source of research data, we have what we call our data challenge, in which we would potentially work with some organizations that could provide open, full access to email collections, invite some scholars and some digital humanities people in to do research, and then have them write about what their research experience was and about the research that would not have been possible if they had not had access to an email collection. And Chris is going to wrap us up.

Right. As Kate indicated, the second set of recommendations is really based around tooling and tool support. Some of these can be really high-impact activities. Take, for example, the PDF one that Kate mentioned: if vendors of tools were able to produce better PDF files when people print them from Outlook or whatever, the filing of those would be much better. There are still a lot of archives that are following a sort of print-and-file approach, so there are a lot of things that can be done in that respect.

But on the low-barrier side of things, one of the first steps would really be to test the existing tools that are out there and find out a little bit more about how they do at retaining or not retaining data. There's just a real basic level of research knowledge that we don't have about how these tools work when you move things from one format to another.

That leads into the second point, about improving format identification, characterization, and validation tools. If you take a message exported from Outlook as a PST file and run it through some of the tools that exist, such as ePADD or TOMES or some of the industry tools, do you really know that what you got
out on the other side is the same thing as what you put in? So that's again a real basic research question, around which we'd like to do some additional testing.

Then some of the higher-impact activities. I think these could be really interesting research projects, bringing multiple fields into play, including curatorial practice in archives as well as computational approaches, machine learning approaches, and artificial intelligence approaches.

One of the biggest needs, actually, is for improved tools around sensitivity review. I think those of us who have worked with people in universities who have email they'd like to preserve, or with legal counsel who would like to see a lot of email go away, know that one of the biggest concerns is trying to filter out things that might need to be restricted for legal reasons or what have you. The tools are pretty good, but again, there's a real lack of knowledge about whether they're better than a human or not. I would think that if we can do facial recognition technologies as well as we can, we can probably do good sensitivity review of text-based collections as well. So there's quite a bit of opportunity there for additional work.

Another really interesting theme that surfaced in discussions of the group, as well as in the executive roundtables yesterday and today, was the desire for a self-archiving tool.
As people leave positions, as they move from institution to institution, they may have a desire or a need to take records with them, for their own institutional or personal memory or to pass on to somebody else, and there are a lot of very idiosyncratic practices that take place around this. In addition, such a service, if it's scoped properly, could I think be quite useful for people who want to maintain a record of their own correspondence under their own control. It wouldn't necessarily need to include just email records. It could include other types of documentation as well, whether they come from Slack or Twitter or whatever uses an API. So I think there's potentially an interesting project there.

And then one of the biggest is developing standards for tool interoperability. One of the points Kate mentioned is that there are a variety of tools out there, and they all require rather complex workflows to chain together. It's possible, it's doable, but it's hard. So simplifying that through interoperability is really one of the main themes of the report.

So that's the report. We'd like to leave a few minutes here for questions and comments. We really hope to see the report's work brought forward in some way by the community, so we just open ourselves up to your questions at this time.
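The round-trip question raised in the tooling recommendations (is what came out the same as what went in?) can be sketched as a simple before-and-after comparison. What follows is only an illustration using Python's standard library, not a method from the report or from any of the tools named in the talk. The two sample messages, the simulated lossy conversion, and the `fingerprint` and `compare` helpers are assumptions made up for the example.

```python
import hashlib
from email import message_from_string
from email.policy import default


def fingerprint(raw):
    """Hypothetical helper: reduce a message to comparable components."""
    msg = message_from_string(raw, policy=default)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return {
        "headers": {name.lower(): str(value) for name, value in msg.items()},
        "body_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "attachments": sorted(
            hashlib.sha256(part.get_payload(decode=True) or b"").hexdigest()
            for part in msg.iter_attachments()
        ),
    }


def compare(before, after):
    """Report what a conversion step kept, changed, or dropped."""
    a, b = fingerprint(before), fingerprint(after)
    return {
        "headers_lost": sorted(set(a["headers"]) - set(b["headers"])),
        "headers_changed": sorted(
            name for name in a["headers"]
            if name in b["headers"] and a["headers"][name] != b["headers"][name]
        ),
        "body_intact": a["body_sha256"] == b["body_sha256"],
        "attachments_intact": a["attachments"] == b["attachments"],
    }


# Made-up input message.
ORIGINAL = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Lunch\n"
    "Message-ID: <42@example.org>\n"
    "In-Reply-To: <41@example.org>\n"
    "\n"
    "Noon works for me.\n"
)

# Simulated output of a lossy conversion that drops the threading headers.
CONVERTED = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Lunch\n"
    "\n"
    "Noon works for me.\n"
)

if __name__ == "__main__":
    print(compare(ORIGINAL, CONVERTED))
```

In this made-up case the body survives but the headers that carry threading do not, which is the kind of loss described for PDF export. A real test would run actual tool output through a comparison like this across a whole collection.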