Greetings everyone. I want to welcome you to our session, Email Archiving: Building Capacity and Community. We want to thank the CNI conference organizers for making this event possible and personally thank you for joining us. This panel will have four speakers discussing projects related to new developments in email archiving and how, together, we can make significant progress in fostering email archiving across a wide range of institutions. I am Ruby Martinez, the email archives community fellow for the Email Archives: Building Capacity and Community program, also known as the EA:BCC. The EA:BCC regrant program is funded by the Andrew W. Mellon Foundation and is a way to incite progress in preserving email across a wider range of institutions. Our overall goals for this program highlight the importance of transferability and dissemination to cultivate a community of practice. In our first round, we awarded grant funding to five institutions deeply involved in innovative email archiving activity, three of which you will be hearing from today. The first round of the program focused primarily on tool development and is collaborative in nature, connecting partners across multiple institutions or supporting email archiving community connections. However, I will let my co-presenters speak to that later. Before we get into their programs, I briefly want to mention that we have a second call for proposals with a deadline at the end of June. Stay tuned until the end of our presentation to hear about some target areas and potential projects we hope to see in the next round of applicants. With that, I will turn it over to Matt.

Yeah, thank you, Ruby. So I'm going to talk to you for a few minutes about a project here at Columbia with my partners at History Lab. What we're trying to do is take the way that archives often handle electronic records, that is, converting them into PDFs.
We're trying to figure out ways in which we can take what was originally born-digital content that's now been rendered as PDFs, turn this text back into data, make it accessible, and also make it easier to preserve over the long run. So, to get started, next slide. There are different components to this larger project. As you can see, what we're hoping to do, as a case in point, is aggregate records released by government, that is, the federal government and state and local governments, related to the response to COVID-19. At the same time, we're assembling a panel of public health experts, historians, journalists, and social scientists to try to determine what kinds of records we want to try to obtain from government and, given the way in which many government archives, especially state and local ones, are underfunded, identify those records we really do want to keep for the long run. The last part of this is where we're developing tools, not just for this project, but that could be applied to many other email archiving projects. This is the part that's funded by the University of Illinois and ultimately the Mellon Foundation, so I'll spend more time talking about that. Next slide. Now, what we're doing here is developing tools, and we don't want to reinvent the wheel. There are already some excellent platforms, the ePADD project, for instance, that are specifically built to make email archives more discoverable and make them easier for archives to preserve. So what we're focusing on is the problem that many of us encounter where the data that we would like to access and discover is embedded within PDF files. If you look at the example on the right, what often happens is you take something that was born digital, and it's now been rendered as a PDF.
What you want to do is extract the metadata, like the From, To, Subject, and Date fields, and separate that out from the body text. Once you've done that, once you've turned text into data, organized it into a database, and made it accessible through an application programming interface, you can support all kinds of fascinating research. If you want to see more examples of that, you can look at the History Lab website at history-lab.org. You can go to the next slide. So this is where our data is coming from. To begin with, we have hundreds of thousands of records that have been gathered by journalists at the Columbia Journalism School in a project that's been funded by the Brown Institute for Media Innovation. These records come to the journalists in the form of PDFs, some of them thousands of pages long. At the same time, we're hoping to obtain documents from MuckRock, which is now the manager of DocumentCloud, where many more journalists are storing records they've obtained, not just from state and local governments, but also from federal government agencies like the CDC, the FDA, and so on. So we have many, many different PDFs, some of them thousands of pages long, and just identifying which of these PDFs were generated by email clients turns out to be a job in itself. When they've been generated by email clients like Microsoft Outlook, there's sometimes metadata embedded within these PDFs that we want to use for research and preserve over the long run. In other cases, we end up with PDFs that may in fact just be scanned emails, and in these cases we have a separate pipeline for processing them. In many cases, even when OCR has been done, we have to redo the OCR to try to improve the image quality and the text output. You can show the next slide.
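As an aside, the header-extraction step described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual History Lab pipeline; the field names and sample text are invented.

```python
import re

# Hypothetical sketch: pull common header fields out of text that was
# extracted from a PDF rendering of an email. A real pipeline would
# handle multi-line headers, quoted replies, and OCR noise.
HEADER_RE = re.compile(r"^(From|To|Subject|Date):\s*(.+)$", re.MULTILINE)

def extract_headers(text: str) -> dict:
    """Return the first occurrence of each recognized header field."""
    headers = {}
    for field, value in HEADER_RE.findall(text):
        headers.setdefault(field, value.strip())
    return headers

# Invented sample text standing in for PDF-extracted content.
sample = """From: health.director@example.gov
To: mayor@example.gov
Subject: COVID-19 case counts
Date: Mon, 6 Apr 2020 09:14:00 -0400

Attached are the latest numbers for review.
"""
print(extract_headers(sample)["Subject"])  # COVID-19 case counts
```

Once fields like these are separated from the body text, they can be loaded into a database and exposed through an API, as the speaker describes.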
So what we're trying to do through this project is make the FOIArchive corpus's COVID-19 collection a case in point of what you can do when you have these massive PDFs. In this case, the pipeline is going to ultimately result in these records being stored by History Lab but made accessible through the Columbia Libraries' online catalog, that is, CLIO. We're building this pipeline around these particular files to make this particular point, but our hope is that by sharing this software and making it open access, other projects and other archives are going to be able to handle other kinds of data with many of the same kinds of solutions. So let me go to the next slide. This says a bit more about the scope of this collection if we succeed. Again, it's proving to be a real challenge: just the records obtained by the Columbia Journalism School amount to hundreds of thousands of pages. But again, if we succeed, and especially if we can get some follow-on funding, we're hoping to obtain many more records through DocumentCloud and MuckRock. We're also going to be convening this panel of experts to start targeting different departments and agencies for FOIAs, to try to fill in the gaps and find the highest-priority records that we want to retain and start accessing for research. So we can go to the next slide. Just to give you a sense of it, these are just records obtained by the Columbia Journalism School project led by Derek Kravitz. As you can see, these are just the top files organized by size, and each one of these is likely to contain many thousands, even tens of thousands, of emails. Now, we're only going to try to cover the period of the outbreak from its beginning, that is, December and January of 2020, up until June of 2020. But again, if we can manage to sustain this project, we're hoping to grow the collection over time. And if you can go to the next slide.
This is again just one example of the kinds of information we're finding. One of the other challenges, and the last one I think I'll have time for, is that a lot of these records actually contain personal information. Even though they've been reviewed by state and local authorities who were meant to observe privacy laws, in some cases we're finding that they still have PII. This is a challenge we're trying to manage, in some cases using software and some help from friends at a startup called Text IQ, and in other cases by trying to see if we can build our own machine learning algorithms to identify PII before we make these records more widely accessible. Next slide. And there you have it. So if you have any questions, please reach out to me. I'm at Columbia University, and you can also find us at history-lab.org.

So I guess I'm next. Hi everyone, I'm Greg Wiedeman. I'm the university archivist at UAlbany, and our project is Mailbag: A Stable Package for Email with Multiple Masters. Can you go to the next slide, please? So, email preservation is currently pretty challenging. One problem is that there's no real single master format for email. MBOX and EML files don't preserve external content hosted on other servers, like images or CSS. PSTs are similar, and they're also proprietary. PDFs do a great job of preserving the document form of emails for easy access, but they don't preserve the structure in a machine-actionable way. And not all emails are static documents. Emails are really web pages: they have HTML and CSS in them, so you can do things like hover-overs, and some marketing emails utilize this interactivity. Another problem is that there just aren't that many open source processing tools that are currently usable for archivists. Next slide, please. So we did a project at UAlbany that was collecting the fundraising emails of New York politicians, and we encountered many of these problems.
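The rule-based end of the PII screening the speaker mentions can be sketched very simply. This is a minimal illustration using invented patterns and sample text, nowhere near what a vendor tool or a trained model would do, but it shows the shape of a first-pass screen before human review.

```python
import re

# Hypothetical first-pass PII screen: simple regexes for a few
# common identifier shapes. Real systems use far richer rules and
# learned models; these patterns are for illustration only.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def flag_pii(text: str) -> dict:
    """Return the spans matched by each pattern, for human review."""
    return {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}

# Invented example text.
hits = flag_pii("Call 518-555-0142 or write jdoe@example.com, SSN 123-45-6789.")
print(hits["ssn"])  # ['123-45-6789']
```

In practice a screen like this only flags candidates; an archivist still has to decide what is actually restricted before release.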
This is an example of an email from a New York congressman that has an embedded image and a link in it. Even though we had this email preserved as an MBOX file, we actually don't have that image, because that image is hosted on another server. A further complication is that, because they were using email marketing software, the actual href URLs embedded in these emails point to that email marketing server. They redirect and track clicks, because the senders want to see who's reading their emails. So even if this content is hosted and maintained on the web somewhere, it might not be preserved, or we might not be able to find it. Next slide, please. And we're also finding cases where there are 404s in the email. We tried to process these about 18 months after we originally captured them, and we found that a large portion of the content was already lost, just because it was hosted on external links and we had only preserved the MBOX files. Next slide, please. So our approach in this grant is to find a process for preserving email in multiple master formats. We can have MBOX and EML files that preserve the structure of email as data; PDF files that preserve the document nature of emails and allow for easy access; and WARC files, because emails are really web pages, so we can use web archiving to preserve that interactivity and allow them to be replayed and sort of emulated, so we can use them like that in the future. Next slide, please. So how can we do this in a sustainable way? We really need a stable package. So we decided to use BagIt, which is a really widely used format in the digital preservation community. It also includes a process for validating fixity, and we liked it because it's really simple.
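The link-rot problem described here starts with knowing which external resources an email depends on. A minimal sketch, using only the standard library and invented sample markup, is to walk the email's HTML and list every linked or embedded URL so they can be captured, or at least flagged, near the time of acquisition:

```python
from html.parser import HTMLParser

# Hypothetical helper: enumerate the external resources an HTML email
# depends on (linked pages, remote images). A capture workflow could
# then fetch these into a web archive before they go offline.
class ResourceLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.urls.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.urls.append(attrs["src"])

# Invented markup standing in for a fundraising email body.
html_body = ('<p><a href="https://example.com/donate">Donate</a>'
             '<img src="https://cdn.example.com/banner.png"></p>')
lister = ResourceLister()
lister.feed(html_body)
print(lister.urls)
```

As the speaker notes, many of these URLs are tracking redirects through a marketing server, so even a prompt capture may need to follow redirects to reach the real content.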
It uses the file system for a lot of its structure, so it's really easy to work with and build upon. So, as part of this grant, we are going to draft a specification for Mailbag, which will be a specific type of bag for managing email files. Within the data directory of a bag will be a few different directories for, optionally, different types of email formats, whether MBOX, EML, PDF files, or web archives, and there will be specific types of metadata mandated in the bag-info.txt file that will enable tools to utilize it. The cool thing about this is that it's interoperable: it's not necessarily reflective of any specific tool set, and you can just store it on a file system. So we hope that, if we're successful, other tools that are successful at other aspects of email preservation, like ePADD and the RATOM project, might be able to utilize the Mailbag specification as well, whether through our implementation or through their own implementations. Next slide, please. So the other problem is that email processing is pretty challenging for archivists, and we also need to make processing emails easy enough that we can do it near to capture, so we can get that external content before it expires or goes offline. There are a lot of existing tools for doing much of this work, whether Python libraries or different ways of generating PDF files from emails or capturing web archives, but they currently require a level of expertise that makes them hard for many archivists to run. So, also as part of this grant, we're going to build a Mailbag Python library that will include a lot of these tools, and a command line utility so archivists, much like with bagit-python, can run a couple of commands to build a mailbag that will package and process your email, hopefully near to capture. And we're also going to create a basic graphical user interface using the Gooey library, so we can have that type of access as well.
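To make the BagIt idea concrete: payload files live under a `data/` directory, and a manifest records a checksum per file so fixity can be validated later. The toy helper below builds a bag-shaped directory with an `mbox/` subdirectory in the spirit of the Mailbag proposal; it is an illustration of the structure, not the real bagit-python or Mailbag code, and the file contents are invented.

```python
import hashlib
import tempfile
from pathlib import Path

# Toy sketch of a BagIt-style package holding an MBOX payload.
# Real bags also include bagit.txt and richer bag-info.txt metadata.
def make_toy_bag(root: Path) -> Path:
    (root / "data" / "mbox").mkdir(parents=True)
    (root / "data" / "mbox" / "inbox.mbox").write_text("From sender ...\n")
    # One manifest line per payload file: "<sha256>  <relative path>".
    lines = []
    for f in sorted((root / "data").rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(root)}")
    (root / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")
    # Illustrative metadata field; the Mailbag spec defines its own.
    (root / "bag-info.txt").write_text("Mailbag-Source: mbox\n")
    return root

bag = make_toy_bag(Path(tempfile.mkdtemp()))
print((bag / "manifest-sha256.txt").read_text())
```

Validating fixity later is just a matter of recomputing each checksum and comparing it against the manifest, which is what makes the package stable and tool-independent.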
Next slide, please. So this is hopefully what it will look like. The mailbag is sort of the center of this, the stable package that preserves the email. You can utilize the mailbag tool to get email, whether over IMAP or from other external sources, or serialized as MBOX or PST files, and turn it into mailbags. Then, later on down your workflows, you can use the mailbag tool to report header information, or export PDFs, WARCs, or that MBOX or EML data into other tools, to support existing workflows. If you have any other questions, I'm at the University at Albany, and our contact information is in the slides. Thank you very much.

Hi, I'm Michelle Gallinger, and I'm working with the Council of State Archivists on the PREPARE grant, which is Preparing Archives for Records in Email. This is critical because email is so much of what is being produced in government records nowadays. Governors, lieutenant governors, secretaries of state, key state legislators, and many more are producing government records in their daily correspondence via email, and archives are not always ready for those records. Next slide, please. So we have what I'm thinking of as a trident approach, with five prongs, for how we are going to be approaching the work of the PREPARE grant. The first is going to be the needs assessment. Although we have a good sense of what state and territorial archives are doing and where support is needed, we really feel that having the state and territorial archives clearly articulate the roadblocks that they see in their daily work to archiving, preserving, and providing access to email collections will help us make sure that the support we're offering is the support that's actually needed. I'm going to talk a little bit about each of these prongs in more detail, so next slide, please.
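The "report header information" step in a workflow like the one just described can be approximated with the standard-library `mailbox` module. This is a sketch of the idea only; the sample message is invented and the real Mailbag tooling's reporting is its own design.

```python
import mailbox
import tempfile
from pathlib import Path

# Invented one-message MBOX file standing in for captured email.
sample = (
    "From sender@example.com Mon Apr  6 09:14:00 2020\n"
    "From: sender@example.com\n"
    "To: archivist@example.edu\n"
    "Subject: Test message\n"
    "\n"
    "Body text.\n"
)
path = Path(tempfile.mkdtemp()) / "inbox.mbox"
path.write_text(sample)

# Hypothetical header report: one dict per message in the mailbox.
report = [
    {"from": m["From"], "to": m["To"], "subject": m["Subject"]}
    for m in mailbox.mbox(str(path))
]
print(report)
```

A report like this, generated from the packaged MBOX files, is the kind of output that lets later workflow stages (appraisal, description, access systems) work from the same stable package.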
The needs assessment is going to focus on the current email archiving capabilities of all 56 state and territorial archives. We want to make sure that we are supporting the state and territorial government archives in their work to transfer email records from the point of origin to the preservation system, preserve them, and also provide access to them over time. The report will be published and made available, and we will use it to determine the further focus of the work of the grant. Next slide, please. One of the things that we're really focusing on are documents. While there's been a lot of research and a lot of reports produced about email, particularly over the last five years, we find that state and territorial government archives are in the thick of preservation right now, and sometimes they need a very specifically tailored guidance document to help them where they're at, rather than something that is more theoretical or more focused on the possibilities. We plan to create best-practice guidelines for state, territorial, and also DC archives, and provide a path to tiered levels of email records management. We'll make these documents available on the CoSA website, and we're going to include templates and guidelines so that all archives can use the work that we're doing to update their own plans and workflow practices. Next slide, please. The third prong of our trident is an apples-to-apples comparison in archival environments. A lot of evaluations have been done on processing software. As some of my co-presenters have mentioned, personally identifiable information, PII, is something that's very critical for archives to manage properly; they're under statutory obligation to do so. And currently, a lot of the work that happens in processing email to get it ready to be released is done by hand, by individuals.
We're interested in taking the opportunity to evaluate RATOM, ePADD, BitCurator, and possibly some other software against a common data set, with a common restriction on time and common restrictions on staff intervention. Sometimes we hear back from state and territorial archives about the amount of staff time it took them to get an email collection processed, and it is very difficult to compare the accuracy, usability, and time and resource demands of the software when the collection types differ, the collection sizes differ, and the amount of staff available for intervention is different. We really want to make this an apples-to-apples kind of comparison and produce an evaluation document covering the email processing software's performance, pros and cons, and suggestions for how archives can use processing software most efficiently. Next slide, please. We are also looking at developing mentoring and assistance, specifically direct assistance to support state and territorial archives that still need to develop the foundation to preserve email records. We often talk about teaching somebody to fish, and that really only works if that individual or that organization has the means to do so: the rod, the reel, the net. Some state and territorial archives need to establish the foundational elements necessary to preserve state electronic email records. They need specific workflow plans and archival policies; they need the documentation they can then take to their stakeholders to request funding and develop the infrastructure that they need. So we're going to be offering direct support in helping those archives create the materials they need and solicit the financial and infrastructure support they need. We're also going to be pairing the states that need a lift up with other states in mentoring relationships, to provide additional support for developing plans and skills.
That will allow those state and territorial archives to perform the work: transfer, preserve, and provide appropriate access to the archives. Next slide, please. We're really focused on community. We're focused on making the community larger; the community of states that are currently managing email needs to include all 56 states and territories. And we really are very interested in expanding what the archival community is working on together. CoSA has a strong community identity and has worked to support one another through the COVID pandemic, through various emergency response issues, weather events, and other problems. We want email to be something that the CoSA community works on together as well: scheduling specific times for the communities of practice to meet, focusing on information exchanges around specific topics, supporting mentor relationships, getting the community to refine the materials that are being produced, and hopefully someday holding some in-person conferences again, where the communities of practice can meet up and be together to forward the work. And that's it.

Well, as you heard, there are many developments in archiving email, and with the second round of proposals, we expect to see more. On this slide is a preview of our project website, where you can find important details about the program, like how to submit a proposal, and some more general information. You can access it by searching "email archives building capacity and community" on Google. In this next round of proposal submissions, we hope to see project ideas related to the following efforts: projects that implement low-barrier email processing workflows, with a particular emphasis on email-to-PDF workflows. The report titled "A Specification for Using PDF to Package and Represent Email" highlights use cases for developing email-to-PDF tool sets and pathways that you can refer to; this report was recently published and is available at the link on the slide.
Another idea to focus on is tools to harvest linked content, for example, integrating a web archiving tool that will capture linked content attached to emails. This is a way to get ahead of broken links and increase the accuracy of capturing the intended content. Projects focusing on local government emails and non-governmental organizations are also appreciated. This next one is honing in on the community aspect of our program goals with community email archiving, which can look like partnerships between archives and community groups or those leading social change, to capture email archives and develop a framework for ethical practice in email archiving. And of course, your ideas. The majority of these projects are centered on their ability to be sustainable across a wider range of institutions. You can visit our project website or contact me directly for more information about the second round. With that being said, thank you, and please feel free to reach out to any of us with any questions. We appreciate you attending our briefing and would like to extend our thanks to the Coalition for Networked Information for hosting. Goodbye.