All right, good morning, everyone, and welcome to the first session of the 2023 DC Archives Fair. I'm very proud to introduce my colleague from the Electronic Records Division, Rebecca Baker, who's going to speak to us today about sharing NARA-developed tools on GitHub that support the accessioning of electronic records. Rebecca is the accessioning branch chief for the Electronic Records Division, and she has developed a lot of great tools to help us with our work. I'll hand it over to Rebecca. Thanks very much.

Thank you, John. Can everyone hear me all right? Okay, good morning, everyone. As John said, my name is Rebecca Baker, and I am the chief of accessioning within the Electronic Records Division at the National Archives. I would like to discuss an effort that has built up over roughly the last decade to make tools and resources available to the public. In March of 2022, our Electronic Records Division worked with the Office of Innovation within NARA to launch a GitHub repository of accessioning support tools. My presentation will review the tools we've made available, the impact of GitHub, and how agencies, organizations, and individuals like yourself can use these tools to document your electronic records, extract metadata, and so forth. I've made sure my presentation slides include the URL for the GitHub repository. Next slide, please. There we go.

So the accessioning support tools GitHub repository shares tools that the National Archives has developed. Our whole goal was to help agencies prepare permanent electronic records and metadata for transfer to NARA. Currently, we have four tools with job aids available for download.
Here is a list of the available tools, and I will go into greater detail in the presentation about what each tool does. There are a variety of drivers behind why I'm here today, why we developed these tools, and why we've made them available to the public. First, when NARA developed Bulletin 2014-04, the format guidance for the transfer of permanent electronic records, published in January 2014, the scope of file formats deemed acceptable and suitable for long-term preservation expanded roughly tenfold. As a result of this expansion in the types of file formats the archives would expect to receive, the Electronic Records Division formed a tools research and development group to identify tools to support processing and access needs in compliance with NARA regulations and requirements. Afterwards, NARA Bulletin 2015-04 established metadata guidance for the transfer of permanent electronic records. This defined a minimum set of metadata that must accompany all transfers of permanent electronic records: nine different elements that NARA deems mandatory. You can find these through links on the Archives website. Most recently, you've probably heard of the joint memoranda M-19-21 and M-23-07, issued by OMB and the National Archives, both focused on a government-wide transition to electronic records. As a result of this required transition, essentially the end of paper transfers, agencies have understandably requested that NARA provide resources and guidance to meet these requirements, and the following tools can assist those efforts. Each tool made available on our GitHub was designed to perform a very specific task with a simple interface. Our goal was to develop a simple graphical user interface, or GUI, with about four buttons to select files, perform the task, and export the results.
These tools were developed through a joint effort with computer science students at the University of Maryland, Baltimore County. This was an opportunity for the students to gain experience programming essentially production-level tools for the federal government, and for NARA to expand processing capabilities for the electronic records we've been receiving. Now, a bit of technical detail about the tools. They can be pointed at file directories such as shared network drives or removable media, and if your cloud-based storage platforms are configured to show directories within a file manager like File Explorer, you can direct the tools to your SharePoint, Google Drive, Box, Dropbox, and more. The key requirement is having that configuration where you can view the directories. Over the next four slides, I will cover each tool we've made available. The first tool we added to our GitHub, and the one agencies had been asking for the most, was something that could extract metadata. File Lister was, understandably, the first tool we added. It was designed to extract metadata from all folders, subfolders, and files in a user-specified directory. The metadata elements extracted include the file name, the byte count, the file extension, the modification date, the directory path, and a SHA-256 hash, or checksum, that acts like a digital fingerprint for the file. This tool is responsive to agency requests for support with NARA Bulletin 2015-04, the metadata requirements, and finding aid requirements. NARA itself runs this tool: our Electronic Records Division archivists run File Lister on all the accessions we receive from agencies to establish intellectual control over them.
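The actual File Lister is a Java GUI tool distributed as a JAR; purely as an illustration of the kind of directory walk and metadata extraction it performs, here is a rough Python sketch. The function and field names are my own assumptions, not the tool's:

```python
import csv
import hashlib
import os
import time

def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def list_files(root):
    """Walk a directory tree and collect the metadata elements the talk
    describes: file name, byte count, extension, modification date,
    directory path, and a SHA-256 checksum."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            stat = os.stat(full)
            rows.append({
                "file_name": name,
                "byte_count": stat.st_size,
                "extension": os.path.splitext(name)[1].lstrip("."),
                "modified": time.strftime(
                    "%Y-%m-%d %H:%M:%S", time.localtime(stat.st_mtime)),
                "directory": dirpath,
                "sha256": sha256_of(full),
            })
    return rows

def write_manifest(rows, out_path):
    """Export the results as a CSV manifest for transfer documentation."""
    fields = ["file_name", "byte_count", "extension",
              "modified", "directory", "sha256"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Because the checksum depends only on file contents, a manifest produced before transfer and one produced after receipt can be compared to detect any modification in transit.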
From that run we get a total file count, a byte count, and an understanding of the high-level file formats we have, and then we can use tools like our digital preservation risk matrix and action plans, which are also on GitHub and which I highly recommend exploring, to understand what we want to retain as is and what has a higher preservation risk and would call for conversions or transformations, and so forth. This specific tool also benefits agencies: if they run it, they can understand what files they're transferring, and they can retain a log or manifest of the records they have transferred to NARA. And if the agency runs the tool and then NARA runs it, we can compare the SHA-256 checksums to ensure there's been no modification, change, or corruption over time. That's one of the key value-adds of the agency running this tool as well. The next tool I'd like to discuss is Junk File Finder, which identifies and removes non-record material such as zero-byte files, backup files, hidden folders, and empty folders. Again, users can specify an input directory, configure the find mode, whether they would like to find all of these categories or only, say, zero-byte or hidden files, and run the tool from the GUI. Once the tool is finished running, users see the results within the application window, and they can decide whether to delete any of the files that meet these potentially non-record criteria or simply log the results. They could log the results, run it again, and delete later, so it accommodates whatever your workflow is for removal or culling of non-record material. NARA itself runs tools like Junk File Finder as well.
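Again as a hypothetical sketch rather than the tool's actual logic, the kind of screening Junk File Finder performs could look like this in Python. The category rules below are my own simplified assumptions (for example, treating dot-prefixed names as hidden), and, matching the log-first workflow described above, nothing here deletes anything:

```python
import os

# Simplified, assumed heuristics; the real tool's rules may differ.
BACKUP_SUFFIXES = (".bak", ".tmp", "~")

def find_junk(root):
    """Flag likely non-record material: zero-byte files, backup files,
    hidden entries (dot-prefixed names), and empty folders.
    Nothing is deleted; callers review the log and decide."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not dirnames and not filenames:
            findings.append(("empty_folder", dirpath))
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getsize(full) == 0:
                findings.append(("zero_byte", full))
            if name.startswith("."):
                findings.append(("hidden", full))
            if name.endswith(BACKUP_SUFFIXES):
                findings.append(("backup", full))
    return findings
```

Returning a categorized log rather than deleting in place is what makes the "defensible deletion" pattern possible: the log is retained as evidence before any removal action is taken.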
Once we've run File Lister and understand what we have, if there are a number of zero-byte files, Junk File Finder helps us log them and have a defensible deletion, showing we logged them as such before taking any action to remove them. For agencies, having these files removed by the creating agency prior to accessioning helps filter out non-record material and gives them a better understanding of what they're transferring to NARA. If they're exporting from a system, it can help them understand and talk with their IT staff about the creation of hidden files, or understand their backup files. And once we receive accessions with that material already removed, it's more seamless for us to process and preserve them in our repository. The third tool I would like to describe is called File Compare, which notes the differences between two lists of file names. This tool can expedite the process of ensuring that external metadata matches the files being prepared for transfer. Agencies could use it to compare file lists of accretions: if they've sent us the initial transfer in a series and are now doing an annual follow-up, they could compare their file lists to make sure they aren't sending us duplicative material and are only sending what's unique. This, again, helps the processing of these records. The final tool I would like to discuss is Funny File Name Finder, which we added to our GitHub just about two months ago. It allows users to identify invalid characters, meaning non-standard ASCII characters such as symbols and punctuation marks. The reason it's important to identify these and take mitigation steps is that you can have trouble moving files from one directory or location to another, or attempting to read files, if the full path exceeds the 255-character limit or a file name contains multiple periods.
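These last two tools boil down to compact operations: a set comparison of two name lists, and a scan of names for characters or lengths that cause trouble. A hedged sketch of both ideas follows; the allowed-character set, the 255-character threshold applied to the full path, and all names here are my own simplified assumptions, not the tools' exact checks:

```python
import string

def compare_lists(list_a, list_b):
    """Report names present in one list but not the other, e.g. to
    check an accretion against an earlier transfer's manifest."""
    a, b = set(list_a), set(list_b)
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}

# Assumed "safe" character set; real guidance may allow more or less.
ALLOWED = set(string.ascii_letters + string.digits + "._- ")
MAX_PATH = 255  # common full-path length limit mentioned in the talk

def flag_funny_names(paths):
    """Flag paths whose file name has non-standard characters or
    multiple periods, or whose full path exceeds the length limit."""
    problems = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if any(ch not in ALLOWED for ch in name):
            problems.append(("invalid_chars", path))
        if name.count(".") > 1:
            problems.append(("multiple_periods", path))
        if len(path) > MAX_PATH:
            problems.append(("path_too_long", path))
    return problems
```

For example, comparing last year's manifest against this year's accretion list immediately surfaces anything duplicative, and the name scan flags files likely to fail a copy or open operation before transfer.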
We've noticed, especially in accessioning material from the 80s or 90s, that we would see all sorts of file name conventions and would need to clean them up, or understand why a file wouldn't open in an application that we know should open it. Now I'll transition to a bit of an overview of our GitHub. GitHub in general is an online platform used to share code for open source software projects, meaning any project whose developer has chosen to make the code base available for anyone to review, enhance, or reuse. Open source software is generally available to users at no cost, which is the case for our repository: it is entirely public and free. The only thing we don't guarantee is support. If you have trouble, we will try to work with you, but these are essentially tools we've developed to do one simple, specified task, and we aren't intending to develop them much beyond that. The National Archives maintains a presence on GitHub, which includes a number of open source projects made available over the years. You can access NARA's GitHub profile at github.com/usnationalarchives. That's where the digital preservation repositories are, and where our repository is, so I'd highly recommend looking at the tools NARA has made available. Now I'll go over a bit of the process of how we add tools to our GitHub; this slide has a screenshot of what our repository looks like. The Office of Innovation, which is behind the U.S. National Archives Catalog, manages the GitHub. They helped us by creating our accessioning support tools repository, a designated storage space that contains all of the files and documentation for each of our tools, and the tools can be downloaded and used by anyone. They are Java JAR files; the files themselves are in the bin folder, and the readme PDFs are in the documents folder.
You can navigate to either, and on our home page we have a description of each of the tools. NARA has raised awareness about these resources through a variety of methods. We have shared them in our Bimonthly Records and Information Discussion Group, or BRIDG, meetings; we have worked with Agency Services to issue AC memos, including one posted last month; and we also do direct agency outreach. NARA primarily receives electronic records through direct offers, meaning an agency actually offering the records; we don't have the kind of annual move that is typical for paper records. When an agency makes an offer, we tend to get into a back-and-forth conversation about metadata, file formats, volume, rights, and any restrictions on the files, and that's typically when we encourage agencies to use our tools if they can. I definitely want to give a disclaimer: please follow any IT workflows or approvals that you need in order to use the tools. As Java JAR files, they are not .exe files that require installation, so once you have approval, if you download a tool you can essentially double-click the .jar file and it will open and run. That's why I say, as a disclaimer, make sure your IT is okay with that. When agencies have told us they've gotten approval, it's been seamless with no issues. Some agencies in the past year have given us really good feedback, adopted the tools, and are now working with their records custodians and liaisons throughout their agencies to adopt them as well: the Department of Education, the Coast Guard, the Office of Navajo and Hopi Indian Relocation, and the Bureau of Labor Statistics, just to name a few. Here is my contact information. Please feel free to contact me if you have any questions, or use our general Electronic Records Division mailbox, etransfers@nara.gov.
I would be happy to discuss any of these tools, or anything else about the Electronic Records Division. As a general update, we are working on new tools. We are currently testing updates to the open source tool Apache Tika. Apache Tika is not on the U.S. National Archives repository, but it is on GitHub, and my staff member Greg Lepore has been working with Tim Allison, a data scientist at NASA's Jet Propulsion Laboratory. They have been making some really significant updates to Tika to identify encrypted files, PDF portfolios, and password protection, and to extract the contents of WARC files, or web records, producing really robust CSV outputs. We are then working on a Tika report generator GUI, so you could have essentially a metadata spreadsheet for each record category within a mixed accession, with metadata specific to each category of record, including PRONOM identification from DROID signature files. So we get really into the weeds with Tika. Last month we posted a part-time remote detail on the National Archives internal collaboration network, and we are currently working with individuals at NARA who can write code and develop applications in Java and Python. We are evaluating the libraries we use for these tools to keep them as up to date as possible, making sure the tools are as efficiently coded as possible so they run better, and evaluating whether there are any exploits, like the Log4j exploit that impacted Java applications last year: we evaluated and ensured that none of our tools were impacted and that they were safe to use. I'll open it up if anyone has any questions.

[Audience question.]

That's a great question. I do know that Agency Services within NARA has been extensively testing M365 and working with Microsoft; I take part in monthly meetings with them as a stakeholder to understand these efforts, and I know
they've been testing in a G5 instance. As a disclaimer, the National Archives uses Google Workspace, so we are a little removed from some of that, but it's something Agency Services and our policy and standards counterparts are looking at extensively, because with some of these tools, as you can see, the modification date extracted could potentially be a false date after that upload. So I would highly encourage using these tools now: run them in the network drive environment and capture that pre-migration metadata, so you have it as an authority for what the files actually were when created. And I know our tools don't extract an exhaustive list of all metadata elements; they're a starter, and the records creators or custodians can add metadata like creator, rights, or anything else attached to the records. But I totally acknowledge those are legitimate concerns. Are there any other questions? There is the link for our GitHub repo again, and please don't hesitate to reach out if you'd like to talk about the tools, if you have questions about them, or if you have improvements to our documentation. Thank you all for your interest in this; I greatly appreciate the opportunity to speak here today.

Thank you, Rebecca. I think the ASL interpreters are happy that this is over, too; I think you gave them a workout with some of those acronyms. Apologies, the Plain Writing Act isn't always my strong suit. We are at just around 10 o'clock, and our next session will be at 10:45, so that gives us a fair bit of time. There was a fair bit