Okay, it's time to get started. Welcome to this project briefing, which is part of CNI's 2020 virtual meeting. I'm Cliff Lynch, the director of CNI, and I'm really just here to welcome you and to welcome Cal Lee from the University of North Carolina School of Information and Library Science. Because we're doing this as a virtual meeting, Cal is able to be joined by his collaborator, Kam Woods, so you will get to enjoy not one but two speakers as part of this presentation. We will take questions at the end, moderated by my colleague Diane Goldenberg-Hart; when you have questions, please use the Q&A button at the bottom of your screen. Please also feel free to use the chat as we go along. And with that, welcome. Thank you for doing this, and over to you, Cal and Kam.

Thank you very much, Cliff, and thanks so much to CNI and to everyone who has been involved in organizing this. I know it took a lot of switching around on very short notice, and I'm looking forward to this couple of months of really exciting talks. Again, I'm Cal Lee at the School of Information and Library Science at UNC Chapel Hill, and Kam Woods, also at UNC, will be joining me in a few minutes as we switch back and forth during this presentation. I'm going to be talking about a project called RATOM, which stands for Review, Appraisal, and Triage of Mail. The primary motivation for all of this is that a lot of digital curation tasks and functions, and archival ones more broadly, have been significantly supported by research and development in software and hardware environments over the past few decades, but appraisal and selection is arguably not one of those areas.
We haven't had a lot of progress on facilitating, supporting, and enacting appraisal decisions through the use of software, despite advances in so many other areas. But we know that selection and appraisal decisions are based on various patterns, and when patterns can be identified algorithmically, that's a great opportunity to use software to bring them to the attention of the human being who is making decisions. Libraries, archives, and museums frequently want to take actions that reflect various contextual relationships, which is also something we can tap into. And timeline representations and visualizations, various ways to present the information, can provide useful high-level views of the materials.

What would be the motivation for looking at email? Well, we've been creating it for more or less 48 years. Hundreds of billions of messages are generated every day. Most of them, we know, have relatively little long-term retention value, just like so many other records that people create in contemporary society. But some of them have a huge amount of value in documenting current activities, whether in the private sector, the public sector, the nonprofit world, or people's individual lives. There's almost not a day, even in these current conditions, that we don't have at least one news story about something being disclosed or documented through email. And despite the presence of numerous other modalities (we're interacting through Zoom right now, there's chat, there are all kinds of environments people interact in), email still plays a really fundamental role. Email messages are also often found in collections and acquisitions alongside other types of materials.
So that's something we really try to be attentive to in this work as well. If you're familiar, for example, with the work that Kam and I and other collaborators have been involved in over the past decade related to the acquisition of data off of disks and similar media, the reality is that collections very often involve multiple types of materials, including email.

A little bit about RATOM and why this odd acronym. The words really do represent what the project is about: Review, Appraisal, and Triage of Mail. But it's also a little nod to Ray Tomlinson and the role he played in introducing email to all of us, either a hero or a villain in the story, depending on your feelings about email. The project is funded by the Andrew W. Mellon Foundation; it started at the beginning of last calendar year and ends at the end of this calendar year. We're developing and repurposing software, including machine learning and natural language processing, for selection and appraisal. This builds on top of existing environments such as the BitCurator environment, which has also been developed through funding from the Andrew W. Mellon Foundation, and it hooks into and enhances other software such as TOMES, which came out of a project at the State Archives of North Carolina, one of our primary collaborators. We're also looking to support what we've been calling iterative processing: the idea that as more information unfolds, you can make more and more decisions about things like appraisal, what the content is, how it should be described, and what should be retained. That includes mapping of timestamps and entities (by which we mean named entities that can be identified with natural language processing software), sensitive features (things that might need further review or redaction), and other elements across the tools.
The team members are myself and, there at the bottom, Kam Woods, who will be speaking to you in just a few minutes. Also at UNC Chapel Hill are Alicia Kinder, our project manager, and our software engineer. Then we have three partners at the State Archives of North Carolina: Camille Tyndall Watson, Jamie Patrick-Burns, and Sangeeta Desai. And finally, we have a set of collaborators at a company called Caktus Group, who are doing software development on an interface that we'll be showing you at the end of the presentation.

The scope of the project includes several core development goals designed to serve specific needs of collecting institutions that are dealing with email. We're fairly focused on the use case where someone has already pulled the email down from a server into something like a PST, OST, or mbox file, as opposed to pulling things live off of, for example, an IMAP server. The whole tool chain can be applied after that point, but we're not ourselves developing tools to pull information off of the server itself. The goals are: developing utilities to support entity identification and to export reports suitable for conducting automated and human-directed redaction actions at scale (and "at scale" is a very important part that Kam will be talking about); developing an interface that allows processing archivists, librarians, and museum curators to browse email collections and mark messages as suitable for retention or as needing further review for sensitivity; and developing utilities that apply machine learning to support the kinds of inferences that human beings are making about the materials. From the very foundation, the idea is that this is computer-assisted processing and appraisal; it is not automation of these activities without a human in the loop.
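As a toy illustration of that computer-assisted, human-in-the-loop model, the sketch below flags messages whose text matches a simple sensitive-data pattern and queues them for a person to review. The pattern and field names are hypothetical; RATOM itself uses NLP-based entity identification rather than bare regular expressions.

```python
import re

# Hypothetical stand-in for sensitivity screening: a pattern that
# looks like a US Social Security number. Purely illustrative; the
# real tools rely on NLP entity extraction, not this regex.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

messages = [
    {"id": 1, "body": "Lunch at noon?"},
    {"id": 2, "body": "Applicant SSN is 123-45-6789, please file."},
]

# The software only *flags* candidates; a human makes the final call.
needs_review = [m["id"] for m in messages if SSN_PATTERN.search(m["body"])]
print(needs_review)  # [2]
```

The point of the design is visible even at this scale: the output is a worklist for a person, not an automated redaction decision.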
So what I'm going to do is stop sharing my screen and hand this over to Kam. If there are any questions during the transition from me to Kam, we can certainly answer them. And as Kam is bringing that up, I'll point out that the first slide included a link to where you can get the slides; it's also in the chat, which you should be able to see through Zoom, so if you want to revisit any of these slides after the fact, feel free to do that.

Great. Can everyone hear me, Cal? Yep, I can hear you. Great. I'm going to go through these fairly quickly; again, as Cal said, you can get the slides, there's a lot of information on them, and we'll leave some time for questions at the end. To support the tasks Cal talked about, we have two parallel tool sets in development. The first is libratom, our core library. This is essentially a tool that draws together a variety of supporting libraries to allow us, in an integrated way, to process PST, OST, and mbox email formats, extract entities from the contents, and generate research-quality data sets very quickly, for arbitrarily sized collections. There's a link at the bottom to the GitHub repository, where you can find detailed information on all of that. The second tier of work is the iterative processing interface Cal was referring to, which draws on the core functionality of that library and provides a web console that allows users to mark messages within collections for redaction, retention, open access, and so on. There's a GitHub link for that as well. The background here, again as Cal said, is that lots of institutions have unprocessed PST and mbox files.
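To make the ingest step concrete, here is a minimal sketch of walking an mbox store and indexing header metadata, using only Python's standard library `mailbox` module. This is not libratom itself (which also handles PST and OST via digital forensics libraries); it only illustrates the mbox case, and the sample data is invented.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path

# Build a tiny throwaway mbox file so the sketch is self-contained.
mbox_path = Path(tempfile.mkdtemp()) / "sample.mbox"
mbox_path.touch()
box = mailbox.mbox(str(mbox_path))
for sender, subject in [
    ("alice@example.com", "Q3 budget"),
    ("bob@example.com", "Server outage"),
]:
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "records@example.com"
    msg["Subject"] = subject
    msg.set_content("Hello from " + sender)
    box.add(msg)
box.flush()

# First-pass ingest: walk the store and collect the header metadata
# an appraisal tool would index (sender, subject).
records = [
    {"from": m["from"], "subject": m["subject"]}
    for m in mailbox.mbox(str(mbox_path))
]
print(records)
```

A real pipeline would then push records like these into a database and run entity extraction over the bodies, which is the part of the workflow described next.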
We need better ways to quickly ingest those data sets and do two things: first, generate research-quality data sets that we can perform machine learning tasks or various types of analysis on, and second, feed those into other interfaces like the iterative processing interface. Our tool set uses some core libraries from digital forensics and an industry-standard open source NLP platform, and the goal is to produce data sets that can be verified, reproduced, and reused, and that can quickly be transferred between environments irrespective of how they were produced. We dump the contents of these data sets into fairly easy-to-use SQLite databases; I have an upcoming slide that includes the schema, but I'm not going to spend much time on it here. The motivation is that generating these research data sets gives us a powerful way to expand our use of these email sets: we can push the data into machine learning tasks, statistical analysis tasks, and so on. To do this, we want tools that can reliably and efficiently extract the features we're looking for; that support swapping in different models (which our tool does) for different languages or different entity types; that scale to large collections, and we've put a significant amount of engineering into this type of scaling; and that allow you to export the data in useful formats. I've left the schema slide in just so people have a reference, but I'm not going to discuss it directly here. To give you an idea of the kind of scale this tool works at, here is the core libratom tool run over the Enron corpus, which is about 54 GB, about 800,000 messages, in almost 200 files.
We can scan the entire PST structure in 30 seconds. We can generate a simple SQLite3 report without any entities, just the core structures and core metadata (attachments, attachment metadata, and so on), in about 12 minutes. And we can extract almost 20 million entities in about two hours on a 16-core system. So we can do this very quickly on relatively modest desktop hardware. That gives us the ability to go back and tweak the parameters on these runs, run the tool multiple times on different systems, and so on, without worrying about using a lot of compute resources. Here's an example of the kinds of entities the tool pulls out. If you're familiar with spaCy, you'll note there are about 18 different entity types it can generate, including organizations, people, geopolitical entities, and so on. The quality of this data is quite high: even using a small pre-trained model, we get an F-score of about 85 on open text data in email. So there is some noise, but for non-individual email tasks, where a human is not going to be making an analysis decision on a single message directly, this is fairly good quality data to support large-scale ML tasks and other types of analytics. You can find all of these tools on GitHub; our releases are pushed directly to PyPI, so they're very easy to install; and we have some interactive notebooks you can play with as well, which show some representative features of these tools in an interactive environment that doesn't require you to install anything. And I'll turn it back over to Cal for the last section.

Thank you, Kam. So this is the processing interface. We had mentioned Caktus Group, who are in Durham, not too far from us in Chapel Hill.
And they've been working with us to develop this processing interface. The idea is that it requires very little installation on the part of the client; somebody just needs a browser, although there is server software running in the background to facilitate all of these actions. It's basically the next step in the tool chain after the things Kam was talking about: the metadata is extracted from the PSTs, timestamp metadata, how many messages there are, how many attachments, and all those sorts of things are identified very efficiently and then fed into this interface, so that someone making decisions can search over, navigate, and tag the messages. So the features are: direct import of email corpora from PST or mbox files, with automated entity identification (by entity, again, we mean things like organizations, events, and people, identified by the spaCy software Kam mentioned earlier); creation of processing accounts associated with individual email stores and backups; interactive review and tagging; and export of selected messages for retention and/or release. Our two main use cases are, first, archival appraisal and processing, where you're making decisions about the email and trying to flag messages as something to retain or something to review. The other use case, which comes from our work with the State Archives of North Carolina, our primary partners, is fulfillment of open records requests: if somebody comes and says, give me all of the email relevant to this particular matter, you can conduct the queries and then extract the email from the subset you've identified. As I walk through this, you'll see that it's based on an essentially faceted browsing model. You can identify things like everything in a particular folder (that's functionality being rolled out in the next release), narrow things down, and then select messages for export or for tagging.
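The faceted narrowing and bulk tagging described above can be sketched in a few lines. This is a hypothetical in-memory stand-in, not the interface's actual data model; the field names and status values are illustrative only.

```python
from datetime import date

# Illustrative message records; the real interface stores these
# server-side and exposes them through a web console.
messages = [
    {"id": 1, "folder": "Inbox", "sent": date(2001, 5, 4), "status": None},
    {"id": 2, "folder": "Inbox", "sent": date(2001, 9, 1), "status": None},
    {"id": 3, "folder": "Sent",  "sent": date(2001, 5, 20), "status": None},
]

def facet(msgs, folder=None, start=None, end=None):
    """Apply each facet only when the caller supplies it."""
    for m in msgs:
        if folder and m["folder"] != folder:
            continue
        if start and m["sent"] < start:
            continue
        if end and m["sent"] > end:
            continue
        yield m

# Narrow to Inbox messages sent by the end of May 2001,
# then tag everything that matches in bulk.
for m in facet(messages, folder="Inbox", end=date(2001, 5, 31)):
    m["status"] = "needs_review"

print([m["id"] for m in messages if m["status"] == "needs_review"])  # [1]
```

The same narrowing step serves both use cases: the matched subset can be tagged for appraisal or exported to satisfy an open records request.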
So at this level you see the individual accounts, and the green boxes are entities that have automatically been identified by spaCy and associated with them; for example, there's an organization name, and there's a date associated with it. In this view you're seeing the various accounts associated with imports of one or more imported PST files, along with their status: have they been pulled in, did the import fail, are they still in process. Then there's the top-level message review within an account, going into this particular person's email and seeing the email messages; that's the view you saw in the lead-in slide, with the organization, law, event, and date entities. The green tags are automatically assigned. The item you see labeled status is interactive: if I were running the software, I could click on it, and there's a drop-down menu with a set of choices, so you could say this is an open record, this is non-record content, this is something that shouldn't be publicly accessible. Those decisions can be made at the individual message level or in bulk, by selecting all of the messages that conform to particular criteria. Next is an individual message being viewed within the client. You click down to the individual message level, and you can see in the top right corner that the open record status has been applied: someone has gone through the process and determined that this message falls into the open record category and can be shared with the public. You can see that in this case the message is unformatted. One of the design decisions any time you render email is how it should actually be presented on screen, because email is often formatted in rich text, HTML, and plain text all at the same time.
This is selection by classification. For example, you could narrow down to show only the restricted content, from only certain email addresses, over a given date range; again, it's a faceted model where you narrow down based on your criteria, and the results can then be something you flag for retention, review further for archival description, or provide access to if they're part of an open records request. Then there's some of the administrative back end, which shows you, for example, when actions are taken. There's an audit log, so if somebody changes the restricted status of a message, that switch of the flag from true to false is indicated in the audit history.

As a reminder, here's where you can find information about our project. ratom.web.unc.edu is the general project website; that's more about, you know, announcements of events and so on. For the software, the place to go is GitHub, where you can find the documentation, the walkthroughs, and the software to download. As Kam mentioned, we have Jupyter notebooks; if you're not familiar with them, they're a nice way to break down a set of code within a web page, so you can go cell by cell and execute the code to see what it's doing without having to install anything on your machine in advance. And then there are the processing interface deployment tools, which are in earlier development than the others; as a reminder, we're four months into the second year of the project. A lot of the continuing work on this project is going to be testing, deploying, and using that interface, but also feeding all of the things we've just talked about into machine learning processes, so that's much of the work forthcoming in the remainder of the project.
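The append-only audit history described above can be sketched as follows. This is a minimal stand-in under assumed names; the interface's actual storage model and field names may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Minimal sketch of an audit trail: every flag change appends a
# record of who changed what, from what value to what value, and when.
@dataclass
class Message:
    id: int
    restricted: bool
    audit: list = field(default_factory=list)

    def set_restricted(self, value: bool, user: str):
        self.audit.append({
            "user": user,
            "field": "restricted",
            "old": self.restricted,
            "new": value,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.restricted = value

msg = Message(id=42, restricted=True)
msg.set_restricted(False, user="processor1")
print(msg.audit[0]["old"], "->", msg.audit[0]["new"])  # True -> False
```

Because entries are only ever appended, the history preserves the full sequence of decisions even after a flag has been flipped back and forth.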
So that's what we have prepared; we'd be very happy to answer questions. Maybe I'll leave that slide up temporarily, in case anybody wants to follow the links, and again, feel free to download the slides that were shared with you in the chat.

Wow, that was really fascinating. Thank you, Cal and Kam, for that really interesting talk and for your extraordinary work on this amazing tool. It's a really phenomenal contribution, and we're really delighted that you came to CNI to share some information about this work that you've done. With that, I'd like to invite our attendees to submit any questions they may have. At the bottom of your screen you should see a little box that says Q&A; please feel free to submit your questions there, and I will pass them along to our panelists. While we're waiting for some questions to come in, just a little housekeeping on my side, and I should start by introducing myself: I'm Diane Goldenberg-Hart with CNI, and you have all probably seen my name in your inbox more times than I wish to remember, so apologies for that. I want to welcome you to CNI's spring virtual meeting; I'm so glad you could join us for this project briefing, and I hope you will make time to join us for the many more that we have planned. I'm sharing a link to our schedule in your chat box now, if you want to check out what we have coming. And I see that we have a question that I want to let Cal and Kam address. The question is: are there ways to direct curation tasks to different individuals?

Feel free to jump in on this too, Kam, but yes: in terms of the web interface you saw near the end of the talk, part of the functionality built into it is the ability to set up different accounts. Right now I think it's limited to an administrative account and a typical user account.
If you wanted to specify permissions in more detail than that, it would require some additional work, and we're very interested in hearing people's thoughts about how that division of labor should be implemented. But I believe that's correct, right, Kam?

I think that in terms of that particular part of the software, the two categories are basically admin and general user. If we wanted to restrict workflows, sort of forensic-style, where certain people work only on very specific sections, that's not something that's explicitly part of the interface right now. We do have a fairly large backlog of modifications and improvements to the software; ultimately it's about funding and time, and this was our initial eight-week effort to develop the interface to begin with.

And in terms of prioritization, the reason a more fine-grained breakdown hasn't been in our short-term plan is that both the State Archives and the other partners we're working with most closely usually have a relatively small set of people, like one or two, doing this work. So it's not a high priority to have a very fine-grained set of roles, but obviously that's something we can revisit.

Let's see: what else do you see in the email preservation area, and how does this fit in with those other interests and developments? Great question; hi, Don. There was a report funded by the Andrew W. Mellon Foundation, which I'm sure many of you are familiar with, from a task force looking at issues related to email. This project was greatly inspired by that report. There's another project I'm involved with, which will be disseminating its work quite soon, that looked in a very exploratory way at ways to better facilitate generation of PDF from email.
So if the only option you have, for example in a government agency, is to generate PDF from the email, that work aims to further specify the guidance for how you should actually generate it, so that it will be most easily navigable and machine-processable over time. Many of you are, I'm sure, familiar with ePADD, which is work that's been ongoing. There are some differences, both in the primary use cases and in the design approaches, between, for example, ePADD and RATOM, but we collaborate very closely with each other: members of the ePADD group are on our advisory board, we've been on their advisory boards, and we work with them quite closely. One of the distinct differences, which you saw when Kam was showing that database structure, is that when you run tools like libratom, or the web interface where you do the tagging, all the data gets written to a very simple SQLite database structure that can then be pulled in by other software. We're trying to make it as low lock-in as humanly possible: you run the tool to do the processing, but then you can rely on other ways to query and access the data. That's motivated largely by one of our main use cases, supporting machine learning; we don't want you to have to sit in a particular environment to do those things. We want the data to be really easily machine-readable, so that you can pull it into other processes. There's definitely quite a bit going on, and it has been pretty heavily influenced by the task force report I just mentioned. It's quite bizarre, in a way, if you come into the library and archives world from the outside, to realize how little attention and how few resources have actually been put into the curation of email, given how long people have been using it.
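The low lock-in handoff just described can be illustrated with the standard library alone: any downstream system that can read SQLite can re-serialize the data for its own use. The table and column names here are illustrative, not libratom's actual schema.

```python
import json
import sqlite3

# Stand-in for a processing database; real output would come from the
# tagging tools, not be inserted by hand like this.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE message (id INTEGER, subject TEXT, status TEXT)")
con.executemany(
    "INSERT INTO message VALUES (?, ?, ?)",
    [(1, "Q3 budget", "open_record"),
     (2, "Personnel note", "restricted")],
)

# Export only messages cleared for release, ready to hand off to
# another repository or a machine-learning pipeline.
rows = con.execute(
    "SELECT id, subject FROM message WHERE status = 'open_record'"
).fetchall()
payload = json.dumps([{"id": i, "subject": s} for i, s in rows])
print(payload)
```

Because the query layer is plain SQL and the export is plain JSON, nothing about this handoff requires the consuming system to run the producing tools.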
I think when it comes to the core preservation issues, as opposed to these larger curatorial ones, a lot of them are shared with any other preservation problem: software dependencies, and what architecture you're going to use to save the materials. For example, PST files themselves are kind of a nightmare and probably not the best preservation format. Then you have the issues of how you write the data out, and which structures you attend to and which you don't. But it's a changing landscape in which there's been quite a bit of activity in the past couple of years, so I would say stay tuned.

Really interesting; thank you. Thanks very much, and thanks to you, Don. If we have any more questions, please go ahead and type those in. I should also say that in this environment we have the option to hand the mic over to anyone in the audience who is interested in engaging directly with Cal and Kam, either for a direct question or a comment. If you would like to do that, please raise your hand and I'll move you into the mode where your microphone can be turned on. While I'm giving folks a couple more minutes to think about what they may want to ask, another little plug for our meeting: we have recently put out a new call for project briefing proposals for the virtual phase. We are soliciting proposals on issues having to do with the current crisis; if you are working on issues related to the COVID-19 crisis and related topics, I would urge you to go ahead and submit a proposal. We're trying to roll these out quickly during our meeting, and the meeting will last through the end of May, so we have a lot of time to think about the issues that are really coming to the fore for our community. Right there in your chat box you should see a direct link to the proposal form and the call that describes what we're looking for.
I'm sorry, I didn't want to interrupt. Thank you, Diane. I just wanted to add one more thing that occurred to me from that earlier question, about the database structure we've created. We've been working very closely with users of Archivematica and Preservica, as examples, to make sure that however the data get generated, they can be pushed into these other spaces. This is a consistent theme of the work we've done through support from the Mellon Foundation, whether it was BitCurator, BitCurator Access, or BitCurator NLP: we're always trying to create a processing environment that allows you to relatively easily hand things off to ArchivesSpace, Fedora, or whatever it might be, as opposed to having to sit inside that environment the whole time. The reason I brought this up is that we can only really facilitate those handoffs when there are people out in the profession experimenting with the tools and telling us what works and what doesn't. So if you're using any of these tools and the outputs from them don't conform to what you need in order to go into some other environment, it's not going to offend us to hear that; we need to hear it, so that we can figure out how to support those kinds of processes. We also have about eight months of development left on this project, so there are changes coming as well. We're a relatively small team, and we rely a great deal on people in the profession trying out the software and telling us what's broken, what's useful, what needs to be fixed, so please let us know.

Cal and Kam, do you have that slide with your contact information? Maybe you want to put that up. Yeah, sure, I'll go ahead and bring it back up again. So thank you both again. There we go: ways to get in touch with Cal and Kam.
If you want to talk to them further, or if you've got some feedback or ideas, there's how to reach them. Thanks for sharing that. Before I close out this webinar, I want to let Cal and Kam offer any last closing thoughts.

I would just say, certainly stay in touch, and let us know if any of these things are useful to you in ways that we expected or didn't expect, or if there are things you'd like us to prioritize that would make the software more useful to you.

I think we've already mentioned this, but it's all free and open source software, so if you hate what we're doing but love our code, grab it and fork it and do something else with it. We hope people don't do that, but it's all out there for people to use. As has been the case with all of this work we've done through support from the Mellon Foundation, all the code is available through GitHub.

That's great. All right, thank you both so much. Unfortunately we can't hear all the applause that I know is happening in living rooms across America right now, but thank you so much for coming and sharing all of this with CNI. We appreciate you spending your time here, Cal and Kam, and to all of our attendees: be well, take care, and we hope to see you again very soon. Bye bye. Thank you, everyone. Thanks, Cal and Kam. Take care, all.