Hello. Thank you for joining us for a presentation on using machine learning to expand access to World War II incarceration data. This work is part of a project funded by the National Park Service's Japanese American Confinement Sites grant program, and it's being run through the Bancroft Library at UC Berkeley. Next slide. My name is Mary Elings; I am the interim deputy director at the Bancroft Library, and I'm the principal investigator on the grant. I'm joined today by Marissa Friedman, who is the project manager and digital project archivist on the project at the Bancroft Library, and Cameron Ford, who is the co-founder of Doxy AI, a company formed by former UC Berkeley data science students. He's here to talk about the machine learning work we're doing with them.

So our project was conceived in 2018 to be different from our previous JACS grants, in which we were essentially digitizing and publishing archival records related to the incarceration. For this project, we are going a step further and trying to extract the data held in these records so it can be used for computational research. Next slide. We've been exploring this kind of work as part of our engagement with collections as data, which encourages computational use of digitized and born-digital collections and supports ethical access to the collections and data that we steward. These principles were codified by Thomas Padilla et al. in 2017 in the Santa Barbara Statement on Collections as Data. Our project is in alignment with these principles, and we hope it will also serve as a model for future digital initiatives here at the Bancroft.

The goals of our project are fourfold. First, of course, we are creating preservation images of the unique resources in our collections and preserving those for the long term. Second, we are creating a data set, in this case a more accurate and more complete data set representing the Japanese American incarceration. Third, we are required to engage with community partners to guide our ethical data curation plans; this is part of a co-curation model in which we work with community members on ethical, responsible access workflows for our materials. And finally, we aim to iterate on and implement tools and workflows that expand computational access to our digital special collections, again in line with collections as data principles. At its heart, our project is really about finding a way to efficiently and effectively extract the data held within our digital collections and to increase the use of that data through computational research methods. We'll be talking about the material we're working with in this specific collection and why it lends itself to the machine learning process. At this point I'll turn it over to Marissa to tell us more.

I'll start with a little background on the records we're working with. The WRA, or War Relocation Authority, used a census-type, two-page form known as Form WRA-26, or the Individual Record, to collect demographic, educational, occupational, and biographical data about every Japanese American incarcerated in one of the ten WRA "relocation centers," as they were euphemistically referred to. From 1942 to April 1943, incarcerees interviewed arrivals in the camps to collect information about individuals' ages, birth dates, birthplaces, skills and hobbies, health and physical defects, heights and weights, language proficiencies, educational backgrounds, occupations, and even religions. Next slide, please.
The WRA-26 forms are primarily typewritten, although several thousand forms in the Bancroft's holdings have entirely handwritten responses. Many forms contain stamps, handwritten corrections, strikethroughs, notes, and other marginalia; you can see some examples on the slide. The existing set of Form 26 records at the Bancroft Library, numbering over 110,000, is believed to be the only remaining complete set of Form 26 records organized by camp in existence. Next slide, please.

Data from the forms were coded by incarcerees and other WRA office staff onto early computer punch cards during World War II. At the conclusion of the war, a copy of the punch cards, and the original forms from which they were coded, were deposited at the Bancroft Library. In the 1960s, the library worked with what was then the nascent UC Berkeley computer science department to transfer the Form 26 data from the punch cards onto magnetic tape. The Office of Redress Administration, or ORA, acquired a copy of this data in 1988 to aid in identifying and dispersing reparations to former Japanese American incarcerees. When this work was finished, the file was transferred to the National Archives, and a copy of the data file was also acquired by the Japanese American National Museum in Los Angeles, where it quickly became a popular information resource for those formerly incarcerated and their families. The National Archives published the data file it acquired from the ORA in 2003 as part of its Access to Archival Databases project, where it is referred to as the Japanese American Internee Data File. It currently serves as an authoritative resource for genealogical and statistical information for former inmates and their family members, as well as for social science researchers. Next slide, please.

There are, however, problems with the existing data file that we are hoping to start to address with our project. First, gaps, truncations, and errors were introduced over the course of the many data migrations I just outlined. There are also a number of handwritten annotations and corrections on the original forms that may or may not have been integrated into the data file at NARA. And many fields are missing or not fully represented: the NARA data file currently has 36 available fields, while we have identified more than twice that many distinct data points in the original forms. Information present in the original forms but missing from the NARA database includes things like significant activities, skills, hobbies, educational and employment history, and more. If you compare the top image on the slide, a snapshot of an individual's educational history from their Form 26, with the bottom image, a screenshot of what appears in the NARA data file, you'll see that much of the granular detail (educational institutions, dates, and so on) is missing, and the educational history has been coded to represent only the highest grade level completed. Next slide, please.

Some information was coded to, or collapsed into, a predetermined set of classifications. This is the case with things like occupational categories, which you can compare in the bottom two images on the slide: on the left is a snapshot of what appears in the NARA data file, and on the right is what is represented on that individual's original form.
As illustrated in the top image on the slide, the employment history section found on the original form contains salary information and other crucial details that are missing in their entirety from the existing data set, but that we believe can provide valuable historical data for researchers. What's important to note here is that there is a tremendous loss of detail and missing information in the existing data file. We believe the digitization of these forms provides a new opportunity to bring this data to light, particularly by leveraging the potential efficiency of machine learning. Our goal is to create a new, more granular data set that improves upon the existing data file. Next slide, please.

To give you a sense of the scope of the project, I'll quickly walk you through its major stages from my vantage point as project manager. The first stage is digitization, where the digital project archivist prepares the original forms for shipment to an offsite vendor. Once we have the digitized files, we can begin to apply a trained model to read them. Initially, in the document discovery phase, the library worked with Doxy AI to establish a working data model identifying the fields to be extracted. As the project has progressed, however, we've made quite a few discoveries about the content and structure of the data in the original records, so library staff collaborate with Doxy AI to integrate new observations about the records into the pipeline as we go.

The next stage is testing. As Cameron will discuss, we've adopted an iterative approach to developing the pipeline, which handles sets of Form 26 records one camp at a time, due in large part to the variability in form content and other characteristics of the original documents. For each set of forms from a particular camp, we walk through any anticipated problems, and the Doxy team performs initial testing to make any needed adjustments. Once initial exploratory results are inspected and approved by both Doxy and library staff, the pipeline is frozen in its current state, and all the files associated with that camp are run through a customized OCR pipeline with targeted pre- and post-processing interventions to improve the quality of results.

Then there's data storage: Doxy uploads the extracted data to a shared private GitHub repository, with some personally identifiable information, such as Social Security numbers, already redacted. Library staff provide feedback on results and note areas for improvement in preparation for the next camp. Then there's data cleaning: Bancroft staff use tools like OpenRefine to begin cleaning and normalizing the data, and we may later anonymize sensitive fields based on input from our community advisory group. As Mary mentioned, we'll be working with this advisory group to help us think through any ethical issues pertaining to providing digital access to the data and to the digitized forms themselves. Finally, there's data publication: the data and accompanying documentation will eventually be published in a public GitHub repository, and access copies of the digitized forms will, if deemed appropriate, be made available for public view on UC Berkeley's digital collections platform.
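To pull those stages together, here is a minimal sketch of how one camp's run through such a pipeline might look in code. Everything in it is an assumption for illustration: the `run_camp` and `export` functions, the `pipeline` object and its methods, and the file layout are hypothetical stand-ins, not Doxy AI's actual implementation.

```python
# Illustrative sketch only: the pipeline object, its methods, and the file
# layout are hypothetical stand-ins, not Doxy AI's actual implementation.
import json
from pathlib import Path

def run_camp(camp_dir: Path, pipeline) -> list[dict]:
    """Run every digitized form from one camp through a frozen pipeline."""
    records = []
    for scan in sorted(camp_dir.glob("*.tif")):
        image = pipeline.preprocess(scan)      # targeted pre-processing, e.g. deskewing
        record = pipeline.extract(image)       # field-by-field OCR against the form layout
        record = pipeline.postprocess(record)  # normalize values, redact PII, add QC flags
        records.append(record)
    return records

def export(records: list[dict], out_path: Path) -> None:
    """Write extracted records as JSON for upload to the shared private repository."""
    out_path.write_text(json.dumps(records, indent=2, ensure_ascii=False))
```

The per-camp structure mirrors the workflow described above: the pipeline is frozen before the full run, so every form in a camp is processed under identical conditions.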
Now I'll turn it over to Cameron to discuss the implementation of the pipeline and our results in greater detail.

Thank you very much, Marissa. On behalf of Doxy, we feel very humbled to be part of this project with the Bancroft. It's a vital project for the community, and we're very grateful that this has been our founding project as a group of data scientists who recently entered the field after graduating from the I School at Berkeley. This has been a wonderful project for us to engage on. Marissa has walked you through the journey of the data: from digitization, to bringing the data to Doxy, to the processing, to passing it back, and to what will happen afterwards. The collaboration between the Bancroft and Doxy is critical to the success of this venture, so I want to talk about the value of that partnership and why we at Doxy believe this model will be very useful going forward for similar projects.

One of the things we're really talking about is automation. These forms contain very rich data, and it's challenging to capture all of that data through manual effort, as seen in the previous attempts, which simplified the data quite drastically, likely in order to move through it in an efficient manner. Capturing as much of the data as possible requires a lot of automation, because doing everything by hand is untenable. That is where machine learning comes in. Standard machine learning is the combination of the top two circles here, coding and statistics, and that's where you'd find most standard models in the marketplace. The challenge with these standard models is that they don't incorporate anything about the domain you're approaching. Much of the development in today's world is around automatically reading forms that are in use today, such as tax forms, receipts, and other papers we encounter on a day-to-day basis, not incarceree records from 70 years ago. What this presents is models that aren't prepared to handle the uniqueness of the data, which results in vast amounts of inaccuracy, whether it's reading parts of dotted lines as symbols or vertical breaks in the page as letters. What that returns is a big blob of text and symbols, which may be searchable but is very hard to parse apart, and that can lead to mishandling sensitive data. If you have sensitive data like a Social Security number mixed in with the artifacts introduced by those inaccuracies, you can potentially leave sensitive data exposed, and handling that appropriately is vital for us because of the ethical concerns involved. Lastly, there are a lot of unique fields and unique conventions in these records, some typewritten and some handwritten, and a field-by-field approach allows us to handle each of those appropriately. An example of this is the checkboxes you saw, for instance around languages spoken and understood. The benefit of the Doxy partnership with the Bancroft is that the two organizations are leaning in together, sharing machine learning knowledge from Doxy's side and domain expertise from the Bancroft's side.
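As an illustration of what field-by-field handling can look like, here is a short sketch of two hypothetical handlers: one that scrubs anything shaped like a Social Security number, and one that reads a checkbox group from ink density rather than asking an OCR model to read the marks. The function names, signatures, and the threshold value are assumptions for the sketch, not Doxy's actual code.

```python
# Hedged sketch of two field-specific handlers; the names, signatures, and
# threshold below are illustrative assumptions, not Doxy AI's actual code.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

def redact_ssns(text: str) -> str:
    """Scrub anything shaped like a Social Security number so that OCR noise
    cannot leak sensitive data into downstream outputs."""
    return SSN_PATTERN.sub("[REDACTED]", text)

def read_checkboxes(ink_by_label: dict[str, float], threshold: float = 0.15) -> list[str]:
    """Interpret a checkbox group (e.g. languages spoken and understood) from
    per-cell ink density instead of OCRing the marks as characters."""
    return [label for label, ink in ink_by_label.items() if ink > threshold]
```

For example, `read_checkboxes({"Japanese": 0.42, "English": 0.03})` would return `["Japanese"]`; treating the marks as ink rather than text avoids the dotted-line and stray-symbol errors described above.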
There's a deep understanding of this field and of these forms on the Bancroft side that is vital to developing a pipeline that handles the data in a sensitive manner. What we then do is go field by field: we develop a pipeline where, for each individual piece of data, we ensure we're capturing it as accurately as possible. We also have a process for handling sensitive data, scrubbing things out as appropriate. And what's really great about the field-by-field approach is that we can put specific QC processes in place, such as setting a dictionary of the terms we would expect in a particular field. Or we can compare the stated age against the date of birth on the form, and if they disagree we can flag the record: we probably read this incorrectly, and someone needs to review it. So we can add these checks into the data and look at each field uniquely. At the end of the day, what it really comes down to is saving time and cost, because we know that funding and time are challenging in this field, so it's vital that we partner together to find the most efficient ways to make this data accessible.

The process, as Marissa mentioned, is really a collaborative one. Broken down into its simplest form: we start by bringing the data in, designing how we want the outputs to look and what we need to do to handle the data as accurately as possible, and then implementing a model. The next step is to measure and reflect. That's where the QC process comes in: every time we run a new model, we look at how our results compare to the last run and put checks in place on a certain number of fields to ensure we're improving. After that measurement, we assess the results, and that's when we come back together as a greater team and start to ask what is happening here and what stands out to us. An example: we found that one of the camps actually used two types of forms, so we created a model that identified which form was being used and, based on its result, routed the document through a different pipeline. That's the sort of discovery you would miss with a standard model; with a custom pipeline, you're able to develop iteratively and capture those cases.

The result, we feel, is a very high quality of output. At baseline, we hand back to the Bancroft structured data that's ready for research. We've set a threshold that we expect each field's accuracy to meet; we measure how accurate we are on every field, and 90% of the fields are above the expected results, with quite a number much higher than that bar. Just a few fields present unique challenges, and we're constantly iterating to improve them. A quick bit of back-of-the-napkin math on the time savings: with over 100,000 of these forms at roughly 15 minutes per form, that adds up to over 13 years of manual transcription effort, but in roughly six months of work we've been able to extract a large amount of that data. So this is saving a massive amount of not just manual labor but of time in getting this data to the community in an appropriate manner.
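Here is a minimal sketch of what such field-level QC checks might look like, assuming hypothetical record fields named `religion`, `age`, `date_of_birth`, and `form_date`; the expected-term vocabulary and the one-year tolerance are invented for illustration, not the project's actual rules.

```python
# Illustrative QC checks; the field names, vocabulary, and tolerance are
# assumptions for this sketch, not the project's actual rules.
from datetime import date

EXPECTED_RELIGIONS = {"Buddhist", "Protestant", "Catholic", "None given"}

def qc_flags(record: dict) -> list[str]:
    """Return human-readable review flags for one extracted record."""
    flags = []
    # Dictionary check: is the extracted value a term we expect for this field?
    religion = record.get("religion")
    if religion and religion not in EXPECTED_RELIGIONS:
        flags.append(f"religion: unexpected term {religion!r}, review OCR output")
    # Cross-field check: does the stated age agree with date of birth and form date?
    dob, form_date, age = record.get("date_of_birth"), record.get("form_date"), record.get("age")
    if dob and form_date and age is not None:
        implied = form_date.year - dob.year - ((form_date.month, form_date.day) < (dob.month, dob.day))
        if abs(implied - age) > 1:  # allow a year of slack between interview and form dates
            flags.append(f"age: stated {age}, implied {implied}, review")
    return flags

if __name__ == "__main__":
    print(qc_flags({"religion": "Buddhlst", "age": 31,
                    "date_of_birth": date(1911, 5, 2), "form_date": date(1942, 10, 1)}))
    # -> ["religion: unexpected term 'Buddhlst', review OCR output"]
```

The point of checks like these is exactly what Cameron describes: they don't fix errors automatically, they surface likely misreads so a human reviewer knows where to look.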
And then, of course, through this process we're adding custom flags for expected errors to help with the next step, which is to iterate on this data further, clean it, and ensure that it's up to standard. So at this point I'll turn it back over to Marissa to talk about the conclusions. Next slide, please.

Just to give you a sense of what the results look like coming back, here's a quick snapshot of the CSV and JSON formats. With the pipeline in place, we get structured data back, as Cameron mentioned. Prior to this, we really only had very basic and very messy OCR text files for these records, which, as Cameron mentioned, are somewhat searchable but don't provide the same sort of research-ready, high-quality data for users. So again, that just reiterates the benefits of this process. Next.

Okay. We've learned a lot so far, and the project isn't over yet. The successes really have been built on leveraging our partnerships. All the work has been iterative; we've been working closely with Doxy AI, and they bring the technical heavy lifting while we bring the domain expertise. It's been a really great partnership, and we've learned a lot along the way. We also have partnerships we're looking forward to for data cleanup and for working with other data sets that are out there or being developed, for things like name correction and other kinds of cleanup.

Another part we're really interested in working on at this moment is our community co-curation model. This is our community advisory group meeting that we're holding next spring, where members of the community will come together to talk with us about responsible access. UC Berkeley has developed responsible access workflows as part of our digital lifecycle program, and our Office of Scholarly Communication Services will be part of that meeting, walking community members through what we have come up with as responsible access workflows. We really want to engage and involve the community in this work as we do it, so that we can be thoughtful and responsive to those whose data we steward and to how we make it available to others.

There's also the question of the scalability of machine learning versus our other, earlier ways of extracting data, and we need to look at what we've learned from this project. The size and quantity of a resource can tell you whether or not machine learning is going to make sense: is it a big enough set of records for the model to learn from? The same goes for content and structure. We thought we had forms that were all the same shape and size, and it turns out that's really not the case, because the form was modified in the middle of World War II; things changed and were replaced when a question was added, and while the two versions look very similar to the eye, to a machine that difference really matters.
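To make the contrast between the old OCR text files and the structured outputs Marissa mentioned a moment ago concrete, here is a purely illustrative example. Every value below is an invented placeholder, and the schema simply echoes the hypothetical sketches above rather than the project's actual output format.

```python
# Purely illustrative contrast; every value is an invented placeholder and the
# schema is a hypothetical sketch, not the project's actual output format.

# What a generic OCR pass might return: one flat, noisy string per page.
raw_ocr = "EDUC.ATION | | EXAMPLE GRAMMAR SCH .. 1921 --- 1929 | , |||"

# What a field-by-field pipeline hands back: one structured record per form.
structured_record = {
    "camp": "EXAMPLE CAMP",
    "education": [
        {"institution": "EXAMPLE GRAMMAR SCHOOL", "from_year": 1921, "to_year": 1929},
    ],
    "social_security_number": "[REDACTED]",
    "qc_flags": ["age: stated 31, implied 30, review"],
}
```

The flat string is searchable in only the loosest sense; the structured record can be queried, aggregated, and cleaned field by field, with sensitive values already scrubbed and review flags attached.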
Working closely with Doxy helped both of us understand that, and working with them to make their pipeline more dynamic has been a really great experience. But again, it makes us think about how much consistency matters when we're looking at using machine learning to extract data from an archival or historical resource.

And finally, thinking about the collections as data side of this: we want to extract data from the resources we're digitizing and from the born-digital collections we're bringing in, but collections as data principles don't really tell us how far to go in cleaning up that data. How clean does it need to be? Do we do entity extraction? Do we do other sorts of correction? How far do we go, and how do we document each change? The question of what research-ready data looks like has always been a big part of my research interest, and as we do this project, that's another question we hope to answer at some level, though I'm not sure we will. We're going to keep looking at that as we apply this kind of process to different sorts of collections.

So, we have a ways to go yet. We hope to talk again in the future about our community advisory group meeting after it has taken place, and to follow up on the final project. It's been a great learning experience so far, and we're really happy to answer any questions if anyone wants to reach out to us. Next slide. Here you have our contact information. Thank you for joining us. We appreciate your time, and please do let us know if you have any questions; we'd be happy to answer them. Thank you.