Again, I'm Ryan Cordell, an associate professor in the English department at Northeastern University and a core founding faculty member of the NULab for Texts, Maps, and Networks, which is our center for digital humanities and computational social science. Today I'm doing something that's a little bit odd for me as an English professor, which is to say that I'm reporting on work that's already done. This is not what we typically do at our conferences, where we usually report on work in progress. But I was talking to David Smith, my collaborator, and he said, well, this is what computer scientists always do, so it felt very normal to him. David is on sabbatical this year, and so he's not able to join me here. But I wanted to say that the work I'm describing was a joint project between David and me. You will see moments in the report where there are elements that I primarily authored and elements that he primarily authored; I'm going to try to translate all of that to you as best as I can. I also wanted to say that Bill Quinn, who is a PhD student in the English department at Northeastern, was very important to the work that we did. The project was funded by the Andrew W. Mellon Foundation and also supported by the NULab as we were working on it last year. So the report that I'm talking about comes out of conversations that we now know were happening among lots of different groups, including the Mellon Foundation, the National Endowment for the Humanities, and the Library of Congress. Essentially, as they talked about the many digital humanities and other projects they had funded over the past few years, they noticed that their grantees were reporting, over and over again, some of the same problems with the optical character recognition underlying the data for a lot of the text mining and other projects being done. They noticed not only that lots of project members were reporting delays, and even questions they were unable to answer, because of the OCR, but that many of the problems being reported were the same problems, so that essentially this wheel was being reinvented, or failing to be reinvented, over and over again across lots of different projects. And so they were interested in commissioning a report to try and sum up: where are we? Where is OCR these days for humanistic materials in particular? What are the common struggles that lots of researchers and others are facing? And what steps could be taken in the future to begin to address some of these common struggles? I will say the theme you're gonna hear over and over in this talk really resonates with a lot of what Kathleen was just talking about, which is to say that much of this has to do with getting different communities talking and working together who have not been in the past. So anyway, they reached out to David and me, and I'll explain why they reached out to us, and asked who would be a good person to write such a report. And essentially we somewhat self-servingly said, well, why don't we do it? I'll explain why we said that in a second. But before I go too far, I'm gonna assume that in this room probably most people understand what OCR is, but I also don't wanna just barrel ahead as if everyone does, in case there's someone here who's not quite sure what I'm talking about.
But really briefly, I really like Rose Holley's plain-language definition of OCR; I think it's one of the best out there. My own attempt at something like it: optical character recognition is a type of artificial intelligence software designed to mimic the functions of the human eye and brain in discerning which marks on an image represent letter forms or other markers of written language. So essentially we have an image, a scan, and we want to know what text is on it and turn that into computer-processable text data. Typically OCR is used in situations where manual transcription would be too costly or time consuming, so often in large-scale archival projects: things like Chronicling America, the Library of Congress's newspaper collection, and in fact a lot of different newspaper collections, but also HathiTrust, Google Books, big collections. Part of my interest in this is that a huge amount of humanistic research these days relies on OCR, even when humanities scholars are not always very conscious of the fact that they're using that data. Okay, so we tend really only to think about OCR, at least humanities scholars tend primarily to think of OCR, when it breaks or when we notice problems with it, and we don't tend to think about it or engage with it when it's working as we would expect, or when we don't realize that it's breaking, I guess. There are notable projects that are trying to improve OCR. If we think of the Trove newspaper collection in Australia, they have a crowdsourced transcription element to that project where citizen scientists can come to Trove and decide to manually transcribe certain portions of the text and improve the OCR. There are also things like the Text Creation Partnership, which is working to improve the transcription of early modern texts that were scanned. But a lot of these approaches, we find, don't scale as we might imagine. Take the Trove newspaper archive: they've had millions and millions of lines of text manually transcribed, but when you look at those lines in comparison to the total number of lines in the whole archive, it's really quite a minute portion. And so part of what we were thinking about coming into this project was what kinds of things could be done at scale that might help to address some of these issues. So, as of July 2013, for instance, 100 million lines of text had been corrected in Trove, which does sound like a big number, but across 21 million newspaper pages that ultimately ends up being a pretty small proportion of the total. All right, so you can see some of this, right? This is essentially what the Trove interface looks like for doing that correction. They do have these amazing community features, like the Hall of Fame, to recognize the people who are contributing an enormous amount of data to that. All right, so why are David and I interested in this? David is trained as a computational linguist and is primarily in the computer science college at Northeastern; again, I'm in the English department. A lot of our interest came out of the research that we've been doing together over the past seven years. The most prominent example of this is the Viral Texts Project.
So, I work on 19th-century newspapers, and I started working with David on this project, which aims to uncover the ways that stories and other texts were reprinted in 19th-century papers. We use data mining to automatically find duplicate sections of text across archives. We started with Chronicling America; we're now working with a lot of archives, and more recently the project has expanded to a kind of global scope. We're working with a six-nation team to look at reprinting across languages and across translation, using not only American archives but also international archives of historical newspapers. A lot of our methods for detecting reprints are about the sort of fuzzy boundaries required to find matches. The texts were changed in the 19th century as they circulated, editors changed them around, so there's some instability there, but also, obviously, when things are scanned and OCRed, the OCR changes the texts quite a lot. And so if you were just looking for exact matches, you'd be kind of out of luck. So the algorithms that we developed to do this kind of text reuse detection were foundationally about accounting for the OCR errors in the collections we were using. And this is partly how we started to think about OCR as an intellectual problem and how we got interested in this work. This is just an example of the kind of matching that we're doing, this sort of fuzzy matching of text in these archives. So this research that we did together took us in two different directions. For myself, trained as a book historian and someone very interested in histories of textual technologies, I got quite interested in OCR as an object of scholarly investigation. And so I published a piece in Book History which is specifically about how we as book historians might think about OCR. Part of the frustration that led to that article is that I kept noticing talks at DH conferences, at humanities conferences, where a scholar would get up and at some point display a slide of dirty OCR, like the one I've just shown you. And everyone would kind of groan and we would collectively shrug: what can you do? And we would move on. I felt like there was a real block of imagination happening. People weren't thinking about what we could do with those OCR collections as given, or what we might do to help improve the situation. It was just a kind of collective apathy that I was getting increasingly frustrated with, with that trope happening at all of these conferences. And so, yeah, in that piece I do this whole thing where I look at the originals and how we get from the actual historical newspaper to the OCR version of the newspaper that we work with. I talk about things like inking errors on the page and how those eventually lead to mistranscriptions in the OCR, and so on and so forth, right? So this was my interest in it. David's interest branched off in another direction. As I said, we were doing all of this work with reprinting, identifying all of these duplicate texts. And what David started to think about was that, okay, with any given text there's error introduced by OCR. But if we have in fact identified that these 300 texts are matching texts, that they're the same text, then we could use that duplication within the archive to do a kind of automatic OCR correction, right?
Because the OCR errors will not be identical across all 300 copies. And so he began to experiment with how we could use the literary-historical work we were doing to feed back into an OCR system and improve the output. So we both got interested in OCR from these really different directions, and this is why we were interested in writing this report. Okay, so in terms of how we approached the problem, there were a few primary areas that we really wanted to focus on in thinking about what the current state of OCR is and what the future might be. One would be those newspapers that I was pointing to, right? OCR was mostly developed and designed to work with typeset business documents from, say, 1950. That's the ideal OCR case, and if you're working with typeset business documents from 1950, it's really good: you get 98, 99% accuracy with a lot of current OCR systems. One of the places where many in this room will know it goes wrong is when you have these historical documents whose typography is unique, that is not 20th-century typography. There's physical damage on the originals. There are things like those inking errors I was talking about. And then there are strangenesses of layout that are just distinct from the contemporary documents OCR systems were designed to deal with. So this was one of our major areas: historical documents with these unique properties. Another would be documents written in languages that use systems other than Latin script. Most OCR engines that exist today were primarily designed for Latin-script languages, European languages. But there are a lot of scholars who want to work with documents written in other languages. There's obviously research around this, but sussing out what that research looks like, who's doing it, and how it feeds back into these scholarly systems was something we wanted to find out. And the other would be multilingual documents, which is to say documents that have internal movement between languages. Most OCR systems that currently exist are not very good at switching between languages: they get trained on a particular language and they're not good at this rapid switching. We found there was a community of scholars interested in these kinds of documents, and OCR systems weren't handling them very well. The other part of it, and really a primary motivation for writing this report, and this came directly from the NEH and the Library of Congress and the Mellon Foundation in articulating why we should write this report, is a sense that there are lots of communities interested in or working with OCR, but that these communities are largely not talking with one another. So we have computer scientists who are researching OCR or, and this is important, working in related fields that don't get called OCR, such as computer vision. I'll get to this a little more in a second, but OCR in the CS world is largely seen as not a really exciting problem, and you'll find that a lot of current OCR researchers in computer science even perceive that their research is not highly valued. But there are fields derived from OCR, or doing work that I would recognize as OCR, that are in fact quite prestigious right now, like computer vision: the algorithms that try to help an automated car read road signs, things like that, that are identifying text in the world, but it doesn't get called OCR.
There are library and information scientists who are researching OCR or implementing OCR. There are commercial and nonprofit OCR developers. There are the funders in the humanities, computer science, information science, and cultural heritage. And then there are the scholars like me who are trying to do text-based research derived from OCR collections, the libraries and archives that manage or hold large digitized text collections, and then these scholarly societies and advocacy groups. And again, the scholars are sitting there bemoaning the quality of the OCR, but they're not necessarily talking to the OCR researchers to see what we might do. There are all of these kinds of cross-conversations that we were hoping this report might help bring together. So the research process was essentially this. We started with two concurrent modes of research. The first was that we put together an online survey and simply distributed it out to as many pertinent lists as we could: digital humanities lists, library lists, computer science lists, and so on. We also identified a set of people that we wanted to talk to across the different domains: people in CS doing OCR research, people doing these kinds of text mining work, people who are managing these collections. We reached out to those folks and set up interviews. Some of the interviews were one-on-one; a lot of them were with teams who were either working on a project or managing a collection. From the interviews, we tried to isolate a set of the primary concerns across those communities with the current state of OCR. The next stage was that we started putting people into groups and having virtual group discussions. One important thing is that at this point we transitioned from listing the problems with OCR to articulating what interventions would be significant to your field in terms of OCR. Because we found that people are super ready to tell you all the reasons why OCR doesn't work or is bad; it was actually a lot harder to get people to try to imagine what would help. And so the working groups were really about trying to get to that next stage. Once the virtual working groups had met several times and tried to articulate a set of potential action items, we convened a workshop of people from each of the working groups at Northeastern. This happened last year, in February of 2018. While all this was happening, we were doing some experiments with constructing test corpora, collation, and error modeling. This was the more CS side of things, to test some of the ideas that were coming out of the working groups. And then the final part was that we tried to bring all of this together into the report, which we wrote over the summer and fall of last year and which was released in January of 2019, so just a few months ago. I wanted to get this up here because it was the big question that we put to the working groups, and it really shaped how we tried to structure the recommendations in the report itself. We got a lot of encouragement from the funders to think in terms of moonshot ideas, as it was often framed: if you could imagine adequate time, attention, and funding, what innovations in OCR would most significantly move research forward in your domain? So if there were this kind of concerted effort, what would make a difference? Any questions yet?
I'm gonna move on and start talking about the individual recommendations, but I've been working in a lab with social scientists for a couple of years now and they like to ask questions in the middle of presentations, so I've gotten used to it. So I'm wondering if people have questions about the structure, like how we did the research, or any of the early stuff, before I transition to the actual recommendations. Great, I guess. All right, so, the report is available online. I'm gonna try not to just read from the report; we can have a conversation if there are things that are not clear. I've got a few little moments that I'm gonna pull out, but I don't wanna read everything. I just wanna say in general, if you look at the structure of the report, every recommendation begins with who that recommendation is intended for. Some of them are aimed more at the technical side of OCR; some of them are aimed more at the broad funding structures for the research. So we try to identify who each recommendation is for and then summarize what the action items would be as we see them. The other thing to say, and you'll notice this as I go through these recommendations, is that there's an enormous amount of overlap, right? A lot of these things you would not imagine actually happening in isolation; these different items would complement each other, and we would imagine people proposing projects that would in fact draw on several of these points and several of these areas. And again, I will apologize in advance: I'm going to do my best to summarize all of these, but there are a few that frankly get to the edges of my own expertise and understanding. I exchanged a few emails with David in the past week just saying, can you give me the plain-language version of this please? Because that's what I need to communicate to people. All right. So, recommendation one: improve statistical analysis of OCR output. Where does this come from? It came initially from what I was telling you before, which is that I am accustomed to seeing scholars get up and essentially say the OCR in this collection is too dirty for me to do the research that I want to do. And what we realized as we started to dig into this is that there's a broadly shared, largely anecdotal sense of this, but there are actually not really good measures for knowing what level of OCR quality is sufficient for which kinds of tasks. Which is to say, we have kind of decided that for keyword search, fairly dirty OCR is at least acceptable enough to put online and have people use, right? There's dirty OCR underlying all of these huge archives and we've decided that keywords appear often enough that it's probably sufficient, but actually we don't have clearly established guidelines even for that. And especially not for, okay, what is the actual impact of a certain quality of OCR on someone who wants to do topic modeling? There's a sense that, okay, that looks too dirty, but we don't actually know, statistically, whether it is. Can you get anything of value out of OCR that's 70% accurate, 80% accurate? So a lot of this first recommendation is specifically about developing these kinds of ways of talking about, or assessing, what is good enough for what. So, a few things. We want better models for post-correction; I talked about David's work in using duplication within a collection to try and clean up that collection.
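Just to make the shape of that idea concrete before going on: the very simplest kind of post-correction snaps suspicious tokens to the nearest entry in a lexicon. The sketch below is a minimal, hypothetical illustration, not anything from the report or from David's system; real post-correction uses period-appropriate lexicons, language models, and models of typical OCR confusions rather than a flat wordlist, and the lexicon, cutoff, and example string here are all placeholders.

```python
from difflib import get_close_matches

# Toy lexicon; a real system would use a period-appropriate dictionary,
# a language model, and a model of typical OCR character confusions.
LEXICON = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

def correct_token(token: str, cutoff: float = 0.7) -> str:
    """Snap an out-of-lexicon token to its most similar lexicon entry,
    if any entry is similar enough; otherwise leave the token alone."""
    if token.lower() in LEXICON:
        return token
    match = get_close_matches(token.lower(), LEXICON, n=1, cutoff=cutoff)
    return match[0] if match else token

if __name__ == "__main__":
    noisy_ocr = "the qu1ck hrown fox iumps over the lazv dog"
    print(" ".join(correct_token(t) for t in noisy_ocr.split()))
    # -> the quick brown fox jumps over the lazy dog
```

Even this toy version shows why the approach is risky on historical text: every genuine archaic spelling or proper name that falls outside the lexicon becomes a candidate for being "corrected" into something else, which is exactly why it matters which models work for which collections.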
There are lots of interesting experiments out there in the world with various models for post-correction: models that use dictionaries, models that use whole sentences to evaluate the statistical likelihood that a reading is the right sentence and clean it up that way. So a sense of which of those models can be useful, and in which collections, is part of this recommendation. The other big part, though, is that we actually need some sense of what the impact of OCR is on downstream tasks, whether that is keyword search, part-of-speech tagging, topic modeling, or word embeddings, so that a scholar who wants to do the research can actually look at a collection and say this should be sufficient, or this wouldn't be sufficient. That's something that is purely anecdotal at the moment; people look at a collection and have a hunch, but we need better ways of establishing it. And the other part of this is that we need better ways of communicating the statistical impact of OCR errors, so that if I do topic modeling on a collection, I can say something about what the quality of the OCR might lead us to believe about the reliability of the results. Rather than just saying it seems kind of fuzzy, it would be good to have models for how to communicate that to other scholars, who could then decide whether it seems acceptable or not, so that we can have an actual debate about these things. So, for example, we found that scholars might wish to be able to estimate the average error rate, or the distribution of errors across documents, in large collections where creating ground-truth transcriptions is impossible or impractical. We think there are lots of post-correction models that can be used to perform unsupervised estimation of error rates and correction. And ultimately we think that one benefit of this is that it can also contribute to a wider discussion in the humanities around quantitative methods. Any of you in the DH world know that's been a big topic this past week, but it's one of these things about which we only have a very fuzzy sense. And so this is our first recommendation: we want there to be something other than a fuzzy sense. I think probably the single most cited issue that we ran into as we talked to different scholars was the question of layout analysis, which in some ways is a question that's prior to the actual OCR, right? It's just determining, when you look at a document, what the areas are that need to be analyzed and transcribed, so that you don't get, as you sometimes do in historical newspaper OCR, the OCR running across the columns rather than identifying the columns, particularly when the columns themselves are not consistent across the whole page, things like that. So we heard again and again about layout, particularly from researchers working on non-Latin scripts, historical Arabic and Chinese scripts; they mentioned the unusual layout of their print and manuscript sources as a particular area of challenge. Even researchers working on unsupervised OCR or computational bibliography, such as the folks working on the Ocular project, if you know about that, mentioned that simple existing rule-based methods for layout analysis were not sufficient for the kinds of work they were trying to do. Scholars working on critical editions of ancient Greek mentioned the problems of dividing the page into text, headers, marginal references, and so on. You think of these really complex pages.
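If you haven't looked under the hood of those simple rule-based methods, they often amount to something like the toy sketch below: binarize the page, count the ink in each pixel column, and treat wide runs of empty columns as gutters between text columns. This is my own illustrative sketch, assuming a grayscale page image loaded as a NumPy array; the thresholds are arbitrary placeholders, and it is exactly the kind of heuristic that falls apart on skewed scans, marginalia, and the complex pages just described.

```python
import numpy as np

def find_text_columns(page: np.ndarray, min_gap: int = 25) -> list[tuple[int, int]]:
    """Very naive column segmentation by vertical projection profile.

    page: 2-D grayscale image, 0 = black ink, 255 = white paper.
    Returns (start_x, end_x) pixel ranges for each detected text column.
    """
    ink = page < 128                            # binarize: True where there is ink
    profile = ink.sum(axis=0)                   # ink pixels in each image column
    is_text = profile > 0.01 * page.shape[0]    # columns with "enough" ink

    columns, start, gap = [], None, 0
    for x, has_ink in enumerate(is_text):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                  # a wide white gutter ends a column
                columns.append((start, x - gap))
                start = None
    if start is not None:
        columns.append((start, len(is_text) - 1))
    return columns

# Usage sketch (assuming Pillow is installed):
#   from PIL import Image
#   cols = find_text_columns(np.array(Image.open("page.png").convert("L")))
```

The moment a column of type is skewed, or an ornament spills across the gutter, or a marginal note sits in the "empty" space, a rule like this starts merging and splitting regions in ways that wreck the downstream OCR, which is why the report looks toward the learned approaches I'll describe next.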
So essentially in this section we survey a lot of research across a bunch of domains, machine learning, neural networks, computer vision, which, as I said before, are areas of higher prestige in the computer science world and which actually have a lot of potential application in the analysis of document layout. And essentially what we're recommending here is that we need to find ways of connecting the scholars doing that kind of computer vision or machine learning work with the scholars who have this really interesting historical document data. Because we found, in talking to those researchers, that they are just interested in interesting data. They want interesting data to work on. They think that a lot of our data is quite interesting, but the channels for connecting these scholars are not well established. If we could do that work, though, we think there's a lot of really exciting research that could come out of it. Enormous potential, not much crosstalk. And so that's one of our big outcomes here. And to my mind, I'd love for some of the folks doing computer vision to be working on something other than just reading road signs; there's a lot of other interesting work out there in the world. Okay, recommendation number three is to exploit existing digital editions for training and test data. One of the things we spent a lot of time thinking about is that there has been, over the past few decades in the digital humanities, an enormous amount of work in transcription and encoding and edition building. Things like the TEI; at my own institution, the Women Writers Project, an enormous, longstanding project that's been digitizing the work of women writers from the Renaissance to 1800; other projects like the Open Islamicate Texts Initiative. But most of these projects, when they do this work, are not thinking about how the manual transcription they're doing might in fact be quite useful as training data for OCR systems that could then be applied to similar documents in other collections. They're not necessarily capturing all of the elements, like the coordinates on the pages they're transcribing from, that would allow that data to feed into a training model for an OCR system. But there's enormous potential, because there are so many of these projects out there. And so our recommendation here, essentially, is not just, hey, you should be doing this, but that we should be developing systems to make this easy, right? I suspect that many folks building those digital editions would not know how to prepare the data in a way that would make it useful for OCR training, but that if you told them, if you actually just added these fields to your transcription then this could easily be used to help build a model for early modern OCR, they would in fact be very willing and excited to contribute. So this is an enormous well of data that's already out there, and we think that with just a little bit of effort it could contribute really dramatically to improving OCR models for humanistic materials. Where are we so far? Anyone wanna talk about anything I've discussed thus far? All right, so this is a long talk, I feel like it's just been me, all right.
So I just wanted to mention, we actually did some experiments with this. The Richmond Daily Dispatch is a newspaper that is in Chronicling America, so we have the OCR data for a long run of the Richmond Daily Dispatch, but there's also a project at the University of Richmond that manually encoded all of the issues of the Richmond Daily Dispatch between 1860 and 1865. So this is an instance where we have both an OCR version of the data and a hand-transcribed version of the data. And we were able to do some experiments basically demonstrating how building a training model on the transcribed data could then be turned to creating better OCR data for the other issues of the Richmond Daily Dispatch that were not hand transcribed. It was a relatively small experiment, and shoot, I don't have the exact numbers here, but the OCR for the Dispatch was something like 75 to 80% good before, and we were able to get it up to 94, 95% accuracy by training a model on the transcribed version and then applying that model to the other issues of the same newspaper. We've done some similar experiments with sections of EEBO and ECCO that have been hand transcribed through the Text Creation Partnership; that's another instance where we have both OCR data and hand-transcribed data for the same documents. And again, we found similar kinds of gains in the OCR when you train a model on the transcribed text and then apply it more broadly to similar documents in the same domain. So we feel like this has a lot of potential. The other thing that's just worth saying is that, if we think about this model, one thing it relies on is just how much text is in fact duplicated across these collections. Again, I'm interested in text reuse, I'll talk about it all day, but we've looked at things like Google Books data and HathiTrust data, and we find that huge percentages of these collections are duplicated, through quotation, through titles being held multiple times. So I think there's actually enormous potential in exploiting duplications of various kinds across our current collections in order to improve OCR. Oh, I don't know why I put this here. This is the work David's been doing on using reprints to suss out how to correct a line. You can see, right? Again, lots of internal difference, but it's difference that you can use to estimate what the line should actually be. So this is related.
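In case it's useful to know what figures like "75 to 80% good" and "94, 95% accuracy" are actually measuring: character accuracy is conventionally derived from the edit distance between the OCR output and a ground-truth transcription. Here is a minimal sketch of that calculation; the example strings are invented for illustration, not actual Dispatch data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete, substitute)
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_accuracy(ocr: str, truth: str) -> float:
    """1 minus the character error rate: roughly, the share of ground-truth
    characters that the OCR got right."""
    if not truth:
        return 1.0
    return max(0.0, 1.0 - levenshtein(ocr, truth) / len(truth))

ocr_line   = "Tlie Richmond Daily Dispateh, Jurie 3, 1862."
truth_line = "The Richmond Daily Dispatch, June 3, 1862."
print(f"{character_accuracy(ocr_line, truth_line):.1%}")
```

Something along these lines, run over a sample of hand-transcribed lines before and after retraining, is typically how accuracy numbers like those get computed, though whether a project reports character-level or word-level accuracy varies.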
Okay. This next one is really closely related to number three: we recommend that there be a way for people who have ground-truth OCR data to contribute it somewhere where people doing OCR research can find it, right? So this is another one of those platform questions. Again, I'm thinking of Kathleen's talk, but again, we have a huge body of these transcription projects, and there's not a clear understanding of how you would get that data to the people who might make use of it. And so here what we're recommending is simply that such a system exist. Now, there are systems out there doing some of this work, particularly in Europe. There's the Transkribus platform, which some of you might be aware of, but in some ways that platform is not as community-oriented as what we're recommending here. With Transkribus, if you are a scholar and you want OCR, you can contribute the text and you get OCR back, and they keep the training data for the kind of work that they're doing. What we're imagining is something far more broad and community-oriented, where the data coming in and going out is all community-owned and used. But there is some work that we can look to and draw on as we think about this area; we can look at some of the existing platforms out there for doing various kinds of community correction. We think there's a lot to build on here, but the question is how to get the kind of training data that comes out of a project like this one and make it more widely available to both OCR researchers and humanities scholars. But that kind of very boutique training is not always possible, and that's what the next recommendation is about: model adaptation and search for comparable training sets. I'm gonna do a little bit of reading now because this is definitely the most David-heavy recommendation, and this is the one where I emailed him and said, give me language to use, please, David. So here's what he wrote, and I think it's pretty good. If you know ahead of time what text you'd like to OCR, you can collect training data for it by transcribing a certain number of pages; we've been talking a lot about that sort of transcription. But unless you want to commit to manual transcription of a sample of every new text, there will always be new texts that aren't exactly the same as the ones you've collected training data for already. And he's particularly thinking here about really big and diverse collections, where it's not just one domain but lots of books across lots of different domains, and it might be difficult to have a hand-transcribed training data set for every possible type of book in that collection. So he says some progress might be made on this problem by matching on metadata fields: taking the metadata field about the genre, say, and trying to link that up to a training model. But once you have hundreds of typefaces and book layouts or manuscript hands, it's impractical to have humans select the correct model for each book that you want to OCR. And so what he's proposing here is that we work on building automatic processes to select the most appropriate model, from among the models already trained, given the text to be OCRed. So, imagining that we have some kind of repository of different models in different domains, what we're suggesting is that there need to be methods for, as OCR is happening across a large collection, looking at a text, making some judgments about what kind of text it is, and then linking that up to an existing OCR model, dynamically switching between models as the collection is OCRed. Or, he says, building automatic processes to combine the results of several models for a given text. One of the most promising areas we learned about in computer science for building better OCR is to run not one OCR engine across a collection but multiple OCR engines, and then to bring the results together, with a statistical process to decide which of the results is the best one, by a voting system or any of several other approaches. So this recommendation, again, is about building that kind of comparison into a more automatic process, so that it's not just tailor-made for this collection or that, but can be applied more easily. Yep, and actually I'm gonna skip his next paragraph because I think it's a bit of a side issue.
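To picture what that kind of combination can look like in the simplest possible case, and it's the same intuition as the reprint-based correction I showed a moment ago, here is a toy sketch that, for each line, keeps whichever candidate transcription agrees most with the others. Real ensembling aligns the outputs character by character and votes position by position; this shortcut is only meant to show the shape of the idea, and none of it is project code.

```python
from difflib import SequenceMatcher

def agreement(a: str, b: str) -> float:
    """Similarity of two candidate transcriptions of the same line."""
    return SequenceMatcher(None, a, b).ratio()

def consensus_line(candidates: list[str]) -> str:
    """Pick the candidate that agrees most with all the candidates.

    Independent engines (or independently typeset reprints of the same text)
    rarely make the same mistake in the same place, so the output closest to
    everything else is usually the cleanest one.  Summing over the full list,
    self-similarity included, adds the same constant to every score.
    """
    return max(candidates,
               key=lambda c: sum(agreement(c, other) for other in candidates))

if __name__ == "__main__":
    outputs = [
        "The qnick brown fox jumps over the lazy dog.",  # engine A misreads 'quick'
        "The quick brown fox jumps over the lazy dog.",  # engine B gets the line right
        "The quick brown fox jnmps over the lazy dog.",  # engine C misreads 'jumps'
    ]
    print(consensus_line(outputs))  # engine B's reading wins the vote
```

The same scoring works whether the candidates come from different engines run over one image or from one engine run over three hundred reprintings of the same story; either way, disagreement is the signal.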
All right, the next recommendation is related to the multilingual focus that we wanted to take. There are some really great projects out there in the world currently that are trying to think through the question of OCR for multilingual collections. One is the Primeros Libros project down in Texas, which was working with documents from the early North American colonial period where you have Latin and Spanish and also Native American languages constantly interspersed with one another. So in fact, if you were reading down this page, you would see that much of it is in Spanish, but they transition among several different Native American languages in the space of just one page. And this is the kind of radically multilingual text that contemporary OCR is very, very bad at working with. Over here we have another example where we have mostly German, but we also switch between German and Latin and French, again on the same page. And what's interesting here is you could imagine training a model that looked for the Fraktur, this German typeface used in German print until the 1930s. There's an enormous amount of research on Fraktur, enormous comparatively, at least as far as any OCR topic goes. But in this case, the German is in Fraktur while both the Latin and the French are in Latin script, so you can't train the model entirely on the typefaces. There's also multilingual text even within a particular typeface. So our recommendation here is that we're proposing that the builders of OCR models explicitly develop, train, and test their models on text with mixed languages and a range of historical and dialectal variants. We're suggesting that the creation of annotated corpora in genres that are less likely to occur in modern collections, like dictionaries, critical editions, commentaries, and grammatical works, be a focus of some of this research. A lot of this is gonna have to fall to the scholarly communities; there are not gonna be commercial OCR developers who care about this, it's not a big enough market. And so this is a place where the scholarly communities interested in these texts are gonna have to play a major role. We think that some of these projects, early modern book and manuscript transcription projects in particular, might benefit from following the practice that Primeros Libros has taken on of explicitly modeling orthographic variation and code switching within the data. And what we mean by this is that we need models that put the weight of the interpretation onto the rare embedded languages. So even if a page is primarily Spanish, it has these Native languages in it, and essentially their model is to privilege the Native languages in order to transcribe them more accurately, even though that leads to a slightly less accurate transcription of the Spanish. We think we need more models that don't simply use the dominant language as the primary language in the model, in order to improve the OCR that's happening in these domains. That's a trade-off that scholars might be able to lead the conversation on, because OCR researchers are probably not gonna lead it.
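One small ingredient of what "explicitly modeling code switching" can mean in practice is deciding, token by token, which language a word belongs to, and deliberately weighting that decision toward the rarer embedded language. The sketch below is a toy character-bigram classifier with an adjustable prior; to be clear, it is my illustration of the trade-off, not the Primeros Libros pipeline, and the tiny training samples are invented stand-ins for real data.

```python
import math
from collections import Counter

def bigram_model(sample_text: str) -> Counter:
    """Character-bigram counts from a small sample of one language."""
    text = f" {sample_text.lower()} "
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def score(token: str, model: Counter, prior: float) -> float:
    """Log 'probability' of a token under a crudely smoothed bigram model."""
    padded = f" {token.lower()} "
    total = sum(model.values())
    logp = math.log(prior)
    for i in range(len(padded) - 1):
        logp += math.log((model[padded[i:i + 2]] + 1) / (total + 1000))
    return logp

# Invented scraps standing in for real training text in each language.
spanish = bigram_model("los libros que fueron impresos en la ciudad de mexico")
nahuatl = bigram_model("in tlahtolli in huehuetlahtolli tlaxcallan mexihco nahuatlahtolli")

def tag(tokens: list[str], embedded_prior: float = 0.6) -> list[tuple[str, str]]:
    """Label each token 'es' or 'nah'.  Setting embedded_prior above 0.5
    privileges the embedded language when the evidence is ambiguous,
    mirroring the trade-off described above: better transcription of the
    rare language at some cost to the dominant one."""
    labels = []
    for t in tokens:
        s_es = score(t, spanish, 1 - embedded_prior)
        s_na = score(t, nahuatl, embedded_prior)
        labels.append((t, "es" if s_es >= s_na else "nah"))
    return labels

print(tag("los huehuetlahtolli fueron impresos".split()))
```

A real OCR system would make this kind of decision jointly with character recognition rather than on already-recognized tokens after the fact, but the prior is where the policy choice about which language to privilege lives.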
Okay. This next one is probably the most moonshot of the moonshot recommendations. To summarize it: obviously the bulk of OCR research has concentrated on the dominant languages and the dominant scripts. Because of that, we became quite convinced over the course of this research that there are domains where an enormous amount of progress could be made in a very short amount of time with just a concerted effort. Which is to say, if we could bring together teams of OCR researchers, domain experts who can help with transcription and building training data, and the infrastructure of institutions, there are in fact a number of domains where we could see an entire phase shift in how good OCR is for particular languages or particular periods. So this is primarily a recommendation to funders; we're thinking of something like the digital humanities institutes that get funded by the NEH. We think that kind of focused, concerted effort would have enormous consequences. We don't necessarily single out which languages or domains, because we could imagine it being either a language, say Arabic or Chinese or ancient Greek, or a period, say the 18th century. But what we're recommending is a kind of challenge grant program aimed at making these substantial improvements. It's gonna require the domain experts, because a huge part of making this work is going to be enough transcription, educated transcription work, to establish the training data that makes it possible. And so within the period of these institutes, we would want the domain experts to commit to transcribing X amount of data, whatever was sufficient, and then to apply that intensive transcription work toward the OCR of texts within that domain. The idea is that you're buying in, essentially: you're committing this much labor in order to get an enormous benefit, because I'll contribute this many pages so that a collection of probably exponentially more pages can be scanned and OCRed with reasonable effectiveness. A big part of this recommendation, sorry, I keep losing my place, a big part of this recommendation is that we think it's going to require a bit more organizational work, perhaps even on the part of the funders. Which is to say, we don't necessarily think the domain experts know who the OCR people are, who they should be reaching out to; that became very clear over the course of doing these interviews. We think there are lots of willing researchers who would gladly jump into such a project, on both the humanities and the computer science side of things, but we think that either the funders or institutions are going to have to take a more proactive role in essentially matchmaking, because we're not sure this is going to happen organically without that kind of organizational work. We do think that the potential payoffs are enormous, transformative, potentially, for scholarly communities who are working with languages or periods that have thus far been little served by OCR development. And at the very least, we think that such institutes would promise to expand the potential for computational research into domains that have been largely cut off from it, domains that this branch of the digital humanities has largely not reached.
There are large scholarly communities who haven't been as much a part of that conversation as they could be if they just had the data to work with. Okay, number eight: create an OCR assessment toolkit for cultural heritage institutions. As we were surveying all of these collections, we found, you will be shocked to learn, that the older a collection is, the worse the OCR is, generally. Not exclusively, but generally. And so there are a lot of collections out there where, if we simply reprocessed all of the images using a contemporary OCR engine, the quality of the OCR would improve dramatically. You can see this really clearly in Chronicling America: the newspapers that were part of the first round of grants have the worst OCR, and the newspapers that were part of the last round of grants are pretty good, generally. And that also means, painfully for me, that some of the most prominent newspapers of the 19th century were the ones digitized first, because they were the ones that all the scholars said we needed to have digitized. So some of the most important newspapers in fact have the worst OCR; the New York Herald has the worst OCR, I think, in all of Chronicling America, and it's a really important newspaper. What we found, though, is that when we talked with librarians, when we talked with people who maintain a lot of these collections, there was a lot of uncertainty about when it would be wise to re-OCR a collection, what the actual payoff would be, how much benefit you would actually get from investing in such a thing. I think everyone generally knew that OCR has gotten better, but how to evaluate a collection and determine whether it should be reprocessed was something everyone was deeply uncertain about, and in the midst of that uncertainty the decision to do that work is never gonna be made. And so our recommendation here is essentially that we need a toolkit, preferably a toolkit that can be integrated into the standard institutional evaluations of collections, so that it's just a part of work that gets done, as opposed to an extraordinary extra thing, and that would enable essentially an audit of OCR collections that would simply say: here are the collections you have, and using new software would stand to improve the OCR quality by 5% or 20%. Then institutions can make more informed decisions about when it might be worth that kind of investment. At the moment it's all stagnant, and we think one of the main reasons it's stagnant is that no one really quite understands how to know what could be done or why they would do it exactly. So, all right, a toolkit.
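A full toolkit would obviously be more sophisticated, but the heart of such an audit can be something as simple as a proxy score that runs over an existing collection without any ground truth, for instance the share of tokens that look like real words, computed before and after re-OCRing a sample. Here is a rough, hypothetical sketch; the wordlist path, file layout, and thresholds are all placeholders, and a real audit would want a historical lexicon rather than a modern one.

```python
import re
from pathlib import Path

def load_wordlist(path: str = "/usr/share/dict/words") -> set[str]:
    """A flat modern wordlist as a stand-in for a proper historical lexicon."""
    return {w.strip().lower() for w in Path(path).read_text(encoding="utf-8").splitlines()}

def quality_proxy(text: str, lexicon: set[str]) -> float:
    """Share of alphabetic tokens found in the lexicon.

    Not a true error rate: proper names, historical spellings, and rare words
    all count against it.  But it needs no ground truth, it is cheap to run
    over millions of pages, and it tends to move in the same direction as
    actual OCR quality."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text) if len(t) > 2]
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

def audit_collection(folder: str, lexicon: set[str]) -> None:
    """Print a crude per-document quality estimate for every .txt file."""
    for doc in sorted(Path(folder).glob("*.txt")):
        score = quality_proxy(doc.read_text(encoding="utf-8", errors="ignore"), lexicon)
        print(f"{doc.name}\t{score:.1%}")

# audit_collection("ocr_output/", load_wordlist())
```

Run over a sample of pages as they stand, and again after reprocessing them with a current engine, the difference between the two scores is the kind of "5% or 20% improvement" estimate an institution could actually weigh against the cost of re-OCRing everything.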
This last one is the other big collaborative, moonshot idea. There are all of these recommendations in the report that have to do with sharing data, sharing models, and sharing technical best practices, and in practice this kind of sharing can impose all kinds of additional burdens on existing organizations and projects. So again, I'm coming back to everything Kathleen was saying about Humanities Commons and sustainability, et cetera. There are lots of projects and organizations that rely on software and computing resources provided by third-party vendors, ABBYY and other companies that provide OCR software, but those systems don't adapt well to the kinds of challenges that we lay out elsewhere in the report. And we don't necessarily want scholars to have to rely on those vendors to form the kinds of collaborations and sharing that we're recommending here. We don't think that most current vendors are well placed to provide the collective benefits to the research community that we're recommending in all of the other recommendations in the report, which results in all of these little projects everywhere reinventing the wheel every time they need to do digitization or work with OCRed materials. So our last recommendation is that we're proposing some kind of OCR service bureau that would be housed in a library or in a related organization, like a professional organization. We've had some conversations with the Digital Library Federation and HathiTrust and others about who would even host such a thing, or what it might look like. This bureau would help with OCR training and evaluation and with sharing data and models across projects; it would report on evaluations and best practices and be the place you would go to find them, along with things like the collection evaluation guidelines; and by hosting data and providing continuity for OCR in these sorts of collections, it would allow for easier collaboration among projects, and even collaboration with commercial providers. We did find that Google's current OCR is pretty remarkable. There was one member of the Google OCR team who was part of this whole project; we interviewed him, he came to our workshop, but as always with Google, how one actually makes that connection is a pretty murky thing. He was super enthusiastic, but whether Google would be enthusiastic was another question. We think the OCR service bureau would help pool some of these concerns, so that it's not just one individual scholar but groups of scholars, and that could potentially facilitate some of these conversations, facilitate conversations among the libraries that are undertaking digitization projects and the researchers who are using those products. Essentially, what's guiding this recommendation is that we need a solution in which contributors have control over the data that they contribute, or some ownership over those contributions and what comes out of them. And we need something that works at scale, and an organization that helps make that possible. So, just in wrapping up, and then we can talk together: after the report came out, I wrote a blog post addressed to my colleagues in English and history and other departments, because I think I'm very unusual in thinking so much about OCR. Most of my colleagues, what I mean is, I don't even know that they're aware of the extent to which they interact with OCR, the extent to which their research may in fact rely on OCR, and the extent to which we need both their expertise in addressing this problem, in collaboration with these other groups, and their advocacy as well, because if the people using these databases are not advocating for the kinds of changes we would want to see, then those changes are not going to happen. So I'm gonna end with how I ended that piece, because I do believe this. In the recommendations of this report, you will see that the most pressing research in the field is going to require extensive development of training corpora: accurate transcriptions of materials in particular domains that can be used to train OCR systems.
In other words, the most pressing OCR research is going to require the expertise of humanities domain specialists: book and textual historians, scholars of languages and writing systems, scholars of particular genres or historical periods. One of the people who tweeted about the report said, and I quote, "one sees an entire future of collaborative scholarship here." And I agree strongly: there's enormous potential in OCR research for meaningful, important collaboration across a range of fields, but particularly across the humanities, libraries, and computer science. And for me, this is the central reason why humanists like myself should care about OCR: not to bemoan its current state, but to imagine its future. And that's what I hope the report will help people begin to do. So now I hope people have questions, because we have time.