Hello. My name is Matthew Lincoln. I'm the Collections Information Architect at Carnegie Mellon University, and I'm joined today by my colleagues Julia Coren, our University Archivist, and Emily Davis, also an archivist with the University Libraries. We're going to be talking about a project centering the human expert in experiments with computer vision infrastructure for digital collections management. Not with us today is one of our colleagues on this project, Scott Weingart, and I want to call out his important contributions to this project as a project manager.

First, to give you an overview of our presentation: I'm going to hand it over to Emily Davis to talk about CMU's General Photograph Collection and the particular challenges it presented that were the impetus for this project. We'll cover what the goals of this prototype project were over the couple of months that we ran it. Then I'm going to walk through the results of the specific computer vision tasks that we did. And finally we'll return to takeaways for archival management, takeaways for software design, and implications for future computer vision research. Everything that we'll be talking about in this presentation is also covered in a white paper; you'll see the DOI at the bottom of the screen, where you can read in much more detail about all the things we're discussing. So without further ado, I'll hand it over to Emily.

The General Photograph Collection consists of roughly a million images, and they document nearly every aspect of campus life from the university's founding in 1900 to the present day. The vast majority of the collection consists of photographic prints and 35mm negatives, as well as born-digital materials. Most of the photographs were taken by the university's marketing department, and we still get transfers of new images from them on a fairly regular basis. To date, we've digitized a little over 20,000 images, and most of that digitization has been driven by requests from researchers, from various departments around campus, from alumni and their families, and by exhibitions that we do here in the libraries. Unfortunately, none of these images are currently publicly searchable via our digital collections platform, mostly because of compatibility issues. For the past year or so we have been working to migrate to a new system, and we hope to have some item-level metadata to make these photographs discoverable. But as we probably all know, item-level description at scale is hard.

The collection right now is organized at the container level, that is, at the level of a single roll of film or a CD. These containers have fairly minimal description, and it doesn't go down to the frame level. And since the description is minimal and the contents of a single roll can be diverse, we don't want the images to inherit the existing metadata, because then every image in a roll of 35 frames would have the same description. I'll give you an example: we have lots of rolls described simply as "commencement." If you can imagine, the activities that happen during a commencement ceremony are pretty varied, anywhere from the president giving an address, to people receiving honorary degrees, to the commencement address itself from a visiting speaker. It would be better for our researchers, and better for us in finding things down the road, if we could describe all of that in more detail.
Additionally, many of the images in the collection are close matches and near duplicates, as you can sort of see here. If we were to migrate all of these similar images to our new system, we would really clutter the discovery interface, and we don't want to make users wade through eight or nine images that look pretty similar. Plus, if we were to migrate all these close matches, it would really increase the amount of item-level metadata that we would need to create. So it was with these challenges in mind that we were excited to partner with Scott and Matt on the CAMPI project to explore some solutions.

Yeah, so the CAMPI project, standing for the Computer-Aided Metadata for Photo Archives Initiative, was a project that we did over the summer of 2020. The concept for the project was a prototype implementation of a system and user interface. Again, this was a two-and-a-half to three-month project; we were not building a production system. Our goal was to try out some software and come up with a deliverable consisting of high-level requirements for what a production system would look like in the future. That deliverable is a formal report that discusses the effectiveness of specific methods for similarity search, methods for duplicate detection, and image tagging at scale, and then also something we tried: object detection using commercial APIs.

We sometimes got the question: why were we not trying to train a computer vision model to actually do this work, to make the decisions about which photograph gets tagged with what? There are a couple of reasons for this. First of all, custom computer vision training for individual projects, where you have known training data and a known set of tags that you want to match things to and then train a model for that, is a tricky fit with real-life archival workflows. This collection in particular gets continuous new accruals every year; there's new digitization; and there are expanding descriptive needs. The dictionary of terms that archivists might want to add to these photos continually expands, which doesn't really fit with that sort of one-off, project-based "train the model once and it's good to go forever" approach. And in many cases, the visual properties of these photos alone aren't sufficient to describe them completely; the computer only gets access to the pixels it sees. What we were looking for was a system where archivists could work hand in hand with these tools, bringing archival context alongside the things the computer does well, visual search, similarity, and connection, and combining those to help prioritize photos for description, keeping the archivist in the loop rather than trying to outsource all the work completely to a machine.

So we didn't train an entirely new model; instead we used what we call an off-the-shelf, pre-trained model. For example, we used Inception v3 and also experimented with ResNet-18. These models would not work very well if we wanted them to apply our own tags to these pictures, but they're good enough to give us a pretty useful visual similarity search. The example you see here starts with a picture of our College of Fine Arts. That photograph was the input to the query, and all the other photographs, returned by this very generic visual similarity search, come from all across the collection: different years, different rolls of film that have been digitized.
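As a concrete illustration, here is a minimal sketch, not the project's actual code, of how an off-the-shelf pre-trained model can be used as a feature extractor for this kind of visual similarity search. It assumes PyTorch and a recent torchvision; the file names are hypothetical.

```python
# Minimal sketch: a pretrained Inception v3 as a generic feature extractor
# for visual similarity search (assumes torchvision >= 0.13; file names are
# hypothetical, not from the CAMPI project).
import torch
from torchvision import models
from PIL import Image

weights = models.Inception_V3_Weights.DEFAULT
model = models.inception_v3(weights=weights)
model.fc = torch.nn.Identity()      # drop the ImageNet classifier, keep the pooled features
model.eval()
preprocess = weights.transforms()   # resize / crop / normalize as the model expects

def embed(path: str) -> torch.Tensor:
    """Return a unit-length embedding for one scanned photograph."""
    img = Image.open(path).convert("RGB")   # grayscale scans become 3-channel
    with torch.no_grad():
        vec = model(preprocess(img).unsqueeze(0)).squeeze(0)
    return vec / vec.norm()

# Cosine similarity between a query photo and other scans; the top-scoring
# images are what an archivist would see as visual search results.
query = embed("college_of_fine_arts.tif")
candidates = {p: embed(p) for p in ["roll_012_frame_04.tif", "roll_200_frame_17.tif"]}
for path, vec in sorted(candidates.items(), key=lambda kv: float(query @ kv[1]), reverse=True):
    print(path, float(query @ vec))
```

For a collection of a million images you would replace the brute-force comparison with an approximate nearest-neighbor index, but the basic idea of comparing generic embeddings stays the same.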
And so, even though these off-the-shelf models aren't perfect, they can be good enough for visual search; they can be good enough when a human is in the loop. One of our first major tasks was close-match detection, the problem Emily described of having a lot of very similar photos all taken during the same shoot. Using the similarities detected by Inception and these other computer vision models, we could use an algorithm like DBSCAN to help cluster together potential close matches. Then, using these potential matches suggested by the computer, we built out a user interface that could help the archivist prioritize quick review and easy examination of details. You can see here, clicking in to look at the details: is this actually the same set of students at Carnegie Mellon's annual buggy race, or is this a separate set of runners? And during this decision making, we also built a back end to capture all of the editorial decisions that were being made.

Doing this kind of close-match detection was incredibly useful: we found that almost a third of all the photographs we had were in fact duplicates or very close matches. Most of these were cases where we had two to three photos that were very similar, but we had several sets of photographs, especially from things like faculty portrait shoots, where we would have dozens of photographs that were all very, very close together, where we would want to be able to add the same metadata and to group them in a search and discovery interface.

So this was our first step, finding these close matches. Our next step was to figure out how to turn this into similarity-based tagging. We wanted to be able to leverage our existing organization, the organization at the level of a roll of photographs. What we did was build an interface for an archivist to select a photograph they knew was of commencement, which would then bring up similar photographs from across the collections. Starting from that basic visual search, our UI would then allow the archivist to go back to the original context, which you see in the slide-out drawer on the right-hand side. That allows them to quickly look through the entire collection, in this case sorting out which photographs show student and faculty processions versus which photographs show other parts of the event. Having tagged an entire collection, the archivist could then go back out to the similarity search and restart the process: pull up another similar photograph, perhaps from a very different part of the collection, and from there begin the process again. This again allowed us to use the similarity search to begin the process and help with that initial prioritization, while providing the original archival context, so that the human could make much more informed decisions within the context of where these photos were first taken.

Using this process, in a matter of just a few days we were able to apply tags to almost 8,000 photographs from this collection. And when we looked back and did an analysis, comparing what kind of metadata we were able to add to these photos that wasn't already in the higher level of organizational description, that is, the descriptions at the level of those rolls of negatives, we found that we often made a substantial addition to the already existing metadata. So by adding those tags, we were enhancing a great deal of metadata, above and beyond what we already had from that higher-level description.
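As a rough sketch of the clustering step described above, here is how DBSCAN could be run over the same kind of embeddings to propose candidate close-match groups. The eps value and file names are placeholders, not settings from the project.

```python
# Rough sketch: cluster image embeddings with DBSCAN to propose close-match
# groups; every group is only a candidate until an archivist confirms it.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical inputs: an (n_images, n_features) array of unit-normalized
# embeddings and a parallel array of image identifiers.
embeddings = np.load("embeddings.npy")
image_ids = np.load("image_ids.npy", allow_pickle=True)

# eps sets how visually close two frames must be to be grouped; it needs to be
# tuned against known duplicate sets, the value here is just a placeholder.
labels = DBSCAN(eps=0.12, min_samples=2, metric="cosine").fit_predict(embeddings)

# Label -1 means "no close match found"; everything else becomes an entry in
# the review queue rather than an automatic merge.
for label in sorted(set(labels) - {-1}):
    members = image_ids[labels == label]
    print(f"candidate close-match group {label}: {list(members)}")
```

The important design point is that the output is a review queue, not an automatic merge: the archivist confirms or rejects each group, and those editorial decisions are captured by the back end.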
Of course, if you look down below, you can see that certain tags like football or robot would already be well represented in these collection titles, and so this was an interesting point for us to reflect on as well: what were perhaps the biases in the original description of this collection, description that would have been made by the photographer taking the photograph at the time, compared to how, today, we might want to tag these photos for access and discovery.

The final task that we tried was experimenting with a commercial provider, Google's Cloud Vision API, to see how well its tagging of entire photographs, or its identification of objects within photographs, could do some of this work. On one side you see a table showing the top ten tags that we got. This is probably not much of a surprise, but highly generic image description APIs like Google Cloud Vision provide labels that are far too generic for our particular context. We get "black and white," "monochrome," "monochrome photography," "snapshot." We know these are applicable tags, but they're not going to help us sort out this collection. As we discuss in more detail in our report, some of the topical labels, such as "students," "teaching," "laboratory," or "theater," did help identify some photos for consideration. But recall was very poor: there were an awful lot of photographs of students that Google would not recognize as students. So we would suggest that this kind of public API, although very easy to use, does not provide very useful data for public consumption; it would require a full manual review and would still miss an enormous number of photos. That said, it could be useful as an internal search tool, a sort of first pass at adding some metadata that could help archivists and metadata specialists sort through the collection, find photos that they weren't finding already, and then do proper tagging on them.

Those were the results from the Google Cloud Vision API trying to tag the entire photograph. We found a bit more success with object localization, that is, trying to find which part of the photograph shows a particular thing. Here you see all the neckties it could find across our collection. We found that this was actually much more discriminating, and we thought it could possibly make for a very engaging entry point into the collection. So while services like the Google Cloud Vision API aren't a cure-all for the kinds of metadata concerns we have, they are worth exploring, as long as users are aware of their limitations in this particular context. Let me hand it back over to our archivists to discuss their reactions and the implications that this initiative had for photo archives management.

Thanks, Matt. On the archives side, we were very pleased with the outcomes of this project. We need to start by qualifying that with the elephant in the room, which is that, prior to this project, we did not have a DAMS for our photo collection, so accessing and identifying photos in response to reference requests or media reuse has been very difficult and very laborious. Just being able to pull up photos quickly and see them side by side was hugely beneficial for us. We also see some larger implications for how this type of model could be integrated into archival work moving forward. First and foremost is the deduplication of nearly identical photos. As archivists know, the photo collections we are receiving are growing exponentially in size as our marketing departments and photographers have adopted born-digital photography.
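Before going further, here is roughly what the Cloud Vision calls Matt described look like with the google-cloud-vision Python client: whole-image label detection and object localization. This is a minimal sketch assuming a recent version of that client library; the file name is hypothetical and authentication setup is omitted.

```python
# Minimal sketch of the two Cloud Vision features tried in the project:
# whole-image label detection and object localization.
from google.cloud import vision

client = vision.ImageAnnotatorClient()  # assumes credentials are already configured

with open("roll_031_frame_12.jpg", "rb") as f:  # hypothetical scan
    image = vision.Image(content=f.read())

# Whole-image labels: for black-and-white archival scans these tend to be
# generic ("monochrome photography", "snapshot").
for label in client.label_detection(image=image).label_annotations:
    print("label:", label.description, round(label.score, 2))

# Object localization: named objects with bounding polygons, which proved
# more discriminating (e.g., finding neckties across the collection).
for obj in client.object_localization(image=image).localized_object_annotations:
    print("object:", obj.name, round(obj.score, 2))
```

In practice, and in line with the limitations just described, results like these would be stored as internal search suggestions for staff rather than published directly as descriptive metadata.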
Where in the past we might have gotten 36 images, we now may get 360, and most of those images are highly repetitive. Some archivists have proposed using a type of random sampling to reduce the size of these collections, but CAMPI has demonstrated that there are ways to do that in a more knowledgeable and mediated way: rather than randomly selecting two images from a portrait shoot, we can quickly find the best image and remove the rest from our collections.

The visual search was also very useful for access and reference. It allowed us to identify new photos. For example, during the course of this project we were asked to identify the earliest photo of a computer lab on campus. I was able to locate a computer lab photo from about ten years after the lab opened, and then, using visual similarity search, was able to go back and find an untagged earlier photo of that same space. We also think, as Matt mentioned, that there are some potential novel applications for search, using things like neckties to give people new entry points into our collection.

One downside that we discovered on the archives end, which was solely on our shoulders and not on the shoulders of the system, was the need to think very carefully about the kinds of tags that are appropriate for this type of descriptive work. We found that very specific tags were much more helpful than broad conceptual tags: "student activities" was too broad a subject term to use in this context, while narrowing it to things like "football" or "classroom" was much more beneficial. I think that this type of photo description could also allow for more non-expert intervention in the photo description process, allowing people other than archivists to contribute to this work as well. And I'm going to hand it back to Matt now.

Thank you, Julia. We also had some conclusions about software design and user interface design. Our observations and user feedback from our archivists, and from some metadata specialists and digitization specialists who helped test this project, showed that these were expert users, and they needed a system more akin to what you might get in a 747 cockpit than the sort of simple interface we might want to present to a public user. Our experts needed that fancy interface; they needed ubiquitous and multifaceted entry points into the metadata. There were points in sorting and filtering through the collections where they said, "Oh, it'd be great to have visual search here, and also here, and I want to be able to pull up the archival context over here as well." We could only do so much within the limited time span of this initiative, but the feedback on UI design was incredibly useful for understanding what kinds of work would be most useful to these expert users in the future.

Something we would want as well is data provenance: the ability not only for our archivists to check the work that the computer had done, but also to check the work that each other had done, or to go back and undo mistakes or reverse decisions that they had made. We would also want a variety of visual clustering and search methods. We only implemented one here, but being able to search on things like overall composition versus a more object-based detection system could be very useful for certain kinds of photographs, or certain kinds of tags that they may want to apply. And this brings us to some of our implications for larger computer vision research.
As people know, these off-the-shelf models like Inception and ResNet are built on very large, generic databases such as ImageNet, ImageNet being a database of primarily born-digital color photography. Of course, that's very different from the data set that we're using, which is black-and-white photographs coming from cultural heritage collections. There's a real need in the field for pre-trained models that come from cultural heritage collections, from historical photographs that might use diverse photographic processes, rather than relying just on born-digital photographs. We'd love to be able to do tasks like classifying the physical source, did this scan come from a negative or did it come from a print, or like orientation prediction, predicting the correct orientation of a photograph. Ultimately, though, we think that the biggest opportunities in this kind of work in visual cultural heritage are not just in creating fancier AI methods with fancier computer vision models underneath them, but in creating user interface systems that connect intuitively with artificial intelligence, using just enough machine learning to help the humans bring their expertise to bear in an efficient manner.

So, to summarize our key findings: any computer vision project that's going to happen in archives, libraries, or museums really does need a functioning digital asset management system, as well as organizational information about its collections in its metadata. Fully automated description by itself is really bad, but generic visual search, when it's connected to existing metadata and used in a human-centered workflow, can really supercharge the kind of work that metadata specialists do. There is a field-wide need for specialized computer vision training sets, not just ones using born-digital color photos, but ones using historical, non-born-digital photograph collections. And to repeat: ultimately, user interface design for computer vision metadata systems is just as important as, and probably even more important than, coming up with ever more advanced machine learning algorithms.

So, thanks for listening to our presentation. You can find the white paper at this DOI, and we are happy to be contacted with any questions. Thank you.