This is Peter Leonard, and today I'll be presenting on neural networks: machine vision for the visual archive. What I'm going to be showing you today comes from the Digital Humanities Lab inside Sterling Memorial Library in New Haven, Connecticut. I'd like to acknowledge the work of my co-workers Doug, Kathy, and Monica on many of the projects and screens you'll be seeing today. We've tried to organize this talk around three easy-to-understand use cases that get at some of the practical, real-world things that machine vision might be able to do for those of us in the library field.

The first of these use cases is the following: given a picture of interest, how can I see more things that are like it? This problem has motivated a couple of interesting projects over the last few years in the field of digital art history and digital humanities. The noted programmer John Resig has used MatchEngine, a product of the commercial company TinEye, on both his own collection of Japanese woodblock prints gathered from digital collections across the globe and the PHAROS consortium of museum and library photo archives. Now, TinEye MatchEngine is a commercial product, so we don't have a lot of insight into exactly how it works, but if you look at some of John Resig's work, or the paper noted at the bottom of this page, you'll see some really terrific results on image matching in those two domains: Japanese woodblock prints and reprinted, reproduced photographs from museums. Carl Stahmer, who was then at the University of California, Santa Barbara, also did some really terrific work on the English Broadside Ballad Archive, and specifically on ballad woodcut impressions, where his own implementation of a particular kind of machine vision was able to illuminate the reuse of these pieces of clip art across all sorts of early English broadsides. I encourage you to look at his paper down below, where he talks about his own Arch-V engine.

These projects are really terrific, and they inspired us at Yale to think about our own collections, one of which I've put on the screen here: the Meserve-Kunhardt Collection. It's perhaps one of the largest private collections of 19th-century photography in the United States, and it chronicles American history from the Civil War to the Gilded Age. It's best known as a repository of famous photographs and depictions of Lincoln, but it actually contains over 73,000 images in total, which are being digitized in extremely high quality as we speak.

Now, one of the things that has really transformed machine vision in the last few years, specifically since 2012, is the use of convolutional neural networks for image captioning, or image description. This is just a diagram of the way that convolutional neural networks operate, and in a sense they mimic our own brain's system of vision. The human visual system proceeds from very simple identifications of lines and simple patterns, up through more complex patterns and agglomerations of different visual features, and finally arrives at a kind of recognition or description of an image: that's my uncle, or that's a dog with a hat on. That's the same way convolutional neural networks work. They start with very simple layers and progress to very complex layers that culminate in descriptive labels like cat or dog, chair, bagel, or banana. These networks have proven very effective at image description.
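To make that labeling stage concrete, here is a minimal sketch, not the lab's own code, of running the pretrained Inception-v3 network that ships with Keras over a single image; the file name photo.jpg is a placeholder.

```python
# Classify one image with a pretrained Inception-v3 network (ImageNet labels).
import numpy as np
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = InceptionV3(weights="imagenet")  # pretrained labeling network

img = image.load_img("photo.jpg", target_size=(299, 299))  # Inception's input size
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
# Print the top five human-readable labels, e.g. "tabby: 0.61"
for _, label, score in decode_predictions(preds, top=5)[0]:
    print(f"{label}: {score:.2f}")
```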
They're probably in use on your own Android or iOS device. If you open the photos app and search for the word cat, you'll automatically see all the images you've ever taken of a cat, even though you've never labeled them; what's happened is that a convolutional neural network has labeled them for you.

The problem, though, is that in a collection like the Meserve-Kunhardt collection of 19th-century photography, there are all sorts of wonderful 19th-century subjects that don't quite map onto the labels of 2018. The semantic field of 19th-century photography, heavily studio-based and incorporating types of images quite different from the ones we take with our cell phones today, including things like stereograms and portraits of Civil War generals, is just a totally different field, which means the labels these convolutional neural networks produce are not really appropriate. There just aren't that many images of bananas or bagels in the 1860s.

We've been really inspired by a paper given at DH2016 in Kraków by Benoit Seguin and other European researchers, who presented a case study on the Cini archive of art photography. The main claim of their paper, which is listed at the bottom of your screen, is that the semi-final layer of these convolutional neural networks for image captioning, that is to say, the layer before the final layer with labels, might actually be more useful for certain types of digital art history than the final labeling layer, which tries to decide whether something is a cat or a dog. The insight that Seguin and others showed in this paper is that because this second-to-last layer, this penultimate layer, is more abstract, it isn't as specific as cat or dog, it might actually be more useful in domains where precisely identifying whether something is a bagel or not is not the main project. Instead, the real goal of many projects is to determine visual similarity, as we set out in our original use case.

So we took this intuition from the team that presented in Kraków, and what we've done is build a demonstration application called Neural Neighbors: Pictorial Tropes in the Meserve-Kunhardt Collection. I'm going to talk you through a little bit of how this particular program works. We take a convolutional neural network designed for image captioning; in this case we use Inception, which is provided by Google. This network comes in a pre-trained state: it can already discern the labels for modern, contemporary images, and Google spent a lot of time and money to be able to do that, training it on its own images until it successfully solved important image-description challenges like ImageNet. We take that semi-final layer, that second-to-last layer in the network, which, as we mentioned on the previous slide, holds a more abstract set of visual features than cat versus dog. In fact, it has about 2,048 ways of seeing that are impossible to describe but are useful in determining that something is a cat or a dog, a bagel or a banana. And in this high-dimensional space, this imaginary space with 2,048 different directions or dimensions, we find the approximate nearest neighbors of each image. And that's really all there is to it.

What I'd like to do now is show you exactly how this works. What we have on the screen now are just random images from the Meserve-Kunhardt collection.
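As a sketch of this pipeline, and under the assumption that it follows the steps just described, here is one way to extract the 2,048-dimensional penultimate activations with Keras and index them for approximate nearest-neighbor lookup with the Annoy library; the file names are placeholders, and this is an illustration rather than the lab's actual implementation.

```python
# Neural Neighbors-style pipeline: penultimate Inception features + ANN index.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from annoy import AnnoyIndex

# include_top=False with pooling="avg" yields the 2048-d penultimate activations
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]           # shape: (2048,)

paths = ["img_000.jpg", "img_001.jpg"]        # ...tens of thousands in practice
index = AnnoyIndex(2048, "angular")           # cosine-like distance
for i, p in enumerate(paths):
    index.add_item(i, embed(p))
index.build(20)                               # build 20 random projection trees

# The ten images most visually similar to image 0, with no text metadata involved
print(index.get_nns_by_item(0, 10))
```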
There actually isn't anything important about each image; they're just drawn from this large collection of tens of thousands of images. The important part comes when I move my mouse over this particular image here on the top left. You'll see that as I do, I see other images which are visually similar: they're men in army uniforms, they have brass buttons on the front, their jackets are dark, and they tend to be on a relatively light gray background. Whereas if I move my mouse over this woman here, second from the left, you'll see that I get images which are very similar to her: the hairstyles, perhaps some of the decoration around her neck, the sense that it's all sepia-toned, the elaborate hair. And this is true of all the images I could mouse over. When I mouse over an image in an oval, I pull up other images in ovals; when I mouse over an image of a town with a steeple, I tend to see other images that include steeples and towns; people sitting in very particular ways bring up other people sitting like that. I can reload this page and look for, for example, other people outdoors in large crowds, people wearing military uniforms, people in a very particular kind of frame. All of this is being determined by the semi-final layer of the convolutional neural network, with approximate nearest neighbors finding the images that are closest to, in this case, this particular actor or this particular soldier.

None of what's being used is the text. We're not using the text at all. We can get the caption of the image if we need to, but it's not being used in the processing. This is pure raster processing, where we're taking signal only from the images in front of us and not from their descriptions.

So that's Neural Neighbors. Let me return to my slides and show you some of the more interesting examples we found. This one I like to talk about in particular because it shows some of the things the neural network is sensitive to and some of the things it's insensitive to. In this example, I've moved my mouse over Johnny Giles here at the bottom, and what comes up are all the other 19th-century pugilists, all the boxers in the collections. Notably, you'll see that some of these boxers are actually turned the other direction, and one of them, John Banks, to my eyes, appears to be African-American. What's interesting about that is the robustness of this particular approach: seeing in 2,048 ways, and using those 2,048 ways of seeing to identify images which may be flipped, or may contain a different skin tone, or perhaps a different color of pants. It's an incredibly powerful way to surface images that resemble an image you care about. So that was our first use case: how can we find images similar to an image of interest?

For the second example of a real-world use case where machine vision might be useful, let me give you the case of how a large-scale visual collection might be able to organize itself. That may seem like a strange way to put it; we're not used to thinking of data as having its own autonomy. But what I want to explore in this set of slides is a tool we've been working on inside the Digital Humanities Lab called PixPlot, authored by Douglas Duhaime.
This tool actually starts very similarly to how Neural Neighbors works. We process all sorts of images, tens of thousands of them, with the Inception convolutional neural network, provided for free by Google, which produces those labels you might remember: cat, dog, bagel, banana. And again we use the second-to-final layer, the penultimate layer, which gives us not those final labels but rather 2,048 abstract ways of seeing. This gives us a high-dimensional space, and rather than finding the approximate nearest neighbors in this space as we did with Neural Neighbors, we're going to try to compress this high-dimensional space into two dimensions. I can't show you 2,048 dimensions on my computer screen, but if I'm able to reduce the dimensionality of this space to two, then I have a hope of being able to show you the entire image collection organized according to visual similarity.

We're going to use a particular dimensionality reduction called UMAP, which stands for Uniform Manifold Approximation and Projection. For those who keep track of modern developments in dimensionality reduction algorithms, this is similar to t-SNE, or t-distributed Stochastic Neighbor Embedding. Like t-SNE, it preserves local clusters, but it also does a better job of preserving global structure in the reduced-dimensionality visualization. This will become a little clearer once I advance to the demonstration. We have a WebGL technology which allows us to visualize tens, if not hundreds, of thousands of images at the same time, and in the resulting visualization, similar images will appear nearby.

So let me go ahead and show you what this looks like. What we're doing now is loading, in this case, about 27,000 images into my web browser, which is probably more than most people have ever seen on a single screen. It's almost as if what I wanted to do was take those 27,000 images and throw them on the floor. But if I really threw them on the floor, they would land pretty much randomly; they wouldn't land so that images with similar visual content were clustered around each other. In fact, though, that's exactly what's happened here. I've thrown these images on the floor, and all of these vignetted portraits have landed in this neighborhood here, at the bottom left of my screen. There are other neighborhoods as well. There's a cluster of buildings over here; let's zoom to the right, and you can see that all of these Second Empire buildings in the East Coast cities of the United States, oftentimes shot at 45-degree angles, are arranged around each other, as if they had all found their own place after being thrown on the floor. There are other examples, like landscapes: I can zoom down here, and you'll see that the buildings go away, and what you get instead is the natural environment, fields and mountains and trees. Finally, I have an example of performers, or people in particular military uniforms; let's zoom down to this cluster of military images down there.

Let's zoom out a little bit, look at the entire visualization, and just consider what this is doing. What we're showing you are 27,000 images, all at the same time, in your web browser, thrown into a virtual space, and that virtual space actually represents a dimensionality reduction from 2,048 dimensions down to two.
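Here is a minimal sketch of that layout step, assuming the 2,048-dimensional Inception features have already been saved to a hypothetical file inception_features.npy, one row per image; the umap-learn library stands in for whatever PixPlot itself uses.

```python
# PixPlot-style layout: compress 2048-d image features to (x, y) screen positions.
import numpy as np
import umap  # pip install umap-learn

features = np.load("inception_features.npy")  # shape: (n_images, 2048)

reducer = umap.UMAP(
    n_components=2,    # two screen dimensions
    n_neighbors=15,    # balances local clusters against global structure
    min_dist=0.1,      # how tightly points may pack together
    metric="cosine",
)
positions = reducer.fit_transform(features)   # shape: (n_images, 2)

# Each row of `positions` can now be handed to a WebGL scene as a sprite coordinate.
```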
This allows the entire collection to find its own organization, to find its own patterns of similarity, and allows you to explore, for example, the conventions of depictions of women in this cluster, just by zooming around and seeing how these women were represented in studio contexts. I think this is a really powerful way of investigating collections that would otherwise be far too large to visualize all at once.

So that's PixPlot, and we also have an example of this not only on the Civil War material but on our own Yale Center for British Art, with, in this case, about 31,000 images from our British art museum. This is the same notion as with our Civil War photography. I can zoom into these constellations and find, and this is my favorite, a cluster of botanical drawings from the Yale Center for British Art. These are all leaves and flowers and other illustrations of the natural world, and they have all found their own neighborhood; they've all clustered around each other. This work on the Yale Center for British Art is just the same as the previous work on the Civil War photography: a dimensionality reduction from 2,048 ways of seeing down to two dimensions. And in this case that places all the vases in this particular neighborhood. We have other neighborhoods too. There are caricatures of dandies wearing fine clothing in this area of my browser window. There are painted landscapes over here, which are nice to see. I have other clusters; I think there's one of animals. Let's see if I can find my horses, right here. And over here we've got all these depictions of the hunt and of various animals, in oil paintings, drawings, pen and ink, all sorts of examples that have been discovered here, with more clusters hiding out at the top right there. This is, again, a thematic way of exploring all the material in the Yale Center for British Art, all on your screen, all at the same time. And it's a really exciting way of thinking about imagery at the scale of tens of thousands, which wouldn't have been possible before.

The PixPlot software is available on GitHub, and we have some important directions we want to take it going forward. We want to be able to link back to the museum or library digital system that actually stores the formal record, which might be available as a high-resolution TIFF and, of course, has all of the important human-generated metadata: the provenance, the genre, the artist. All of that information would be found on the system of record, and we're going to link back to it. We're excited about making the screen resolution of the images even sharper, and we may investigate the IIIF framework as a way of dynamically pulling in higher-resolution textures as you zoom into the visualization. We also intend to use the WebGL technology to animate between different states of this visualization, so that we might animate how different dimensionality reduction algorithms position the items in two dimensions. We could also animate between different layers in these artificial neural networks that mirror how our eyes see, clustering and grouping the images according to more abstract or more precise layers in that network. And most exciting, I think, we want to really investigate curatorial tools for this PixPlot software.
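For readers unfamiliar with IIIF, the idea of pulling in sharper textures on zoom can be sketched with the IIIF Image API's URL pattern; the server base URL and identifier below are hypothetical.

```python
# Request the same IIIF image at increasing pixel widths as the user zooms in.
def iiif_url(identifier, width, base="https://images.example.edu/iiif"):
    # IIIF Image API pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f"{base}/{identifier}/full/{width},/0/default.jpg"

print(iiif_url("meserve-0001", 128))    # low-res thumbnail for the zoomed-out view
print(iiif_url("meserve-0001", 1024))   # sharper texture once the user zooms in
```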
We want to think about perhaps surfacing metadata facets, such as whether something is by a British painter or an Italian painter, or whether something is an oil painting or a sculpture, by way of colored borders that might surround each of the tens of thousands of images. You could say: show me only paintings by Titian, or show me things in different colors that allow you to interpret the spread of images in new ways. We also want to consider tools you can use to directly interact with this visualization, for example a lasso tool, similar to what you might find in Photoshop, to make sub-selections. Remember all of those botanical images? If I could draw a lasso-style line around all the images that seem to be botanical illustrations, I'd be able to create a sub-corpus, which could be really exciting for people who are perhaps seeing these collections for the first time all on their screen.

And something very technical we want to tackle is the notion of transfer learning on top of neural networks. So far, what we've been doing is using the penultimate layer to avoid the domain-specific captioning that is too contemporary, too anchored in 2018. We want to be able to retrain that final layer and, instead of discarding it, use it to make judgments about what we're seeing, as sketched in code after this section. Are we seeing a Confederate or a Union officer? Are we seeing a military portrait or a portrait of a woman involved in the performing arts? Those are some possibilities: with enough caption training data, we can adjust the networks to be responsive to our own academic questions rather than discarding their final labels as we do now.

So the third and final use case I want to talk about in terms of machine vision for visual archives goes further down the path of the agency of these algorithms and the very sophisticated graphics hardware we're using to run them. And that's to allow these networks not just to organize collections, but to dream up new collections that have never existed before. The notion of creating human culture, given enough observations of existing, empirical human culture, is something that people have been doing in the textual domain for a long time. This is an example of an artificial recipe created by a particular variant of a neural network called a recurrent neural network, which has observed tens of thousands of actual recipes. What it's done is build a model of recipes that can then be sampled from: we can generate new ways of cooking from the empirical observations of tens of thousands of recipes. If you take a minute or two to read this recipe, you'll rapidly discover that if you actually tried to make it, there's no guarantee the food would be edible. But what's amazing is that the neural network has learned the genre of a recipe. It's learned that recipes have categories, that they yield servings, that they begin with ingredients which have amounts and units, and that there's a bottom portion with step-by-step instructions on what you should add. Again, no guarantees about the food safety of this particular recipe, but remember that it was created automatically by a machine that had observed tens of thousands of real recipes. So the question is: what can we do in the visual domain to generate new visual imagery out of the dream of a neural network?
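Picking up the transfer-learning idea from a moment ago: here is a minimal sketch, assuming Keras, of freezing Inception-v3's convolutional base and retraining only a new final layer on period-appropriate labels. The label set and the training dataset (train_ds) are hypothetical.

```python
# Transfer learning: reuse the frozen 2048-d "ways of seeing", retrain the labels.
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3

labels = ["union officer", "confederate officer",
          "studio portrait", "performer"]      # hypothetical caption training labels

base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                         # keep the pretrained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(len(labels), activation="softmax"),  # new final layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds: a tf.data.Dataset of (299x299 image, label_index) pairs
# model.fit(train_ds, epochs=5)
```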
And the particular technique we're going to use to do this with our 19th-century photography involves generative adversarial networks, which are actually two networks placed against each other, two networks working at cross purposes. In a generative adversarial network relationship there's a forger network, which looks at all these pictures of 19th-century Americans and tries to dream up new Americans who never lived. But there's also a detective network, which looks at all of these photographs, both the real ones and the counterfeit ones, and tries to determine which are fake. What's interesting about generative adversarial networks is that each of the networks learns from its mistakes. The forger learns to dream more accurately when the detective successfully detects a forgery, and the detective learns to become more discerning when the forger manages to get a false image past it. And what's exciting is that this happens tens of thousands of times a second, for days and days.

What you actually end up with is what I'm showing you on the screen here: the outcome of a fight between two networks, one of which observes 19th-century faces and tries to create new ones, and one of which tries to discern the false from the real. What you see on the screen can be thought of in a couple of different ways. Some people look at these faces and see something uncannily like a real human face. Other people find something very disturbing: we have almost Goya-esque images, with too many noses and not enough eyes. But what's interesting is that none of these people ever lived. They're not real 19th-century Americans; they're plausible Americans. What you see is the result of a continuing dialogue, a conflict between two networks, one desperately trying to produce faces that pass muster, that pass as real human beings, and the other dead set on preventing false imagery from passing its test for legitimacy. What you end up with is a kind of fever dream of 19th-century portraiture, generated by these two networks working at cross purposes, based on an incredibly large dataset of faces from which they can observe and dream up Americans who never lived.

So with those three examples of real-world use cases in which machine vision might be useful in the library context, I'll thank you and take any questions.
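To make the forger-versus-detective dynamic concrete in code, here is a minimal sketch of one GAN training step, assuming TensorFlow/Keras and toy-sized architectures on 64x64 grayscale crops; these stand-ins are not the network that produced the faces shown in the talk.

```python
# One adversarial training step: the forger (generator) dreams faces from noise,
# the detective (discriminator) scores real vs. fake, and each learns from its mistakes.
import tensorflow as tf

LATENT = 100  # size of the noise vector the forger "dreams" from

forger = tf.keras.Sequential([            # generator: noise -> fake 64x64 image
    tf.keras.layers.Dense(8 * 8 * 128, activation="relu", input_shape=(LATENT,)),
    tf.keras.layers.Reshape((8, 8, 128)),
    tf.keras.layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh"),
])

detective = tf.keras.Sequential([         # discriminator: image -> real/fake score
    tf.keras.layers.Conv2D(32, 4, strides=2, padding="same", activation="relu",
                           input_shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),             # logit: higher means "looks real"
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], LATENT])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = forger(noise, training=True)
        real_score = detective(real_images, training=True)
        fake_score = detective(fakes, training=True)
        # Detective improves when it labels real as real and forgeries as fake...
        d_loss = (bce(tf.ones_like(real_score), real_score) +
                  bce(tf.zeros_like(fake_score), fake_score))
        # ...while the forger improves when its forgeries are scored as real.
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, detective.trainable_variables),
                              detective.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, forger.trainable_variables),
                              forger.trainable_variables))
```

Repeating this step over many batches for days is what produces the "fever dream" portraits described above: each network's loss pushes the other to improve.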