My name is Peter Leonard. I work at Stanford University Libraries, and I'm really excited to be speaking here today about something that I think is connected with what we're all reading about in the newspaper and what our students, our faculty, and our librarians are talking about. But I'm only going to talk about one of the letters in ChatGPT. I'm going to talk about the T, which of course, as most of us know, stands for Transformers. What really motivated this talk was thinking about all the excitement that our patrons and our colleagues, and maybe even our donor boards, have brought to us around ChatGPT. They've asked us questions like: why didn't the libraries invent ChatGPT? How can we chat with a library? What are you doing? There's all this excitement; apparently ChatGPT can write all of our papers and do all of our homework. But I do think that alongside that wonderful service from the for-profit company OpenAI, there are so many other interesting things out there that are freely downloadable, that are transparent about how they were trained, and that are retrainable by us or our colleagues on campus to do things that commercial models can't do. And frankly, things that we can run on our own computers, on $900 MacBook Airs or $1,500 workstations. There's a universe out there that doesn't really get talked about in New York Times headlines. So that's where I want to go.

In the current moment in AI, it's impossible to talk about any of this without talking about Transformers. The math is way above my head, but I think one of the interesting things is how often this initial gets used. We all know it's part of GPT and ChatGPT, but it's also part of the acronym for BERT and for Vision Transformers and many other cryptic acronyms we'll see throughout the rest of this talk. The notion here is that this is the next evolution in architectures: not specific networks, but network architectures. It takes us away from some of the previous ideas around recurrent neural networks and convolutional neural networks and tries to pay attention to many things at once. We'll see that notion of attention in, of course, the most important paper in this space, "Attention Is All You Need," which is about moving away from local patches and trying to look at the whole thing at once. When I say the whole thing at once, that could be a 30-second snippet of my voice recorded as a WAV file, or an entire line that I write with my hand, or maybe a frame of a movie. The reason Transformers are so powerful is that they seem to be better at getting at some of the really complex relationships between audio and text, or image and text, or other pairings we'll see. We found Transformers first in the space of machine translation, where their particular characteristics seemed to be good for certain things, but it's exciting to see Transformers being talked about not just in machine translation but in a lot of other domains that I think are relevant for us in the GLAM sector. If you're a faculty member or a graduate student researcher, you can think of this mainly as how we get complex forms of non-textual data into textual form, whether that's sound, or pictures of handwriting, or film frames. But Transformers also extend into the next phase, text and data mining, and we'll see a little bit of that with BERTopic.
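A quick aside for anyone who wants to see the idea rather than the math: below is a toy sketch, in plain NumPy, of the scaled dot-product attention at the heart of "Attention Is All You Need." It is purely illustrative, not any of the models discussed in this talk; the point is just that every position in a sequence gets to weigh every other position at once, instead of a sliding local window.

```python
import numpy as np

def attention(Q, K, V):
    # Every token scores its relevance against every other token in a single step...
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # ...softmax turns those scores into weights over the whole sequence...
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # ...and each output mixes information from all positions at once.
    return weights @ V

tokens = np.random.rand(6, 8)                        # pretend: 6 tokens, 8-dimensional embeddings
print(attention(tokens, tokens, tokens).shape)       # (6, 8): same length, globally mixed
```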
In the library space, we might say that Transformer models help us get some of our non-textual data into a place where we can do named entity recognition or put it into RDF triples, but they also extend a little bit into that downstream work itself. So we're going to talk about six ways in which I think Transformers are having an impact in the GLAM sector. We're going to talk about voice-to-text with Whisper. We're going to talk about handwriting recognition. We're going to talk about doing topic modeling on embeddings as opposed to the words themselves. We're going to explore a linked text-and-image network, CLIP. We're going to think about the notion of a conversation with an archive, something that has come up recently but is still, I would say, in the research domain. And finally, the notion of multimodal conversations that engage across the barriers of text, image, and sound.

I'm guessing that, of all of these, Whisper might be the one most people in this room have heard of. Can you raise your hand if you've used Whisper or have heard of Whisper? That's the one that's got a lot of purchase. It's a great tool, and it is freely downloadable from OpenAI. So despite all my talk about OpenAI and their ChatGPT product, it's remarkable that they have open-sourced and made available the Whisper weights. There are many different sets of weights you can download for Whisper, which is a speech-to-text model: there are versions you can run on an Android phone and versions you can run on a very expensive NVIDIA GPU, and you can make that decision based on speed, accuracy, or multilingual coverage. It was probably trained on about 680,000 hours of audio, most likely YouTube captions; that's the best guess. It doesn't necessarily do so well on artificial benchmarks, but it seems to do really well in the real world. It's basically an encoder-decoder transformer. There will be no test on these network architecture diagrams. The only thing I want to point out is how interesting it is that so much of our analysis nowadays happens to images: that strange blue and green square on the bottom left of the diagram is a mel spectrogram. Our tools for analyzing image data, in this case in PNG form, have become so sophisticated that we turn everything we're looking at, including sound, into a PNG file and just analyze that.

When we evaluate Whisper, and I'm sure this mirrors the experience of a lot of people in this room, you can look at artificial benchmarks or you can look at how it does against real humans. On the left here is an oral history from Stanford Libraries, transcribed by a real human as part of that oral history project: really important human labor, domain expertise, a kind of editorial judgment brought to a long-standing oral history project. On the right is Whisper, and I think it's genuinely debatable which one is quote-unquote better. Whisper is in some cases actually capturing more, such as the fact that the person said "okay" at a certain point in time. It doesn't get the particular abbreviations for Stanford's campus buildings, but that's okay. And Whisper supports more than just English, which is an important intervention into what can often feel like an English-only situation with deep learning models. It's probably too small to see, but the language list runs from Spanish and Polish down to Galician and Arabic.
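To make the "freely downloadable" point concrete, here is roughly what running Whisper on your own machine looks like. This is a minimal sketch assuming the open-source openai-whisper package and a hypothetical audio filename; in practice you would pick the model size that fits your hardware.

```python
import whisper

# Pick a size: "tiny" runs on very modest hardware, "large" wants a serious GPU.
model = whisper.load_model("medium")

# "oral_history.wav" is a placeholder for whatever recording you are transcribing.
result = model.transcribe("oral_history.wav")

print(result["text"])                      # the full plain-text transcript
for seg in result["segments"]:             # time-stamped segments, handy for captioning
    print(f'{seg["start"]:7.1f}s  {seg["text"]}')
```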
Not every language that is spoken is on Whisper's list, of course, but there are really interesting critical interventions that GLAM and other memory institutions are making into the Whisper models, and they can make those interventions because the weights are freely downloadable. At the National Library of Norway, they're retraining the Finnish weights for the Sami languages, the indigenous languages of northern Scandinavia. Those of you with a linguistics background will know that that's the right move: Finnish and Sami are both part of the Finno-Ugric language family, so you can take advantage of the fact that Finnish is already there and retrain it for a less commonly resourced language such as Sami. We have some of the Nuremberg transcripts at Stanford: human-generated German on the left, Whisper Large doing German on the right, and some of the differences are just punctuation rather than meaningful differences in what Whisper produces. So Whisper is probably the tool a lot of folks are already using. I've spoken with colleagues at many different institutions who are already deploying it, at least in an experimental phase, making sure it integrates well with their workflows and doesn't cost them quality. I will say that the most recent versions of Whisper, and specifically acceleration with TensorRT, can let you transcribe an hour of oral history in about two minutes, which is really stunning when you think about our backlogs.

One area in which I don't think transformers have had as much impact yet, but soon will, is handwriting recognition, primarily with the TrOCR model. Raise your hand if you've ever used Transkribus, the EU-funded, Austrian-based platform. Great. I love Transkribus; Transkribus is amazing. In the earlier versions of Transkribus, does anybody know how many pages of ground truth you needed? It's about 80 pages, right? Which was great, unless you had an 81-page journal you were trying to transcribe, and then it was kind of depressing to have that requirement. What's interesting about TrOCR and other transformer-based HTR models is that they are zero-shot: they can work on handwriting they have never seen before. So you need zero pages of ground truth rather than 80 pages. There are a lot of interesting papers out there saying this is a model that can be fine-tuned but doesn't necessarily have to be. The weights, which are available from Microsoft, believe it or not, are freely downloadable. You can get them yourself and fine-tune them on a corpus in a particular hand, but what's interesting is that you may not need to. This is a letter from the founder of Leland Stanford Junior University, Jane Stanford, writing to a senator, and I can tell you that this letter is not part of the training data for TrOCR. What we do is take it line by line, feed it into a GPU using the TrOCR weights, and we get basically a great record of what she wrote. The lines she was writing on this piece of paper, I'm not even sure they're grammatical; I think she lost her train of thought at the top of a new page. But it literally reads, "to what is necessary to be done to ensure the success of the University of the Future." Zero-shot, on a hand it had never seen before. And it's not only English: the released model weights are English, but you can adapt the model to other languages by fine-tuning on an aligned training dataset, if you have one you'd like to use.
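For the curious, this is roughly what that line-by-line workflow looks like in code: a minimal sketch using the Hugging Face transformers library and the publicly released TrOCR weights. The image filename is a stand-in for a single pre-segmented line of handwriting.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# The freely downloadable Microsoft weights for handwritten English text.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# "line_01.png" is a placeholder: TrOCR expects one line of handwriting at a time.
line = Image.open("line_01.png").convert("RGB")
pixel_values = processor(images=line, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```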
A colleague of mine at Stanford Libraries has done exactly that kind of adaptation. And TrOCR is not the only transformer-based HTR model. There's an interesting approach from the National Archives of Sweden called SATRN; again, transformers, you can see the T in there. Their models are trained on millions of documents. Interestingly, the Swedes first wanted to release three models, one for each century, I think the 17th, 18th, and 19th, but they found that performance was better when they combined everything together and released one meta-model. This is a letter written by Sweden's most famous author, August Strindberg. It's really hard to read, but that's not a problem. It basically says, "I've read your horrible novel," which is what Strindberg is writing, and that is actually what the model gets.

Let's move on from these first two models, which were basically taking human speech and human handwriting and moving them into the world of Unicode, and do something different: topic modeling. How many folks in the room have done topic modeling or have heard of it before? Yes, a wonderful digital humanities technique that we borrowed from information retrieval. And then BERTopic. Have people heard of BERTopic? It's an interesting variation on that concept. For those of you who are familiar with word embedding models or vector space models, BERTopic essentially does topic modeling on the embeddings, not the words themselves, which is a really different strategy from what we would do with a tool like MALLET and latent Dirichlet allocation. BERTopic really frames its work as clustering, and because of that, it assumes each document can belong to only one topic, which is a huge shift from how we would do LDA or MALLET work. But because of that simplifying assumption, that a document belongs to exactly one topic, it might actually be useful for those of us involved in cataloging, or thinking about subject headings, or other forms of description. Because it's a modern software stack, almost all of this is GPU-accelerated. And really the key thing BERTopic brings to the party, although it has a modular structure in which you can swap out different parts, is sentence transformers, which are amazing for short abstracts or tweets. They're not so great for modernist novels, so just as with traditional topic modeling, you have to think about chunking, breaking your documents up. This is a run on about 5,000 journal articles, and we can look at this live, in which we created the embeddings, reduced their dimensionality with UMAP, and then clustered them. What we get is essentially areas of topics in the Journal of Scandinavian Studies. You can see there are topics here around EU integration: should Sweden join NATO? Should it join the European Union? It sits very close to other issues of European integration involving Vidkun Quisling, who had certain ideas about how Norway should join Germany. So you can see these topics are all here. We're not limited to two dimensions; we can compress down to three dimensions instead, and there we see other topics existing in a three-dimensional space. You can see there's something about Strindberg and Bergman in the same region, both in the play-and-movie area.
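A minimal sketch of what a run like this looks like with the bertopic package follows. The abstracts file is hypothetical, and in a real project you would chunk longer documents first, exactly as described above.

```python
from bertopic import BERTopic

# One short document (abstract, chunk, tweet-length passage) per line; placeholder file.
docs = [line.strip() for line in open("abstracts.txt", encoding="utf-8") if line.strip()]

# Sentence-transformers embeddings, UMAP reduction, and HDBSCAN clustering under the hood.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(docs)    # each document gets exactly one topic

print(topic_model.get_topic_info().head())         # topic sizes and keyword labels
fig = topic_model.visualize_topics()                # interactive 2-D map of the topic space
```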
So that's BERTopic. What I want to think about next is this: I've just described how we might use transformers to create topic models of abstracts or academic work, but the really good question, for those of us in libraries and museums, is that a lot of the stuff we have isn't text. At Stanford we have something like 150,000 negatives taken by Andy Warhol; the Yale Center for British Art has oil paintings. For that, we really need to think beyond text. We need to think about linked text-image networks. In the past we would do this with convolutional neural networks, which are the sort of thing that labels cat, dog, bagel, banana on your phone. What's different about CLIP is that it is essentially a linked semantic space between linguistic tokens and pixel distributions. CLIP is actually one of the very important components inside DALL-E, the thing that lets you type "an avocado armchair" or "an astronaut riding a horse" and get a picture. That's the generative AI side. This is not generative; this is the analytic side. CLIP is essentially a vision transformer glued to a causal language model from roughly the GPT-2 generation. But here's what's interesting about CLIP: if you think about visual features on one side and a linguistic space on the other, CLIP essentially unifies them, projecting both into identically sized latent spaces. "Latent space" here just means a high-dimensional space of possibility in which you can go in a million different directions.

So consider this notion of a linked pixel-and-language space. What can you do with it? One thing you can do is say, show me where the cat is in this picture. Here I'm not going to draw a box or label the cat the way I would with an ImageNet-style convolutional neural network; instead I'm going to show the activations in the pixels for the words "the cat." So far that's not much more interesting than what we can do with ImageNet. What's more exciting is to think about a word like "formal" in a linguistic sense: where does that express itself in a pixel sense? I've chosen an image that I know is going to react; it's a white male in a suit. But the point is that it still generates an interesting heat map around the word "formal." Not "suit," not "executive," but "formal." This means I can take an image such as this one, I think it's a music professor in the bottom left, and ask, what is musicological about this? You'll see that the heat map highlights the keyboard and the speaker: that's music. But I can also ask, what's DJ-ish about this picture in a CLIP model? And you'll see that it focuses on the mixer board. That's what distinguishes a musicologist from a DJ in this case.

So you could use CLIP to find cats, as I showed earlier, but I don't think that's the most interesting way to use it. What you really want to think about is what CLIP enables, and that's a kind of evocative search. What I mean by evocative search: here's a screenshot from the FSA/OWI pictures, the 1930s New Deal photographs. I've mapped them all into CLIP and said, give me images of "a journey through the air." The first image is the wing of an airplane, the second is a paratrooper, I guess. There's a guy in a crane, there's a zeppelin, there's a kid on a swing at the bottom left. These are all journeys through the air, and they're way more sophisticated than searching for "airplane" or "parachute."
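Here is a minimal sketch of that kind of evocative search with an off-the-shelf CLIP model from Hugging Face. The photo folder and the query are placeholders; the point is simply that a free-text phrase and a set of uncaptioned images become comparable in the same embedding space.

```python
import glob
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "photos/*.jpg" is a placeholder for an uncaptioned image collection.
paths = glob.glob("photos/*.jpg")
images = [Image.open(p).convert("RGB") for p in paths]

inputs = processor(text=["a journey through the air"], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_text[0]     # one similarity score per image

for idx in scores.argsort(descending=True)[:5]:     # the five most "journey through the air" images
    print(paths[int(idx)])
```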
So we call this evocative search. You might be looking for something like "furry friends," and you might get chickens and dogs and things like that. You also have to remember that what you're searching for can be interpreted in many different ways: if I type "a happy family together," I get humans, but I also get ducklings and pigs and the like. We have a live demo of this on some of Andy Warhol's photography, and one of the things I like to do is try words like "speed." Here are images that score highly for the word "speed" in Andy Warhol's photography. Not "car," not "windshield": "speed." Another one is the word "love." And I think this is all very interesting, right? That's love, in a way. There's love. I should be a little careful with Warhol; it gets a little NSFW. But there are a lot of forms of it in the images.

I want to finish talking about CLIP with one point, which is that just because you have a network that contains a linked textual embedding and a pixel embedding does not mean you escape the problem of bias. In fact, you have twice the bias: the visual bias and the textual bias. Although one of the things I think is interesting to talk about is the way that could actually be used to our advantage. Does anybody recognize the Hangul characters here? This is bibimbap, the delicious rice dish. We looked for bibimbap using a multilingual CLIP network, so it understands many different languages and writing systems. But we've only given it photographs from Norway in the 19th century, and there are no pictures of bibimbap in 19th-century Norway. What triggers highly for bibimbap in this very white, European collection are either the dolsot stone pots, the kind of pot you might find bibimbap served in, or other notions of cooking. So we've given it an artificial example where we know it won't find the thing itself, but what it triggers on is still pixel distributions that are representative of the notion of bibimbap in this multilingual model. I have not yet tried the word for opera in Han characters on a large collection of images from San Francisco's Chinatown. The reason I mention opera is that a lot of those pictures are actually of Cantonese opera: performers and people in costume. I don't think they will trigger highly for the English word "opera," because they don't look like Western opera. I'm curious whether they will trigger for the Chinese word for opera. I don't know yet.

So we've got all of these interesting textual proxies for things we thought would be very expensive or impossible to translate into textual form. We've looked at Whisper, which takes audio into text. We've looked at TrOCR and SATRN, which try to take handwriting and put it into text. We've even looked at the way certain text-image networks let you explore a collection that is totally uncataloged, undescribed textually. And this raises the question: can we actually have conversations with archives once we have all this textual data? I think there are two points I want to make before answering that question. The first is that, of course, language models are not knowledge models. When we deal with ChatGPT, or when you have students or professors or colleagues talking to you about ChatGPT, that's a distinction that often gets elided.
It's so good at giving us, say, vegetarian Indian food recipes that we start to forget it's not a knowledge model; it's a language model. And linked with this problem is the notion that large language models often know the shape of the probable answer, and in many cases will hallucinate in order to fill that curve. So those are some problems to keep in mind when we think about what it means to have a conversation with an archive mediated by a large language model.

Given all that, it is something a lot of people are working on. Although the paper came out in 2020, I think people have only recently started using the phrase retrieval-augmented generation, or RAG. And there were ways to do this before people started talking about RAG: there was a great tool, PrivateGPT, built on LangChain, and LangChain continues to be a really important piece of middleware. If you're running a Windows PC with an NVIDIA board, you can download a tool called Chat with RTX, which lets you do essentially retrieval-augmented generation. So there are ways of doing this. What I would recommend is to use an open-source large language model, such as Vicuna or Alpaca or Mistral. And then what you really want to do is connect that large language model, preferably an open one, to a corpus or a sub-corpus, an archive that you or your memory institution has participated in curating, or that you think represents a viewpoint or a collection or a time or a place that is meaningful. You want to use certain layers of the LLM to do the question answering, but you want the facts, the answers, to come from your corpus, not from, like, fourteen-year-olds writing Reddit and Wikipedia.

What we have at Stanford is a really interesting collection of Silicon Valley history. I took about 34,000 documents, about 15 million words, all written between 1987 and 1997, and I wanted to have a conversation with those documents: not with the base model, but with the archive. Essentially, we wanted to create a ChatGPT frozen in time between 1987 and 1997. And the best question to ask that archive would be: will Apple be able to compete with Microsoft? This is what everybody was writing about. The answer: it's not clear in 1997 whether Apple will be able to compete with Microsoft; Microsoft has a significant advantage in terms of market share and resources, which may make it difficult for Apple to compete. That is exactly the right answer for 1996 or 1997. But I've heard that Apple is hiring a new CEO. How will that new CEO do? Well, Gil Amelio did a lot of great work at National Semiconductor. Spoiler: Gil Amelio did not save Apple, but he did hire Steve Jobs, which played an important role. So this proves we are talking with 1987 to 1997; we are not talking to the base model. These two examples are kind of weird, I admit, but I think there are cases in our corpora where there's more knowledge in the archive than there is in Reddit and Wikipedia. One of those might be that around 1987, the hottest thing in graphic design was Bézier curves in Illustrator. These were famously difficult to figure out; John Warnock had to ship a VHS tape in the box with Adobe Illustrator explaining how to use control points.
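Before the Bézier curves example, here is a deliberately minimal, from-scratch sketch of that retrieval-augmented pattern: embed the archive with sentence-transformers, retrieve the passages closest to the question, and hand only those passages to whatever local LLM you run. The chunked-archive filename and the local_llm call are assumptions, not any particular product.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder file: the archive, pre-chunked into passages separated by blank lines.
passages = open("archive_chunks.txt", encoding="utf-8").read().split("\n\n")
passage_vecs = embedder.encode(passages, convert_to_tensor=True)

question = "Will Apple be able to compete with Microsoft?"
hits = util.semantic_search(embedder.encode(question, convert_to_tensor=True),
                            passage_vecs, top_k=3)[0]
context = "\n\n".join(passages[hit["corpus_id"]] for hit in hits)

prompt = ("Answer the question using only the excerpts below, and cite them.\n\n"
          f"{context}\n\nQuestion: {question}\nAnswer:")

# response = local_llm(prompt)   # e.g., a Vicuna or Mistral call in your runtime of choice
print(prompt)
```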
So with one of these LangChain-style, retrieval-augmented setups, we can ask the archive to give us information about Bézier curves from the collection and then cite its work, telling us which articles it's drawing on. The way it captures this very interesting interface is correct, speaking as a user of Adobe Illustrator. And more importantly, it cites its work: here's an article from December 1987, here's an article from February 1988, exactly when people would have been talking about Bézier curves. So that's the notion of retrieving sourced information from an archive rather than from the base model, which was trained on the open web.

But you might say our collections include way more than old, dusty computer articles. We have films, we have photographs, we have paintings. So how can we get these conversations into a multimodal situation, where we're able to interrogate much more than just the textual proxy? For that, we have to turn to multimodal networks. There are a lot of these out there; just in the last three months I've seen an incredible explosion in multimodal networks. I've chosen one from the Alibaba Group, which is based in mainland China, but there are many different models you could choose. This one works in English, and it uses some interesting technical tricks to handle text and image together, the task of image comprehension. It's easier for me to show this than to tell you about it.

Does anybody recognize this picture? This is the former president of Stanford, who left his post, let's say. One of the reasons he left was that he was accused of research misconduct by a student journalist, Theo Baker. Without passing judgment on what did or did not happen, he's no longer the president. This was a famous picture of him trying to exit a situation where the student journalist was asking him questions. So we asked the mPLUG-Owl model: what is the attitude of this man? What's going on in this picture? It says he appears to be in a hurry, and his posture and the way he's walking indicate that he is focused on his destination and trying to reach it as quickly as possible. Which is true, because the student journalist is behind him, trying to ask him a question.

Does anybody recognize these people? These are former Apple executives: that's John Sculley in the middle and Michael Spindler to his right. But what if we talk to this multimodal network without letting on that they are executives? And what if we ask the question most of us have in our minds right now: pretend these people are starting a death metal band; what would their band name and album title be? What the mPLUG-Owl model comes up with is that they're The Forsaken Executives and the first album is Echoes of the Boardroom, and the theme is corporate power and the potential destruction it can cause. Here's what's interesting: that was my real prompt. I didn't say, "what would these executives' first album be called?" It was able to read the picture, understand that they were executives, and then construct the joke.

Our art librarian at Stanford, Lindsay King, had some really interesting work that had just come in, that she had acquired, and she asked the multimodal model: why might somebody be interested in looking at this picture?
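An aside before the model's answers: scripting this kind of question-asking does not require anything exotic. Here is a hedged sketch using the Hugging Face visual-question-answering pipeline with a small ViLT model, not mPLUG-Owl itself; it returns short labels rather than the free-text descriptions quoted below, but the shape of the interaction, an image plus a natural-language question, is the same. The image filename is a placeholder.

```python
from transformers import pipeline

# A small, openly downloadable vision-language model; swap in a larger one as needed.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# "art_print.jpg" is a placeholder for whatever image you want to interrogate.
answers = vqa(image="art_print.jpg", question="What is the person in this picture doing?")

for candidate in answers:
    print(candidate["answer"], round(candidate["score"], 3))
```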
For the tennis print, the model comes back and says it's a unique and creative representation of a woman playing tennis; that's true. It says it's done in a black-and-white style, which is not true, although it's not a full photograph either, so I'm forgiving it that. It talks about timelessness and elegance, and it does mention that it's a dynamic pose, that she's holding a tennis racket and appears to be in the middle of a swing, that it can evoke emotions. Some of this is LLM cruft; you can all recognize the way ChatGPT talks nowadays. But the point is it can see the picture and it can interpret it.

This is a still from a 1971 film called Walkabout. I wasn't sure what the model was going to say about this frame. It could have said a bunch of things that would have been horrible, but I'm glad it only said this: it said it was surprising to see a boy dressed in a public-school uniform in the desert, because this is unexpected. And that, I think, shows us the power of these multimodal models. They're able to find what's interesting or contradictory or unexpected in images.

My last example from mPLUG-Owl is this one, and I apologize for the video quality; I was so happy this worked that, like an idiot, I took a picture of the screen with my phone. What you're seeing is a video from Carolee Schneemann, who was a very important artist working from the '60s and '70s all the way up until she passed away. Schneemann's archives are at Stanford University. This is a film, an MP4, as you can see at the top: a clip of a kind of installation, a piece of conceptual art that Schneemann made and that we have on video. I didn't capture my prompt, but I think I asked, what is happening in this movie? It doesn't get it 100% right, but it talks about a wooden horse bridle hanging from a wall, with a leash attached that is dangling, and the horse bridle surrounded by colored lines. It has done an amazing job of comprehending something that is not an object like a cat. It's not an object like a skateboard or a bagel or a banana. It's able to capture what's going on, even if it isn't using 100% of the right terminology. And it ends by describing the overall atmosphere as artistic, which I think is interesting.

So in this talk we've explored a whole bunch of ways that transformers can be useful for cultural heritage work: taking voice into Unicode text, taking manuscripts into Unicode text, doing topic modeling on embeddings rather than words, all GPU-accelerated, with BERTopic. We've talked about text-to-image networks, which create these dual spaces of linguistic tokens and pixel distributions, and what you can do with that. We've talked about the notion of conversing with an archive, using certain layers of the LLM for the conversational power but drawing the facts from an archive that we maintain. And finally, we've looked at the current state of the art in multimodal conversations with some of these multimodal models.

I want to end by prompting some questions. As a field of folks in the academic, library, and museum space, what is the right balance between closed, hosted models and open, locally modifiable models? Whenever we paste text into the free tier of ChatGPT, what are we doing, versus when we devote resources to, for example, training a speech-to-text model for an under-resourced language for which we hold the recordings and the textual transcriptions?
There's the question of whether we should actually be producing, and not merely consuming, large models, whether they're visual or multimodal or textual. There's the, I think, provocative question of whether I should be able to search Warhol's photography by typing the word "love." Is that something we want to allow? What could go wrong with that? And what would we gain by letting people type those kinds of abstruse queries into the search box? There's the question of, once we get all this text out of the oral histories and the video recordings and the manuscripts, how will that change LLMs? If much of women's writing wasn't printed, except in very rare cases, until the late 19th or early 20th century, what do we gain by digitizing letters written by women, and how will their language inflect LLMs? There's the future of image description in this world of evocative search, when maybe not everything needs a caption. And finally, how far are we from a world where we can just say, show me every scene in a Bergman film where there's reconciliation? That's actually a null set, but you can imagine other examples in the filmic tradition where we could query for these things and have them appear on our screen. And with that, I'll stop.

Hi, Lisa Hinchliffe at the University of Illinois. That was a really, really phenomenal, succinct overview. I'm curious, as you're thinking through the outputs that you're seeing here: at what point do we ask ourselves to what degree, if these things are being output, they should also be captured, if you will, even, one might say, published? What is our role, or do you have thoughts, let's put it that way, about the role of, basically, you're pulling stuff out of an archive and creating something new. Is that an ephemeral object? Is that a moment that is just for the person in conversation with the archive? Or is there something about republishing, about extending the archive in some way, with that?

Yeah, that's a great question. I hopped right to the end user, but I could imagine that if you were a curator or a librarian or an archivist who had accessioned or described or helped to purchase a large visual or textual collection, these tools would have a curatorial function as well. At the Beinecke Library many years ago, I think Nancy Kuhl curated an exhibit on blue, just everything blue from the Beinecke, expertly chosen, you know, albums called Kind of Blue and blue things. You could tackle that through a whole bunch of non-machine-learning approaches nowadays. But I think the notion becomes really powerful when the person using this isn't new to Warhol's photography but has seen a lot of Warhol's photography: what would she or he gain from that interface? And could those results then make their way into, for example, a Spotlight exhibit or something like that?

Thank you for this fantastic talk and for showing us a little bit of, I guess I wanted to say the future, except you're showing it to us, so it's already here. The question I have is around the idea of prompt engineering and querying these archives and collections in all the different formats we have. You showed some fantastic examples, and I'm curious about your success rate, because you showed the successes, but you didn't show, or at least I don't recall you showing, anything where you said, I tried a prompt and it completely failed.
So I'm just curious whether you can talk a little bit about how you fared getting to the point that led to this great talk. Thank you.

That's a great question. When I first prompted on the Apple executives, I purposely said, "these executives are starting a band," and it gave me a really funny answer, and I thought, wait a minute, I just told it they were executives. Can I go back and not tell it that they're executives? And I was happy that that worked. One thing I think unifies the commercial, non-downloadable, closed-source models like ChatGPT and the open-source models such as Vicuna is that they are non-reproducible, non-deterministic, totally stochastic. So you have this cherry-picking effect where you're going to pick the response that you want to show; that's kind of like the XKCD comic we saw in today's keynote. I do think you're right about prompt engineering, and there have been archives I've made mistakes in asking questions of. You get not wrong, but partial, answers when you imply there is a monolithic answer to a question: what was the response to a historical event? The fact is there are probably thousands of responses to a historical event, whether that's 9/11 or the assassination of Martin Luther King. Asked that way, it's a great example of how the model knows it needs to say something about an outpouring of grief and sorrow, and that's not wrong. But we can prompt it differently and say, what were some of the varying responses to the 9/11 attacks? What were some responses from folks who didn't feel their perspectives were reflected in the media? This also requires us to think about how comprehensive our archives are: how much are they representing voices that maybe weren't part of CNN or C-SPAN? But I do think the best way to learn how to prompt this stuff is to try it on collections that people in this room know, and maybe have spent a career building, because then you have the best chance of figuring out whether it is actually giving you something new or just producing a probable answer. Carol, yeah.

Hi, Peter. Can you hear me okay? Yes. Yeah, great. Thank you so much. As always, great to hear what you're up to. I have a question, and it's really coming out of what we've seen around the furor over AI, the real desire for people to attribute meaning to LLMs, right? As you're working through these, you're thinking about people accessing this material, accessing collections. So we're thinking about user experience, about people coming into these collections and maybe getting into them through evocative search, as you called it, which, I don't know if the term is yours or not, but it's a great term. My question is, knowing what we've just been through with LLMs, how do you envision framing these kinds of searches for educational purposes, for research purposes? How do we imbue this with the literacy, the information literacy, that I think all of us endorse?

It's a great question around user expectations, and the fact that, on one hand, people haven't had a chance to build up a set of expectations around LLMs, except that on the other hand they have, because they've been using them to do their homework or write their grant reports or things like that. But I agree that as memory institutions, as the GLAM sector, we have a higher standard to meet.
There are all sorts of defective experiences in searching large cultural catalogs. I remember when I was at Yale and they were building the terrific LUX product, the search across all Yale collections from natural science to British art; they were thinking about these questions. What happens if I type the words "African American" into the search and I only get pictures of enslaved people? That's not a great user experience. What are the ways that metadata can be used to remediate that experience while not falsifying the historical record? I don't know, and I think it comes back to that issue of the double bias problem in text-image networks. And if you think double bias is a problem, what about when you have a multilingual model? What are you going to do about that? I think it is an opportunity to explore and to engage with stakeholders across many different dimensions, linguistic and cultural and socioeconomic, about how these things might work. And the only thing worse than the situation we find ourselves in now, where we haven't explored that, is a world in which vendors lock down that experience and it becomes even more opaque than it is now. I don't want to hold anybody back from coffee, so I'll stop there. Thank you.