 Dr. Joe Pugh is a digital development manager in the archives sector development department at the National Archives. He leads the National Archives work to enable UK repositories to contribute remotely to the discovery catalog, for example. His digital research has focused on interface design, interface evaluation, and a wide range of methods for exploring and representing archival collections. Today he will tell us about five approaches to enhance, rewrite and recontextualize archival catalogs in order to meet the needs of current audiences. His paper is entitled, Diversifying Databases, Five Easy Pieces. I'm really looking forward to the five pieces and I hope they are easy to implement by all. Welcome, Joe. Good morning. I'm over to you now. Could you please turn on your camera? Thank you. So I'm really going to rattle through this. I hope you'll bear with me, but five allegedly easy pieces in 15 minutes is not very long. So why are we talking about this? Why are we talking about diversity in the context of databases? I mean, a lot of people have made it very clear that if we're thinking about diversity and we're technical people that diversity begins at home and that, you know, we can't have inclusion unless we're thinking about our data as well as other aspects of how service delivery works and happens. And what does that mean in practical terms beyond what Dr. Francis has said here? What we want are systems where people can find records that are of interest to them using terms they can think of and understand, that nobody should be compelled to use offensive vocabulary as a search term in order to find records that they want to find and that catalogs should hold what we know about records. Now, that seems like a rather fundamental, you know, what else we put in catalogs. 
But actually, a lot of what we as archivists know about records we don't put in our catalogs, and we're missing out context: things that are floating around in the heads of our staff members that are not necessarily encapsulated in our databases. So I'm borrowing from Stravinsky these five easy pieces. His first movement is Andante, and this is the idea that we're going to try and make the wrong words history. Now there are a whole bunch of projects out there looking at how we do what is now often called reparative description, where we know that our catalogs are replete with words that we wish they weren't replete with. And many people internationally are doing fantastic work in this area. Alicia Chilcott has laid it all out in her excellent publication. There's work going on at Library and Archives Canada, New York Public Library, University Library, sorry. The Tate have a TaNC project, and the National Library of Scotland have just published an inclusive terminology guide which runs to over 100 pages. And at the National Archives, my colleague Grace Van Morrick is looking into this as well. The thing about this work is that it is incredibly important, but it's also not sufficient. We have to do more than think about the use of individual words. And that's partly because this work is often quite bespoke. If we're working with an individual community to work out the most appropriate vocabulary, that is very time consuming. A lot of these projects are led by the archives and they're governed by policy. They of course focus on the most egregious words, or some of them do. And of course terminology keeps changing. So this is great and important work, but it's also never done, because we're going to look back on it in a few years' time and think, oh, how could we possibly have thought that those words were appropriate, just as we're doing right now?
So we have to think about what else we can be doing alongside these efforts. And I'm going to propose that one of the things we can do is say that it's time for the catalogue to hold many voices, what Stravinsky calls the Española. Born-digital records change the way that we think about metadata and they change the way we populate our catalogs. Matt Hilliard has proposed, in his excellent seven pillars of metadata blog post, which everybody should read, seven kinds of metadata. And a lot of these kinds of metadata come from different sources. They come from the depositor. They come from a system that we're running over a born-digital collection. They come from the archivist. They come from some other expert. And needing to do this is very archival, in that the reason for this coming about comes from some property of the records and not because of users. But that doesn't mean that we can't grab on to this, seize hold of it and get something valuable out of it for our users as well. Now that we're so interested in these different kinds of metadata and the provenance of metadata, that gives us the opportunity to do something different with description. The idea that there will be one description for an object suddenly looks like a very old-fashioned idea. What we're going to have is catalogs that speak with many voices, what I'm calling polyphonic description. And what that means is that we're holding multiple views on an object from multiple different sources, and it's very clear what that source is, where the provenance of the information is from. So it might be an archivist writing today or writing a very long time in the past. It could be somebody else with a connection to the records. It could be an academic. It could even be a computer system. The point is, it will be very clear when we're viewing a description whether it is, for example, from an AI or written by a human being of various kinds.
So to take an example from CO 1069, from the Foreign Office photographic archive. This is the description that we have at the moment: "The escort who preceded me during my last visit to Hong Bori village. Note the trumpeter." Now that looks like an archival description in the sense that it's in the catalog, but it's actually written by the person who took these photographs. That's an incredibly valuable piece of information about this document. We know who took this photograph: it was taken by Arnold Hodson. The catalog doesn't conceal that. But at the item level, it perhaps doesn't make it very clear where this has come from, and that this is written by the record creator and not by the archivist. But should the record creator's statement be the last word on this document? I'm not sure that it should be. We might add an additional description written by an archivist that will be slightly different. And we can add another description again, written by the AI. This is by Microsoft's computer vision algorithm. It's not very good. Maybe back to the drawing board on that one, Microsoft. Nevertheless, if we had a collection that we had digitized but didn't have descriptions for, we can start to see that we can do useful things with these AI descriptions, even if we perhaps don't want to put them immediately in front of people. But we're developing the idea that we might want to treat different descriptions in different ways, and that the provenance of each description is really obvious. We certainly don't want to put a computer description in front of people and pretend that it's a human description, and vice versa. So what this is going to look like in reality, how we might display these descriptions, I think is quite a vexed area. But in terms of holding the data, sorry, this will be a very, very brief kind of excursion into code.
But the idea is that you have a core description. So you have something at the bottom here, and we have P1 at the top here, which is a specialization of P. We have a description and we have these agents. Agents are going to be a big part of the new way that our system will work, so it'll be very clear who's created this description. And then here, in this parallel description, we have some other description: the agent is different, and it's a specialization of the same document. So this is how these descriptions sit in parallel, in coding terms, under the schema devised by our brilliant colleague Adam Retter. And the tricky bit is going to be what that looks like, in a way that hopefully doesn't confuse people and makes clear that one description is not secondary to another. There's no point in doing this if we're saying that, okay, there are a bunch of descriptions, but the real one is the one that the archivist wrote. That's not necessarily the case, and one of the things that the user interface sitting on top of that mustn't do is suggest that it is. So the third easy piece in Stravinsky is the Balalaika. And I'm saying this is an opportunity to bring data together. A new system such as we're working on at the moment has the opportunity to bring in data from other sources. Firstly, from the world of linked data, which is one kind of community production: there are many hands working on something like Wikidata. But we can also offer tools within the system to support the creation of new meaning, connections and arrangements. So we have the conventional archival arrangement, but we're going to have other kinds of arrangements as well. Wikidata can enrich what we know about record creators. In the National Register of Archives, we're holding information about hundreds of thousands of record creators.
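The parallel-description idea described above (several descriptions of one record, each tied to the agent that wrote it, none marked as the primary one) can be sketched roughly as follows. This is only an illustrative sketch in Python, not the schema mentioned in the talk: the class names, the `specialization_of` field, the record identifier and the two placeholder descriptions are all assumptions made for the example. Only the first description text and its author come from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class AgentKind(Enum):
    RECORD_CREATOR = "record creator"
    ARCHIVIST = "archivist"
    SOFTWARE = "software"


@dataclass(frozen=True)
class Agent:
    name: str
    kind: AgentKind


@dataclass(frozen=True)
class Description:
    text: str
    agent: Agent            # who (or what) wrote this description
    specialization_of: str  # identifier of the record being described


def descriptions_by_source(descriptions, record_id):
    """Group every description of one record by the kind of agent that wrote it.

    No description is flagged as primary: all are returned as peers.
    """
    grouped = {}
    for d in descriptions:
        if d.specialization_of == record_id:
            grouped.setdefault(d.agent.kind, []).append(d)
    return grouped


RECORD_ID = "CO-1069-example-item"  # hypothetical identifier, for illustration only

catalogue = [
    Description(
        "The escort who preceded me during my last visit to Hong Bori village. "
        "Note the trumpeter.",
        Agent("Arnold Hodson", AgentKind.RECORD_CREATOR),
        RECORD_ID,
    ),
    Description(
        "(placeholder) A description written by a present-day archivist.",
        Agent("Archivist A", AgentKind.ARCHIVIST),  # hypothetical agent
        RECORD_ID,
    ),
    Description(
        "(placeholder) A caption produced by an image-captioning service.",
        Agent("Captioning service", AgentKind.SOFTWARE),  # hypothetical agent
        RECORD_ID,
    ),
]

# Every description is shown with its provenance, in whatever order it was stored.
for kind, items in descriptions_by_source(catalogue, RECORD_ID).items():
    for d in items:
        print(f"[{kind.value}: {d.agent.name}] {d.text}")
```

Keeping the agent on each description, rather than on the record, is what lets an interface label a machine-written caption as machine-written without ranking it above or below the human ones.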
But when we're describing them, they tend to be one thing: somebody is either an MP or a farrier, they're not often an MP and a footballer. And this doesn't really capture the richness of people, firstly, but it can also lead to some awkward deletion. So if somebody is a slave holder, for example, that's obviously not all they are, but it's embarrassing for us to have to choose to call them that, or call them a lawyer, or call them an MP, or some other aspect of their lives. So being able to supplement the data that we're holding with what exists out there already about individuals is very important. We have a lot of women in the National Register of Archives (my colleague Caroline Catchpole is going to be talking about this) who we're treating quite badly, in terms of currently describing them as simply the wives of other prominent individuals rather than describing them in their own right. And again, this offers us the opportunity to correct that wholesale. There are only two kinds of gender supported in the National Register of Archives at the moment, which seems rather old-fashioned, to say the least. And there's the opportunity, sitting on top of this data, to then provide richer faceting and searching. So to say, show me record creators from Nigeria, or show me scientists who were also born in the continent of Africa. That's the kind of additional functionality that we can get if we're sitting on top of data that we've pulled in from this additional source. And then, thinking about rearranging what we have at the moment, even quite basic systems like tagging of course cut across existing hierarchical arrangements. And they're a bottom-up arrangement, because you can have a system where anybody can come in and add some tags. But the taxonomies that we use also have the opportunity to develop.
So at the moment there's no taxonomy running across the portion of Discovery that relates to records that are not held at the National Archives. And as we try and consider what those categories might be, both in terms of what constitutes a set of categories and what sits underneath something like, you know, what is black history, and is that a useful term that we want to be using, we can collaborate with other organizations and individuals in order to make pragmatic decisions about what those things might be, or even allow them to run systems that are views on our data in which they've decided what the taxonomy is. So you might have different organizations running search across records in which different definitions of some of these terms are being applied. And then finally, we have a TaNC bid in, but we hope we're also going to have the opportunity to allow people to form new connections between records and creators and the subjects of records as well. And subjects is always a rather awkward word to describe the people who are featured in records, but that's sort of the best that I can think of. So if you know, could you hear me? Yes, I can. Could you kindly sum up in a minute or so please? Yeah, sure. No problem. Thank you. That didn't seem like 15 minutes. I went very quickly. That's all right. I can do this very fast. Five is always difficult to do in 15 minutes, but basically we are also having to think about how we diversify what we collect. We know that we're piling up a lot of the same old stuff and we should probably stop doing that. We manage your collections to try and tackle that. I'm sure you will have your own plans. And then finally, and this is appropriate, it is a gallop towards the finish. We need to push on with this work. Our audiences have been waiting a long time for catalogs which are easy to understand. They've been waiting long enough and we should do something about it.
And if this seems like hard work, Stravinsky said they were five easy pieces. I can't even play the piano. There's fantastic work going on on this topic globally. And we're in very good company when we work on this stuff and we can all learn from each other. And that's what I'm hoping to be able to be doing in the rest of this session. Thank you. Thank you, Joe, very much for that. I don't think they were easy pieces, but they were thought provoking indeed. I'm very excited. And colleagues attending this session today, please keep Joe's thoughts in mind and draft any questions you may wish to post and post them as well in the Q&A function. So now we need to move on to the next speakers. They are Dr. Rosica Atanasova, a digital curator at the British Library, whose key research interests I understand is around the creation and enrichment of digitized collections and the innovative use of digital culture heritage. She is a member of the digital research team that is building digital scholarship capability at the British Library. And she's joined by James Baker, who is currently a senior lecturer in digital history and archives at the University of Sussex, and also I think a director at the Sussex Humanities Lab. From September I've heard he will become director of digital humanities at the University of Southampton. He's also a software sustainability institute fellow and has worked at the British Library in the past. This is the intersection between history, cultural heritage and digital technologies. So today together, Rosica and James are going to present a paper looking into the computational analysis of catalogue descriptions based on a project that uses linguistic tools and approaches for the analysis of collection catalogs. So the floor is yours. Thank you very much. Thank you, John, for this introduction. We've got a joint presentation with James. 
And as mentioned already, we'll discuss a collaboration that brings together historical research using computational methods and the investigation of catalogue practice at cultural institutions. Within the framework of our collaboration, we're working with GLAM professionals to test the scope of linguistic tools and approaches for the computational, critical and curatorial analysis of collection catalogs. In our presentation, James, who is the principal investigator, and myself, the co-investigator, will explain the background to our AHRC-funded project, which is a partnership between the University of Sussex, the British Library and Yale University Library. And we'll talk to you about the engagement with the GLAM community as part of the development of online training materials, as well as our future plans. So yeah, the current collaboration builds on findings from an earlier project, which I just want to briefly go through, which was funded by the British Academy, and which demonstrated, we thought, the value of using these corpus linguistic techniques for generating new knowledge about what we called curatorial voice, which is cataloging labor happening in temporal and institutional contexts. We did this based on a particular catalog I know well, the British Museum's catalog of political and personal satires, which was written by the historian and curator Mary Dorothy George, often, going back to Joe's point, referred to in the British Museum's archives as wife of Mr. Eric George. Between 1930 and 1954 it was printed as a series of physical volumes, and it's now available to search online in largely unedited form from the British Museum. It's about 1.2 million words if you download the whole lot. Our aim was really to combine computational and archival research to try and understand the features of this cataloging, of these sort of legacy interlocutors between us and the past that cataloging describes.
And one purpose of which really was to open up kind of new questions by the relevance of these kind of approaches for catalogers and curators. So could these linguistic approaches help the community better understand and contextualize their catalogs? Could it enhance access and discovery? Beyond squashing archaic vocabulary, which Joe spoke very eloquently about, could these approaches to legacy catalogs enable action that supported kind of social justice goals, thinking about the structure of language and how that passes through time? And so we had workshops, which you can see on this slide, which we invited a number of professionals from the glam sector, primarily people who worked on image collections because they have kind of more free text. We can work with, to validate our questions and help to shape the kind of follow on work. And we also drew pictures. And here's an example of us turning a description into an image and back into a description to really think about the choices that are made during cataloging an object. So on to the next slide, this then did happen, this kind of follow on work funded by the Arts Humanities Research Council in the UK under the long titled scheme, the UK US collaboration for digital scholarship in cultural institutions, partnership development grants and the partnership bits really important here. And what it does this this work is formalize the engagement between the cultural institutions we are working with before, so British Library and Yale University Library, specifically the Lewis Walpole Library and 18th century research center in the US, where they have comparable collections to the British Museum on printed image. Our real essential gambit here is that if working on catalogs can't stimulate digital scholarship capacity in cultural institutions then perhaps nothing will. And the project really has done four things to meet these ends. 
First, we've co-curated training materials with the sector based on methods which we'll talk about in a moment. Second, we started applying our approach to collection catalogs other than those we'd worked with previously. Third, we've moved from studying historically specific voice in single catalogs to its transmission between institutions. So looking at how work from the British Museum moved to the Lewis Walpole Library over time, that is, the cataloging labor where the provenance is not always clear, and trying to characterize the different types of transmission: direct, indirect, style-based, in some cases parodying or subverting the original descriptions written by Dorothy George. And fourth, experimenting with technologies to create new routes to think about the impacts of legacy descriptions on the present. So if we make a bot that uses legacy catalog data to produce imaginary descriptions of imaginary objects, which we have done and Yale are refining, what normative assumptions about the 1930s are then amplified in the present? What harms are we creating? So that's what we did through this kind of machine use of legacy catalog data. So last summer, if you could click to the next slide for me, we held a series of online training sessions, originally of course meant to be face to face, which didn't happen. Based on our methods and our data set, which was the original British Museum data set we had worked with, the sessions guided attendees through undertaking corpus linguistic analysis of catalog data using a GUI tool called AntConc, which those who do linguistic work will know, and they were really intended to gather input and ideas from practitioners in the sector on how to make these materials relevant and usable in their work. We got some really lovely feedback, which validated our approach but also gave us some suggestions for change, which we then worked on through the autumn of last year.
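To give a flavour of the kind of corpus linguistic analysis described above, here is a minimal sketch of two of the most basic views such analysis relies on: a keyword-in-context (KWIC) concordance and a word frequency list. It is written in plain Python purely for illustration and is not how AntConc itself works internally; the function names and the three sample descriptions are invented for the example and are not taken from any real catalogue.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(descriptions, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    keyword = keyword.lower()
    lines = []
    for text in descriptions:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append((left, token, right))
    return lines


def word_frequencies(descriptions):
    """Count how often each token occurs across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(tokenize(text))
    return counts


# Invented sample descriptions, for illustration only.
sample = [
    "A man stands on the left holding a paper.",
    "A lady sits on the right, a man behind her.",
    "Two men walk towards a coach.",
]

for left, word, right in kwic(sample, "man", window=2):
    print(f"{left:>12} | {word} | {right}")

print(word_frequencies(sample).most_common(3))
```

Reading the concordance lines side by side is what lets a cataloguer see how a word is habitually framed, for example which figures are described as acting and which as merely being present, rather than only how often the word appears.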
We moved to a more typical data set that was a product of many hands, rather than the very consistent vocabulary you get from Dorothy George's work at the British Museum. And we put in more guidance on how to, for example, find variant spellings, and we put in more on next steps, for example comparing catalogs. And the result then is this online tutorial we produced, which is hosted on GitHub, using openly licensed data provided by the British Library around descriptions of photographic collections. It's based on the Carpentries lesson template and pedagogical principles. The Carpentries is a long-standing software skills training initiative some of you will be aware of, and it has a library iteration which I was involved in the formation of. So we think these materials are ideal for self-directed but also instruction-based learning, and they can be reused and reworked at your leisure. They were then tested in a bunch of workshops which we held over the winter. So as a summary of our workshops, we had about 30 GLAM professionals from the UK, the US and other countries who attended the ad hoc training sessions and then the December workshop. And here you can see James delivering our December workshop on Zoom and some of the comments from the participants. Participants commented that the tool that we introduced them to was easy to set up and use, and that without any prior knowledge of the tool or the data they were able to follow the instructions and work through the tasks in the small breakout groups. They said that the workshops made them think about catalog data in a new way, and that they can easily pass on their newly acquired skills to their colleagues. Some were impressed with the speed with which the AntConc tool produces results, and with how quickly one could draw conclusions from the outputs once provided with guidance on how to read the results. As James explained, this is a tool developed for linguists, with linguistic conventions in mind.
So there was enthusiasm about trying out this computational approach with people's own data, both as a starting point for working with new catalogs and as a means of providing new insights into catalogs people knew well. In the first instance, professionals thought they would adopt this methodology to better understand the use of language in legacy catalogs, identify problematic terms, as we've already heard, and clean the data of errors, for example errors introduced by automated text recognition software. Others said that they would be interested in examining more advanced linguistic patterns such as lemmas, and in using the materials as a starting point for learning more about corpus linguistics and natural language processing. One seemed to enjoy the interactive element of the Carpentries-style lesson and found the materials to be an appropriate introduction to the tool and methods. This type of collaboration has been transformative in other ways. The British Library's involvement in the project stimulated many informal conversations with cataloguers and curators at the library who are transforming lengthy and complex catalog entries, as mentioned, created by many hands over many years, into standardized modern catalog records. Staff were interested in learning more about this new approach in order to gain a fresh perspective on the data they were working with and tease out the different voices and personal views in the text. The British Library Digital Research Team, of which I'm a member, runs monthly informal hands-on sessions known as Hack and Yaks, and you can see our logo here. The sessions give staff the opportunity to trial new methods and technology and help skill building. So last November, as part of this project, this collaboration, we invited interested staff to work with our online materials and think about how they would apply this to their own collections.
These interactions led to a new bid submitted for the follow-on funding call by the AHRC and NEH in the same strand, which unfortunately wasn't funded on that occasion, but which received very positive feedback and validation of our approach. And importantly, the activities of the project informed discussions around collection metadata strategy and standards, particularly those linked to the British Library anti-racism action plan, and complemented other activities such as the Qatar Digital Library analysis of terms in English and Arabic which may be, again, offensive or problematic. And we, yeah, James, do you want to continue? I was going to add at this point that I think some of those informal interactions are some of the real value of doing a project like this for me. And in that spirit, for example, we produced what we're calling a provocation, which you can see in the middle of this slide, for presenting legacy catalog data, really inspired by how news websites flag old articles, mocked up against our partner catalogs, including here at the Walpole. And I think one of the things that's going on here as well is that there are other library materials here that we unexpectedly engaged with, which became a real value of the work. If we just move on to the next slide, we're also doing some of this with our partners at Yale, which I've not really talked about. And I'll just spend the last points thinking about that. Here at Yale in particular we focused on analyzing the transmission of text and style, as I said, from the British Museum catalog of satires, produced at the beginning of the 20th century, to the cataloging work that was happening at the Lewis Walpole since around the 1950s.
And much like when we analyzed the British Museum catalogs in previous work, analyzing Lewis Walpole data really has thrown us back to the archive to try and reassemble the processes used to produce catalog data at the Walpole in the 20th century. And this is really expanding beyond the life of the project to something I suspect Cindy and I will be working on for many years, as we think about the history of the Walpole Library. And as ever with going back to the archive, it is incredibly time consuming, both because of COVID times, but also because of something Joe said, around the fact that so much of the practice of cataloging in the 20th century, and even beyond, really is in the heads of many catalogers. As a final point, while we think about the future, we'll just finish with some thoughts on next steps. We have been incredibly delayed in this project by COVID, but have still been able to run a transatlantic collaboration. And what we're going to do in the next nine to 10 months is try to do some knowledge exchange and partnership development to identify common areas of interest in the sector, but also among the research community who work with collection catalog data. The first is an event in July to dig deeper into our outputs, including some hands-on work, and look at some next steps. This will of course be on Zoom and is more or less full, but I think we can squeeze people in if they really want to come. The link is there. We'll then, COVID willing, be inviting colleagues in the GLAM sector, technologists, curators and researchers to our final project workshop, to be held in early 2022, which we hope will be another catalyst for change by positioning the core functions of cultural institutions at the center of digital research and computational methods, and by exploring how we can work together on the analysis of catalogs to create positive change for users and communities and for ourselves.
And I think at that point I shall finish. Thank you. Thank you very much both Rosita and James for your presentation and for your research and for sharing the outputs and the workshops on GitHub. That's wonderful. Thank you very much. I would like now to remind attendees to pop their questions for Rosita and James in the Q&A. I already have a couple of questions for Joe, so this is wonderful progress. So, it's now time to turn our attention to Roxanne. Roxanne is university librarian and chief scholarly communications officer at the Australian National University. She has been president of the Australian Library and Information Association and has researched the scholarly ecosystem, particularly around digital libraries. Her experience has been transformed by disasters in the past three years: floods, bushfires, hailstorms, building fires as well, and obviously the pandemic like everybody else, and that has given her time to rethink collections and access. Her paper at DCDC21 brings us some philosophical insights and is entitled Remaking Collections in Times of Crisis, Insights from Epistemological Theory. Roxanne, we are listening to you. You're joining us from Australia. It's wonderful to have you. Over to you please. Thank you. Thank you. I'll just start my screen from the beginning. So I should say thank you very much and greetings from the southern land. I'd like to start by acknowledging and celebrating the first Australians on whose traditional lands and airways I meet with you, the Ngambri and Ngunnawal people, and pay my respects to elders past and present, and also pay my respects to other first peoples. So we've been doing quite a lot of thinking about collections for a number of reasons, mostly to do with disasters at the moment, and really trying to say, in terms of our library and archive collections, what is the purpose?
What is it that this new world brings, these new opportunity brings to us and what is the actual strategy that we need to think about in terms of engaging with our university? So I really love this quote from Neil Gaiman. He talks about books being a way we commune with the dead, bringing lessons from the past, building knowledge, changing our knowledge through the culture and bringing those tales through the buildings and the ownership of resources that we have now. And that traditional concept of ownership and control is really being rethought in terms of what we do with access and we've had some brilliant examples so far today and I hope to add a little bit more richness in a different dimension. So why is epistemology important for us? So epistemology really is about the study of knowledge, who believes what, whose knowledge is it? And there's a lot of history and a lot of references at the end of this paper in terms of that ownership. We've talked about decolonisation as a way of, if you like, decoupling from the past and recoupling with a new future. But again, that raises the issue of who owns the knowledge and who in translating that knowledge into cataloging records is actually providing a new ownership and how dynamic that should be and what parties should be involved in that. Epistemology really has been fundamental to the nature of library and information science and really is, our catalogs are constructed to control that knowledge, but that knowledge in many ways has been controlled by those who created the collections. 
So by the publishers, by the documenters in archives, and the catalogs in recording this knowledge really justify that within an orderly unconscious structure, which is starting to become more visible because of the lens that we're applying to collections in this day and age, to understand them within their time, but to reconceptualise them within our time, and hopefully in a way that won't freeze that again within a particular ethical framework. So within library and information studies we've really seen epistemology as a way of characterising our experience, of saying that the knowledge that was the knowledge of the creators becomes the knowledge of the shared system through a new policy that we create, and we've been criticised for being conservative by reflecting the collection building as the outcome rather than talking about the dynamism of the creation of knowledge and the recreation of knowledge and access through our record system. So when we've been really rethinking the collections we've been saying, well, what is our analysis of history? How can we take into account a new scholarly ecosystem, particularly one which requires the unleashing of knowledge within a digital humanities framework? So when we review the past we really know that, if you like, history was written by the winners. A particular lens in Australia now, thanks to the work of Bruce Pascoe through his book Dark Emu, researching archives and the records of early explorers, is that a lot of information about native people in Australia, their agriculture, their way of life, their culture, was recorded but has been suppressed since then. So what does this mean for the control that we have and the way that history is reinvented? There's a lot of psychological research around the bias that sits around the nature of the written word and the recording of history.
And how can we apply that in a new and different way to think about unleashing material that will actually revolutionise and open up both the education experience and research for the future? So we've done a number of things in the past about connecting history. So we've been part of Fragmentarium, where many of the works that were of value have been disconnected and sent around the world in different ways and can be reconnected digitally. So we look at histories, we look at recreating history in a new and different way. But there are many stories that really reveal that, in reconstructing, we are dealing with what has been a practice in the scholarly ecosystem that has suppressed many of the voices that we really need to have on the historical record, which raises a new challenge and a different way of thinking. So we've often thought that history was biased in terms of the records, but that we're now living in a wonderful age where all is revealed, where we have fought the battles, we fought the battles of gender equity in the 1970s and we've achieved a new environment. Well, we are not there yet. And this example from the Public Library of Science, who we would think of as an ethical publisher, really reveals the systematic biases, the systematic constraints on our scholarly ecosystem. So two female scientists, one of whom was a relatively early career scientist, submitted an article to the Public Library of Science. It had one peer reviewer, who was male, who wrote, it would probably be beneficial to find one or two male biologists to work with to prevent the manuscript from drifting too far away from empirical evidence. You might say that is extraordinary, but this is a sign that the world that we are creating records from, that we are adding to our collections, is still systematically challenging.
And how can we tell the story so people recognize what is within the system, what the constraints are in the system, and how can we unleash that in a new and accessible way? So we need to think about dramatic opportunities, dramatically different ways that we can make this sort of transformation in a way that hasn't been imagined before. For us, believe it or not, the opportunity to really reinvent and think about our collection within symbolic and political frameworks happened through a disaster. We were unfortunate enough to lose about 300,000 volumes in a flood. In February 2018, the part of the university campus we sat on was the subject of a massive flood. The image on the top right is pretty much the view from my office in the JB Chifley Library. So that is the library that serves social science and the humanities and business and economics. So the flood came in, and we had superbly efficiently stored collection material in the basement in the ground floor. And you can see that the water was around a meter deep and remained that deep for more than three days. February, summertime, humidity of around 60 to 70 percent, a classic situation for mould. And the inevitable happened. We were unable even to get into the collection for three days. By the time we could get down there, the collection was not salvageable, and we had to make the decision to dispose of the collection so that mould would not spread throughout the building and affect the rest of the books. So what does that mean for us in terms of rethinking and rebuilding a whole new collection? Well, we've been at this for a while now. I could talk about the insurance claim for years, but we got $41 million from the insurance claim and we have now rebuilt the floor to be one that is a study area only. So it is desks and we will no longer have collection material there. But we have been scouring the world's suppliers to try and find collection material.
And after a very significant investment in every country in the world, because we had material from around the world, we've been able to replace about 37 percent of the collection. But about 60 percent we think we will probably never be able to replace in print. So what does this mean for us in terms of looking for a corpus of knowledge? Well, it means we really have to take the opportunity of the digital world to say we are going to move on. We are going to create a digital corpus of knowledge and we're not limited by the selection we had before. We can invest in a new and different way. Some of the findings in terms of thinking big to create this sort of transformation have been discoveries around the parlous state of our metadata for historic collections. So we are doing a major project to reform that and reinvent our records for a new world. In past years an enormous effort was put in to say we will have this publication and not that publication, to have selective statistics, to have selective material from individual authors. We're now able to say we're going to think big and we will have all of the works that were ever published by the Australian Bureau of Statistics, or all of a particular collection of British statistics, and not be as selective. So that's a really important different way of thinking about filling gaps and making that available nationally. And we're also experimenting with digitisation of material, because we have to do the digitisation, and we're seeing how we can work with Tesseract and Textract to actually do natural language processing and transformation of the printed word into digital humanities resources. So there are many challenges in this. We've been talking to the Australian Bureau of Statistics, for example, for more than two years about how we could obtain the whole set of their publications, which they have on microform, and make it available digitally.
Only to the university, not to the wider campus. They find that very challenging and I've got to say in two, two and a half years we have made no progress. We're going to have to go to a plan B and, with a copyright exemption, work with the National Library to do the whole historic range of those resources. Other publishers are somewhat easier to deal with and we've particularly been working with JSTOR and ProQuest to do the historic journals. But many of the works that we lost are less than 70 years from the death of the author. So copyright remains one of those impenetrable problems for us. But we are working on different forms of access control, and particularly working internationally, because we are a fair dealing regime in Australia, not a fair use regime. So we must look more broadly in order to find our solutions. So in terms of the metadata, we need to transform that and also think about the whole of the data and how we can make that more accessible. We need to fix our records. We're collecting in new and different ways that hopefully will mean there will be fewer gaps, and we are also able to use some of our flood money to actually get digital access to whole archives. So it's not just, if you like, the winners who write history, but it is collections that are deep and archival that can make a research student's life and a researcher's life much more enriched in terms of the work that they can do, particularly given that over 80% of the material we lost was history and philosophy. So we're really having to think in a new way about a corpus of knowledge, about records, about interactions with our clients and about storing and creating knowledge for the future. So in terms of working with national discovery systems, all of our records go into Trove, but we've done a lot of work for the digital products in terms of optimisation with Google Scholar.
And in terms of thinking about this epistemology, this rethinking of knowledge and our role as access rather than control, and thinking holistically, really there's a lot of alignment with the thinking about civic epistemology that's happening at the moment with disaster theorists and social science and science policy experts. So we're trying to incorporate that into the way we think about collections in new and different ways, and to work very closely with our academics to understand the concepts that they want to apply around collections, a digital corpus of knowledge and access control. So basically that traditional concept of locked knowledge, carefully chosen, where possibly more time was spent choosing material than was actually spent buying material, is having to be unlocked, and we're needing to think about our collections, our metadata, our services in new and different ways, including linkages with creators, creating institutions and new and different forms of storage and access. So in my paper there are quite a lot of references as well, but because we have to be very careful about time I might finish now, if that's all right, Joan, and hopefully we've got lots of questions. Thank you. Thank you. Thank you very much, Roxanne, for your insights and for sharing your experience and your approach, and I'm flagging that conflict, perhaps, between intellectual control and access control, which is a matter that concerns us all and is at the core of what a catalog is and what a catalog is for. I'll stop talking now. Thank you to all of the speakers. I'll thank you again at the end of the session. I have a few questions for several of you, and I'm quite glad that I already have three for Joe because I cut him short earlier. So without further ado I'm going to begin looking at those questions. The first one is asking us, Joe, does this mean that we have the end of controlled vocabularies? Is this the end of the controlled vocabulary?
I mean, it would be quite interesting to know what James thinks about that. I mean, I just think it's a myth. The controlled vocabulary is a kind of unicorn. Even when vocabularies are controlled, they're controlled usually within a small part of one institution, and they're certainly not controlled really across whole catalogs, or certainly not between institutions. So I would say yes, definitely. A bygone age of controlled vocabulary is rather a myth and I'm not sure, therefore, that we should mourn it. Okay, thank you so much. If you can, James, yes, go. I mean, controlled by whom, right? They're kind of, you know, imposed, situated and positionally inscribed knowledges, right? So they have utility so long as, and I guess the thing is going back to Joe's point about provenance, if you know people are using them, then that helps you understand how they use them, right? I think Hannah Turner's book, Cataloguing Culture, has a really good way of thinking about how these kinds of knowledges are produced, looking at the Smithsonian as a case study. And I guess I find them very useful and I have used them and I know colleagues who use them, and in the work that we do computationally, sometimes they can be a really useful starting point for trying to think about, you know, whether or not a controlled vocabulary has been used at a given moment, given there is a lack of records to describe its use. But if it's being cited as something someone is using in a process, then it becomes an incredibly useful tool later down the line, I guess, for understanding what an individual collection or institution might want to change about the way a catalog has been produced. Thank you. I'm going to move on to the next. I have quite a few. A colleague wonders how open can the catalog be?
Should anyone be able to come to a catalog and add another description of the record, perhaps with a self-declaration of where the perspective comes from, and whose descriptions would get priority by default? So there's plenty to untangle in there. Yeah, those aren't. So there's a sort of implication there that people can't do that now. Whereas of course, anybody can write an email to the National Archives suggesting a catalog correction, and Jonah's team complete, I mean, huge numbers of those in the course of a year. So we're already in a place where we sort of do that. And the question is kind of about mediation and moderation. Because clearly then there's somebody there making a decision about the content of that correction and how it's going to be deployed. So this is sort of about degrees. And I think it will be very dependent on the project and kind of what's going on. Because if we think about how much stuff there is in these catalogs, after all, we're not suddenly going to be generating huge numbers of alternate descriptions for every item at once. There are lots of different strategies that are going to be used in different parts of the catalog. And so it's a bit difficult to work out. So tagging, for example: Discovery already has a tagging system and anybody can log in and tag anything with anything. But at the moment that data is run effectively in parallel with the rest of the catalog metadata. They're not really together in the same system. Now if they were, you would certainly need more oversight than we have at the moment. Because that separation allows that oversight to be very, very light. And it wouldn't be completely appropriate to allow anybody to tag any record with anything. I mean, you can see that there are all sorts of potential issues there. So as these things get glued closer together, then there has to be some kind of examination of what's going on there.
But there's still the opportunity to give away, I think, certainly more control than we have done up until now. On the question of what description has primacy, that's also quite difficult. It was very interesting what James was saying about kind of flagging legacy description. I think that's really interesting. And so you might just say recency. The most interesting description is the one that's been written most recently, maybe, if I was having to do it as one blanket rule across the system. But again, in practice, I don't think it will be like that. I think the different kinds of description will lend themselves to different things. And it might be that the archivist's description, for example, is particularly good for findability and that another kind of description is more contextual. And so we might even decide that we don't want to show the archival description, that the archival description is sitting as a metadata layer that we're searching across, but maybe it's not the most important one that we want to show, first of all, for this particular record or collection. Very difficult to say because we're talking in very abstract terms, but that's quite possible. Thank you, that's very useful. There are lots of possibilities and issues in there. And I'm going to refrain myself, because I do have views, but I'm just the chair today. So the next question is, again, for Joe. This person thanks all speakers for the talks and would like to ask a broad point about the approaches described. So can the provenance of the catalog be retained so that as terminology is updated and improved, the historic terminology, even the offensive, is somehow preserved? I ask the latter in terms of recording change rather than causing offense. I want to acknowledge the need to change and become anti-racist, for example, more inclusive, but wanted to understand how we capture the catalog's history of biases and omissions in the past.
This seems to be important in the context as well. It is a mix of an observation and a question. And I think it could be answered by anybody, although initially it was sent towards Joe. It's about preserving the offensive original terminology and the catalog's history of biases and change of metadata. So I would be very interested to hear what other people have to say. I mean, I would say that that is sort of what we're in the business of doing with this extra. Having these very problematic descriptions is much more problematic, I would suggest, when they are the only description. And as soon as we're able to hold them alongside other kinds of description, then we're able to put them in their proper historical context, which is to say they are a product of the past, and we're saving them because what the creator of that record has to say about that record, which is not intellectually accessible to us now, has value beyond what's in the record itself. And I don't think anybody's suggesting with reparative descriptive work in general that we're going to throw those things away. But you can see that if you've only got one description, that's a real tussle. But that's why I'm hoping that the framework within Omega, this way of holding versioned descriptions from different sources, is going to be really valuable. I'll stop you there, Joe. I have a final question for you, if you could answer quite quickly. It is: how would you then envisage showing these multiple descriptions to the user? I mean, it goes into the solution rather than the problem. So that's a very... So the quick answer is I don't think we know yet. We're going to have to test that with users, basically, to try and identify ways of not creating a confusing mess and making sure that people can really see. But again, I'm drawn back to that very handy prototype that James showed.
So we can start to see what some interface features might be like to help us flag up that a legacy description is a legacy description, for example, and that an AI description is an AI description. But I think this is an area where we're definitely going to have to do some work. Thank you. Thank you very much. I have a couple of questions for Rosita and James here. The first one is about what are the key drivers for this type of collaboration? Why don't you go, Rosita? So, yes. I think, as Joe has already explained the context, we've got a lot of catalogue records and we are trying to kind of re-evaluate those. We also want to ensure that the collections are more easily discoverable, and we also want to develop new skills within the sector, within the library or the GLAM community. So this kind of partnership with James is about trying to look at catalogue data in a different way and trying different approaches which give you glimpses into the data, initially as a word list and then as words within a context, and then trying the process of actually iteratively working with the data and asking questions and querying it and working together with the curator. So it's not just one person looking at the data, but it's the data being looked at by someone who is interested in archival history together with a curator who knows the provenance and the history of the collection. So this kind of process is really, really stimulating, as well as us learning more about the tool and the computational and linguistic analysis methods, and turning that into some sort of training using the training materials we developed for the project, using that for enabling others within the organization and within the community to do similar work, and enabling the dialogue.
So I think just having the dialogue with the different kinds of stakeholders has been particularly interesting for us. And James, as he mentioned, had a previous project which kind of caught our attention, so several of us attended his workshops where he was working with satirical prints. Within the library we've got a lot of collections which haven't been catalogued, and we want to see whether we could use some of those legacy descriptions, and where the biases may be and what we should be aware of. And certainly my collection metadata colleagues, curators in the library, staff at the library are thinking very broadly about the systems and how appropriate our cataloging and collection management systems are, controlled vocabularies as we've already discussed, and just the kind of quality of our metadata. So how to, as Alan Sadlo already asked in his question, flag up legacy descriptions, how to flag up any potential kind of historic biases, use content warnings. So there are a lot of questions being asked, and this project is a kind of vehicle for dialogue and for bringing in, of course, new kinds of approaches with computational analysis. So I can pass over to James, because perhaps I kind of went a level down from the big kind of drivers. I was going to address the next question to James, or maybe he's able to make a point on the previous answer and answer this question as well: what problems might be tackled by using the project tools and methods? And I would also add, what is the impact that you would expect? I mean, I think it goes back to something Joe said actually about not just looking for individual vocabularies. I'm not a corpus linguist. Andrew Salway, who I work with, is a corpus linguist.
But what he's kind of taught me is that you can learn a lot about the structure in which people communicate, and then you can look at other kinds of comparable collections or similar things and try and see if there are patterns that are occurring elsewhere. And so for me, it's about things like trying to say, well, is a particular collection catalog working by being assertive, for example, or does it spend a lot of time trying to be honest and hedge the fact that it maybe doesn't know things? And if we can start understanding that, we can start pairing that with different types of vocabularies and whether problematic words and phrases come up, to try and see if we can join that to a kind of worldview, perhaps one coming down from someone being assertive. So the work we did with Dorothy George's catalog, for example, very much fits into a mode of early 20th century women working in sort of power academic institutions, being almost over-assertive because they're working in a patriarchal profession where, in order to be listened to, they have to be very confident. But then when we start doing things like playing around with a bot and making sort of imaginary descriptions from that kind of data, we find very different things. So we found, for example, that it's very obvious when you start training a machine using catalog data that it does these weird quirky things that you've never seen before. For example, if a description starts with, say, a man, for example, or a citizen, you quite quickly realize there's all this kind of stuff loaded in there that's an assumption about what a man looks like or what a citizen is, which means if you're feeding that to a machine, and those descriptions are becoming the basis of some kind of AI futures, then there's a real problem in how those machines are interpreting what is normative. I know that Thomas Padilla was tweeting last night about his project.
He's starting one about ground truth, saying, well, what ground truths are cultural institutions using to create machine learning capabilities? And what do we need to know about the formation of these ground truths in order to think about the kind of impacts they're having further down the line? Thank you. Thank you very much, James. I have a further question that could be answered by more than one. So I will look at your smiles to see who wants to take it. Do our collection management systems need to keep pace with our need to accommodate and document these processes, for example, the audit trails, the provenance of metadata? And how do we progress this with the vendors? I think this is something quite important for the archive sector. Joe, James. I'll be very quick to say, as a historian, I would always say yes, keep more. But one of the really interesting things I've found when talking to colleagues at cultural institutions is that they often tell me things like, oh, our systems used to be able to do this. Like in the late 90s, early 2000s, they had kind of edit trails. And you could use them like modern GitHub systems to show you the trails of edits. And then when the vendors started producing more generic solutions, all that stuff went. And it feels like there are a lot of people trying to put stuff back. I know the National Gallery is doing a lot of this at the moment, trying to put back in some of that history of the ways in which their collection catalogs have been gradually assembled. And I'm not in the sector, right? But it feels to me like the technical solutions being offered by vendors have squashed things that people seem to want. Thank you. That's really very useful. I want to move on now to a couple of questions towards Roxanne, because I'm aware of the time. The first one is about copyright. What would be the effect of copyright on this sort of reformation of collections?
So what I might do is just take the discussion a little bit further about who owns the metadata and who owns the context that we see it in, and then work that towards the copyright issue. So part of the challenge in the addition of terms, in the controlled vocabulary we've used, is that it has been culturally biased. And the challenge for us when we're looking at our Indigenous collections is that there's not, if you like, one unbiased source of truth now, and if you talk to five people, you'll probably get seven different opinions about language, which is good or bad. And one of the things about dealing with the historical materials is we're often taking language that was very euphemistic. So for example, if you've got books or photographs of women who were arrested for vagrancy in Melbourne around 1900, that just means they were prostitutes. It doesn't mean they were vagrant and homeless, they were prostitutes. But because of the way we've interpreted different language over time, and social conventions, the record itself has things that need to be unpicked. And one of the challenges or opportunities for us, I think, is to say, how do we have a layered approach? So we have the original materials with the original descriptions. We have an interpretive layer, which says, if you're using the term prostitute and you're looking at 1900 records, it will be automatically translated for you as vagrants. And then you need a third layer about individuals that perhaps learns, in an AI way, what their special interests are and interprets that in a different way. In terms of copyright, I think copyright has been a really big challenge for us just in having a sufficient corpus of knowledge to do some of this research on. We're just doing strategic planning for our university, and I had to write a paper on the perceptions of the university over the last 75 years, because it's our 75th anniversary. And I could find fantastic material up to 1962.
When the 70-year copyright hits, the rubber hits the road. So I could write all these wonderful stories about our first 15, 17 years, and then there was a bit of a gap, and I could only get the material that we had digitised. That was our own record, so our own slightly biased annual reports, which were remarkably honest in those days. So I think there's some research purpose that we need to weave into new ways of releasing information, plus thinking about the layering of interpretation of records themselves. I've got more euphemisms. I could write a whole book on euphemisms. That would be a very interesting read. I do love euphemisms and synonyms and all of that. I have time for one final question, which is again for Roxanne. What is the advantage for the user when rebuilding collections, when you take that approach of operating at collection level rather than at item level? So, and I'll refer here particularly to our historians. Often they are omnivores who are very frustrated in terms of their knowledge appetite. So they want to play the music, to take the previous presentation, on a larger scale, and they've been frustrated because, particularly in the era of microform, we've only been able to buy bits. But one of the great things about getting $40 million is all of a sudden you're thinking on a whole new scale, and you can get whole collections, and then they can think more intelligently about the research that they will do, and see the library in terms of its value as a significant partner in creating those new research opportunities. And it's a real infrastructure move, to think of it as a system and infrastructure, not just as repositories.