So this, as I've said, is a joint event between RLUK and the Open Repositories Conference, and we see it as an opportunity to share reflections on repository infrastructure. We have two talks. The first will be delivered by Professor Hussein Suleman, whose research is situated in the Digital Libraries Laboratory in the Department of Computer Science at the University of Cape Town. Professor Suleman gave the closing keynote at the Open Repositories Conference in 2023, and the talk we will see today builds on that. It is meant to inspire us, to challenge us, and to encourage us to think broadly about the ways repositories enable discoverability and interoperability globally.

Teasing ahead a little: the second talk will be from Stefano Cossu from Harvard University. Stefano will share key takeaways and updates on Harvard's Digital Repository Service Futures project. The Digital Repository Service has provided preservation and access services for Harvard's library and archival collections for well over 20 years. Even though it was a cutting-edge project when it started, it has inevitably aged over that time and is now due for a redesign, which at this complexity and size is a major institutional challenge, affecting many departments across Harvard as well as a large body of external users and their data. So in the second talk, Stefano will introduce us to the approach they have taken, not just in technology and production practices, but also in their modes of collaboration and information gathering.

And now, to get out of your way and open the stage for Hussein, just a brief introduction. He is the Dean of Science at the University of Cape Town in South Africa. As I mentioned before, his research is based in the Department of Computer Science, in the Digital Libraries Laboratory, with a particular focus on digital libraries, ICT4D, African-language information retrieval, cultural heritage preservation, Internet technology and education technology. He has in the past worked extensively on architecture, scalability and interoperability issues related to digital library systems, and has also worked closely with international and national partnerships for metadata archiving, including the Open Archives Initiative, the Networked Digital Library of Theses and Dissertations, and the NRF's South African national ETD project. His recent research has a growing emphasis on the relationship between low-resource environments and digital library architectures. This has evolved into a focus on societal development and its alignment with digital libraries and information retrieval. He is currently collaborating with colleagues in a digital humanities group to develop a proof-of-concept, experimental low-resource software toolkit for digital repositories. And with that introduction, I'd like to ask you to turn your camera on, and I'll hand the virtual stage over to you.

So thank you, Torsten, and thank you once again to the organizers for inviting me to present this talk as part of this event. I should also start by saying thank you to the various colleagues who posted where they are from. I noticed some people from my part of the world, but I also noticed that attendees come from all over the world, so hopefully we get some interesting discussion and some perspectives on this as we go along. I have titled this talk Designing Repositories in Poor Countries.
And this is meant to be a bit provocative. The question you might ask is: why am I focusing on poor countries, and how does this apply to people all over the world? How does it apply to the UK, for those colleagues who are from the UK? I'm hoping that the beginning and end of this talk will sandwich together the answer to that question: starting from a problem in one space, we want to try to address something that has more global applicability.

Maybe I can skip why I am here and who I am; I think Torsten covered most of it. I've worked on various digital library projects for a long time. I have to say, like many computer scientists who work in this area, you start off thinking that you know exactly what to do, and maybe after a few decades you realize that you really don't know what you're doing, and that this needs quite a lot of thought. That is what I am currently doing, and maybe ten years from now we'll take that one step further.

This talk is very similar to the one I gave at Open Repositories 2023. Well, to be more accurate, the slides are very similar. What I say will not necessarily be similar, because some things have changed since then, but not a whole lot. Some questions came up at Open Repositories 2023 that I thought might frame some of the conversations here. Is Africa different? I'm in South Africa: are the needs of repositories, or the needs of institutions, different if you happen to live in a poor country? Or can you simply use the same solutions as everybody else? Can you use DOIs like everyone else? I got some very interesting reactions to that particular question at Open Repositories. Or do we need different solutions? This is a question we always ask ourselves here when we are developing new technology and trying to solve problems that may be considered international problems.

So I'll start off with what was a bit of a beginning for many people, or maybe a watershed moment in this space of digital repositories: the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH. This was developed around the end of the 1990s and into the early 2000s. It was a standard developed to transfer metadata from one place to another, and a lot of people would claim that it was very successful as a standard. Some people would call it the last truly international archive collaboration, because people from all over the world contributed to it. I once read a report by the United States NSF evaluating the digital library initiatives of the late 1990s, which said that two things came out of them that were successful, and everything else had pretty much failed: Google and OAI-PMH. Well, I'm not sure I necessarily agree with the OAI-PMH part, but it was considered a success, while lots of the things we were doing were pretty hit or miss. People were building repositories, they were building experimental systems, and we weren't really sure what anyone was learning from them.
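To give a sense of how small the protocol's surface is, here is a minimal harvester sketch in Python. This is an editorial illustration, not code from the talk: it assumes a repository exposing a standard OAI-PMH endpoint, and the base URL is a placeholder.

```python
# Minimal OAI-PMH harvester sketch (illustrative, not the original toolkit code).
# Assumes a repository exposing the standard OAI-PMH endpoint; the URL used
# at the bottom is a placeholder.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    """Yield (identifier, title) pairs, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        for record in root.iter(OAI + "record"):
            header = record.find(OAI + "header")
            identifier = header.findtext(OAI + "identifier")
            title = record.findtext(".//" + DC + "title")
            yield identifier, title
        # An empty or absent resumptionToken means the list is complete.
        token = root.findtext(".//" + OAI + "resumptionToken")
        if not token:
            break
        params = {"verb": "ListRecords", "resumptionToken": token}

for identifier, title in harvest("https://example.org/oai"):
    print(identifier, title)
```

The whole protocol consists of six request types of this shape (Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord), which gives a sense of why a very short implementation time was a realistic benchmark.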
So why did we come up with OAI-PMH? The reasons are still pertinent to the discussion we're having today, which is about reflecting on repository toolkits. When OAI-PMH was designed, it was spurred on by a few interesting things that were happening. The first was a change in the commercialization model of journals, where large companies were starting to increase the prices of journals, and of course this is still happening: the so-called serials crisis of the 1990s. What people wanted was to build some kind of infrastructure that would help address the serials crisis. The second was that centralization wasn't really working, so people were trying to think of different kinds of solutions. And the third was the failure of federation. There had been early attempts at building decentralized systems, where you could build archives in various different places and connect them together, but these didn't really work. We needed something that was somehow simpler, which was an interesting problem to attempt to solve. At the same time, we wanted to divide up the space and say that if you own the data, somebody else could provide services; we wanted to separate those concerns even further. The notion behind all of this was that we would have very simple components, connected together to constitute the repository infrastructure. This was the kind of thinking behind OAI-PMH.

The protocol that was eventually released had a few attributes that I think are quite important, because they were influenced by what was happening in the repository space, and arguably, if the protocol was successful, people should be carrying these attributes through into all kinds of other developments. Something flexible and generic that doesn't use a lot of resources. Robust to failure: if something fails, you can just go in and do it again. It was designed so that it would work anywhere in the world. I famously stood up a few times and said that a particular suggestion would not work in Africa, so we're not going to do that. So we threw out some ideas that seemed like good ideas at the time, because in the late 1990s we didn't have very good network infrastructure all over the world, and this had to work everywhere. And then we had a little principle that we never wrote into any papers: the standard should be such that a skilled software developer could write the code to implement it in one day, and if it takes more than a day, then we've done something wrong. That was the benchmark. In fact, over the years this was being developed, there were various different versions, and we could turn them out really quickly because it was all simple. Well, at least that's what we thought in those days.

So what happened next? Well, we all know what happened next. People started building repository toolkits, and lots of people tried to provide services. The open access movement was going on; there are Open Repositories conferences; there are events like this one where we talk about all of this. Lots of people are attempting to set up repositories for various different purposes. So the repository movement has been a success to a large degree. But coming back to South Africa, coming back to the example of the poor country: is this all a runaway success, like I was just suggesting? Well, I can't remember where I got this graphic from, but there are some numbers on it that are very important.
The one I really want to point out is that, in the year this figure was produced, there were 22.39 million people in the labor force and 7.51 million people who were unemployed: an unemployment rate of roughly a third. This is a very high unemployment rate, and it gives you a sense that although the country may have very good infrastructure for tourists, it is still effectively a very poor country; a large number of its people are extremely poor. And this is possibly the strongest economy on the African continent, so the rest of the countries on the continent have very little resources to spare for what might be seen as a luxury: preserving your culture, or preserving the knowledge produced in your institutions.

So what do we do about it? Well, there are some approaches people have taken which I find quite fascinating. The first is something like the Nelson Mandela archive and exhibition. This is a collection of documents, images, et cetera, related to the first president of democratic South Africa, who is known all over the world. These digital objects were archived in collaboration with Google Arts and Culture, with funding from international donors; I think Google actually provided funding as well. So the Nelson Mandela Foundation was able to go out into the world and say: we have all this important material, who is going to help us with it? And everybody put up their hands, and it was done. But this is not the case for all digital repositories. There are lots of other things that are probably as important, if not more important, but for which we cannot easily acquire the resources.

There are two examples I like to quote. The first is something called the South African Digitization Initiative, with a terrible acronym that starts with SAD. It was effectively an attempt to build something like Europeana, and we resisted the urge to call it Africana, or South Africana. It was supposed to be an attempt to build repositories all over the country, connect them together, and provide information to people on all kinds of different things. I managed to find the homepage of this project; it has nothing on it, because the project never really got anywhere at all. So this is the question to ask: what happened? Why did this not work? The second example I like to use is the South African national ETD project. The last time I gave a talk, this project still existed. At this point, I think it exists as a concept, but the portal where you could in fact access items in repositories is no longer active. It has gone dormant, and somebody is going to take over at some point, I hope, and try to resuscitate it.

So what happened? Well, this is a classic problem that seems to occur over and over again in poor countries. We don't have money, we don't have time, we don't have the skilled people, we don't have the institutions that can serve as hosts for these projects. We lack the resources to get things off the ground, to keep things running, and to deal with it when something happens. Collectively, we refer to these as low-resource environments. Now, low-resource environments can occur anywhere in the world. A lot of countries in Africa could by default be called low-resource environments, but there are places all over the world that can be low-resource environments.
You could be in the middle of New York City, and there could be some NGO trying to establish an important collection as an archive, but it can't, because it doesn't have the funding. So this is an all-pervasive problem. We have lots of these low-resource environments, and how do we deal with this when we are trying to build repositories?

Well, a lot of problems can happen. The first is that you can have archives that are shut down, like the South African national ETD portal I was just talking about. It's not really the archive, it was the portal, but it's pretty much not operational at the moment. It can be even worse: there may be thesis archives set up at some universities, and the university decides to stop funding them, and the archive disappears one day. Then there are archives that are severely at risk of being shut down, where a rock art digital archive, for example, just narrowly managed to keep running by getting a last-minute donation from somebody so that it could keep licensing the software it was using. Then there are notions of digital divides. This is a criticism that's often levelled at people in South Africa: we have archives for various things we are doing, while you go to our neighbouring countries and some of them have nothing. This is a problem. Another problem we have in such environments is that we rely a lot on external providers. The institution I am at now, the University of Cape Town, is according to some people the top institution on the continent, and you could say we are one of the better-resourced institutions. However, we don't have the staff who know how to run a repository. So we use FMIF for one of our repositories, and we use Figshare for the other, because there are no people here who know what to do to set up and manage these things. And that is a bit of a concern. And then, of course, there is the case where there may simply be no archive, which seems to happen quite often.

So how do we deal with this? What do we do to address this problem? I'm not a big fan of custom solutions. I think what we need is open source software, maybe, but does the open source software actually work? Is it okay to just say, well, you have all these open source software toolkits, you can use them? I know a lot of people who have tried to use them, and they were told: well, it's open source, so you can change it any way you want to. But they didn't have the skilled personnel to change it. I've spent some time trying to deal with these things, and at some point in my life I've set up repositories using all of these software tools. I found that in many cases they are not really that easy to use. It's not easy to adapt them to your specific circumstances, and maybe they're just a bit too complicated. So is there possibly a better way of designing this? How are these tools designed? What principles were used when they were designed? I'm not convinced that there were strong principles underlying the design of most repository toolkits, and this has led me to the research I've been doing for the last five years.

One of the things I started off working on was something called the Bleek and Lloyd Collection.
The Bleek and Lloyd Collection was a project I worked on from way back in 2006. It is a collection of documents related to the history of the original inhabitants of the area I live in now, and these documents are absolutely critical to the history of the country. Somebody had decided that we needed to distribute this far and wide, to make sure it would survive whatever catastrophe, and that everybody would have a copy of it. So we wanted it distributed on a DVD that we could hand out to lots and lots of people. It turns out that very few repository tools will allow you to do something like that. In fact, most software approaches to building a collection as something that can be distributed on a standalone DVD didn't really work. So I spent a lot of time experimenting with alternative ways of building digital repositories as part of the Bleek and Lloyd project.

Then you notice that other people had run into the same problem. Greenstone, for example, was attempting to create archives that could be distributed on CD-ROM. As I looked into this further, I found that people I had known about for a long time had been thinking about this from slightly different angles: Project Gutenberg, where a decision was taken that everything would be in a very simple format; and the LOCKSS project, which said, let's assume everything is going to go wrong at some point in time, so we make lots of copies and distribute them, and maybe that will help to preserve information.

Putting all these things together, I spent some time doing analysis, looking across all of these different solutions to see what was common, and in my research group we came up with a set of principles. We said: maybe we should be designing systems on the basis of a set of design principles. For example, we only create as much system as we need; we don't have a complex software system at the back end. We don't want users to have to go through all kinds of additional steps, like logging into a system, unless they absolutely need to, and I have to say that many online systems today do impose this. A third principle is very specific to poor countries: the system should work even when you're offline, or as much of it as possible should work when you're offline, et cetera. There are a few of these design principles; I'm not going to talk about all of them.

The one I particularly want to focus on is the notion of preservation. At the end of the day, we want some sort of digital preservation, and I think this is the starting point: we want to preserve the information. But quite often we think about how we can plan for preservation. I want to invert this and think about what we do when disaster strikes. How do we design our systems so that they can be easily rescued? Because I'm not yet convinced that any solution anybody has come up with for digital preservation actually works. What I think is more likely is that some systems are going to be rescuable to a greater degree than others, and so designing for rescue is probably the better approach to take.

So there are some technical solutions, and I'll mention some of these quickly. In trying to build on these principles, what we asked was: okay, now that we know what we're aiming for, how do we build actual systems?
The one thing we have played around with for a long time is not having a database. All you store is XML files containing metadata and other things, in hierarchical directories; there is no central database or database management system. The question is: how much can you get done using this approach? It turns out you can do a lot without an actual database. The second thing is: do you have to run any software at all when somebody is trying to access your repository, or can you just pre-generate everything? This comes from other software systems that have done it before; even the EPrints software produced in the UK had the notion that you could pre-generate parts of your system. What this does is reduce the resource requirements for the user who is trying to access your system. It makes the system more scalable, and it makes it easier to rescue. So this seemed like something we really wanted to do. The other thing we tried, and I think quite successfully, was deciding that we didn't really want the server either. Could we move any processing that had to happen to the user's end, and have it run in the browser? As time has gone by, what we have in our web browsers is a fairly advanced execution engine. It is no longer just something to look at HTML pages; it is something that can run programs that do processing for you. So we can do a lot of the processing at the user's end, which means we don't have to do it at the server's end.

Putting these ideas together, we proposed that if we simplify the design of archives, it may address a lot of problems: it will be less costly, faster, more easily maintained, et cetera. We want simple archives. So I've spent the last five to six years working on the principles embodied by a toolkit called SimpleDL, a toolkit to create simple archives. There are lots of ideas that have come out of digital libraries research embedded in this toolkit, and it's experimental. There's a GitHub URL on the slide; anybody can go and grab the code and play around with it. It's not something we're necessarily going to put into production on a wide scale, but hopefully something that can help us think about the design of repositories. I'll talk about some of its features. It's a typical repository toolkit: it does the things typical repository toolkits do, but some of them it does differently.

The first thing is that it produces an offline, pre-generated website. Almost every page you go to on this repository is pre-generated and works offline; it isn't running through any kind of software system. You could copy the whole site and it would still work if you ran it off something like a USB stick. The faceted search system is built entirely in JavaScript, so there is no server software running it. This was because, if we gave the collection to somebody on a USB stick and they ran it on their computer, they needed to be able to navigate through it and perform what we would consider discovery operations, even though the entire collection is offline. The third major feature is that we use data formats that are really, really basic: XML files, and metadata stored in spreadsheets.
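As a rough illustration of the first two ideas, metadata as plain XML files in directories and a site that is entirely pre-generated, here is a minimal sketch. The file layout, element names, and page template are invented for illustration; this is not SimpleDL's actual code or schema.

```python
# Sketch of the pre-generation idea: walk a directory tree of XML metadata
# and emit one static HTML page per item, plus a JSON index that a
# client-side search can use. Paths and element names are invented.
import json
import pathlib
import xml.etree.ElementTree as ET

ITEMS = pathlib.Path("collection/metadata")   # hierarchical XML files
SITE = pathlib.Path("site")                   # pre-generated static output
SITE.mkdir(parents=True, exist_ok=True)

PAGE = "<html><body><h1>{title}</h1><p>{creator}</p></body></html>"

index = []
for xml_file in sorted(ITEMS.rglob("*.xml")):
    record = ET.parse(xml_file).getroot()
    fields = {
        "id": xml_file.stem,
        "title": record.findtext("title", default=""),
        "creator": record.findtext("creator", default=""),
    }
    out = SITE / "items" / (fields["id"] + ".html")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(PAGE.format(**fields), encoding="utf-8")
    index.append(fields)

# The index ships with the site so search and browse work offline too.
(SITE / "index.json").write_text(json.dumps(index), encoding="utf-8")
```

Because the output is just files, "hosting" the archive can mean a web server, a USB stick, or a DVD, and rescuing it means copying a directory tree.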
Some people would ask why we did not use JSON, and in fact at Open Repositories people asked exactly that. There isn't a strong reason: JSON and XML can be converted from one to the other fairly easily. But we want to keep this in a format people can understand, and something I learned a long time ago is that people, especially the people in the digital humanities I work with closely, understand spreadsheets. They don't understand fancy web-based structured data organization systems; they understand spreadsheets. I can ask them to deal with metadata in spreadsheets, and they can create something that can be ingested into the system. Most of the rest is typical repository features, so I'm not going to talk about that.

The basic idea is that we start with metadata in a CSV file, which is a spreadsheet format; we convert it into XML; and from the XML we convert into HTML. It's a two-step process, and along the way we create an index, and that's it. There are three executable files that are currently part of the system: one called import, one called generate, and one called index. I'm resisting the urge to create any more software within this particular system. We have extended it at some point so that you can add comments as well, but it all fits into the same framework. I probably don't want to say too much about this. There is an interface: after building the first version, somebody said, well, we don't really know how to use the command line, which is obviously where many things start. So one Sunday afternoon I added a simple web interface so that you could upload your CSV files. Then somebody wanted to log in, so we added user accounts, and entities and authority records. All of this is done as simply as we could possibly do it. Every new feature that was added has probably added on the order of 10 to 20 lines of code, because the whole goal was to keep the amount of code to a minimum. What we wanted was something that would not increase complexity beyond what was absolutely needed. And at some point we also had annotations that you could create and attach to objects within the system.

The search and browse is the interesting feature, because it runs completely in JavaScript. This is just a picture of what the interface looks like in one of the systems: there are a number of facets on the left-hand side, a query box at the top, and a list of results on the right. And it is dynamic: it reorganizes when you choose facets or change the query, and it's entirely running within the browser.
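The in-browser logic is essentially filter-and-recount over the pre-generated index. The sketch below expresses that logic in Python for readability; in SimpleDL itself it runs as JavaScript in the browser, and the field names here are invented.

```python
# Sketch of client-side faceted search: filter the pre-generated index by
# query text and selected facets, then recount facet values over the
# matches so the facet sidebar can be redrawn. Field names are invented.
from collections import Counter

def search(index, query="", facets=None):
    """Return (matches, facet_counts) for a free-text query plus facets."""
    facets = facets or {}   # e.g. {"subject": "history", "year": "1911"}
    matches = []
    for item in index:
        text = " ".join(str(v) for v in item.values()).lower()
        if query.lower() in text and all(
            item.get(field) == value for field, value in facets.items()
        ):
            matches.append(item)
    facet_counts = {
        field: Counter(m.get(field) for m in matches if m.get(field))
        for field in ("subject", "year", "creator")
    }
    return matches, facet_counts

# Toy usage: two records, one facet selected.
index = [
    {"title": "Notebook A1", "subject": "history", "year": "1911"},
    {"title": "Notebook B2", "subject": "folklore", "year": "1911"},
]
matches, counts = search(index, query="notebook", facets={"year": "1911"})
print(len(matches), counts["subject"])
```

A linear scan like this is exactly what the scalability question below is about: how large can the index get before the user notices the delay?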
Usually, when you come up with some kind of apparently stupid idea, like building a repository without actually being online, the first thing people are going to say is that it won't work. So I spent some time on evaluations to ask: how scalable is this? Between myself and my students, we ran a lot of experiments looking at the time it took to generate results and for users to interact with the system, especially for things like searching without connecting to a server somewhere. It turns out that, yes, if you have a very large amount of data to process on your desktop computer or laptop, it is going to take more time. But we calculated, and this must have been six or seven years ago, that we could deal with about 100,000 items in a second, and the user would never notice. In fact, no user has ever come back to me to say this seems slow; I have not yet heard that. So we could deal with 100,000 items in less than a second. And I should qualify this: it's 100,000 matches, not 100,000 items. There could be tens of millions of items, but 100,000 matches for the search query. We thought this was not bad. How many people have collections of more than 100,000 items? Well, lots of people, but how many people have smaller collections? It turns out there are lots of people with smaller collections who just haven't figured out what they want to do with their collection yet. And this is something we really need to think about.

Since then I've worked on a number of case studies, to demonstrate that we could look at slightly different domains and build repositories using a similar approach. The first case study is something called Emandulo. You can search for this and find it, and please do have a look at the website. It will show you what looks like a relatively slick digital repository system, all running on the SimpleDL software. That basically means your entire experience is static; it has all been pre-generated. Almost nothing you do on the system is dynamic. The only time it becomes dynamic is if you log in and want to post a comment. These are just some other screen snapshots; I'll skip over those.

The second case study is the new Bleek and Lloyd collection, the system going back 17 years. We are trying to retrofit the data: to take the data from the old system and put it into this new one, to see if it can be done easily, and whether we can take a slightly different conceptual view of a repository and still use the same software toolkit. Because the toolkit doesn't do a whole lot, a lot is about the skin on the website, and about the collection and how it is represented; there's a lot of flexibility there. All you need to do is change your style sheets, and you can change the entire appearance. Now, I'm no good at design; the look and feel of these has been done by professional designers.

The last case study looks at very typical documents. This is the Networked Digital Library of Theses and Dissertations, where there's a conference every year, and every year a number of papers are presented. We used the same software, with almost the default settings, to archive the papers for that particular series of conferences, because we wanted a separate archive for them. This looks more like a metadata-based repository, where every item has a PDF file, a thumbnail, and some metadata related to it.

So what do we get out of all of this? Where does it take us? After having built these systems and talked about them quite a lot, there are a few reflections I want to share. The first is that I'm not trying to build a new repository toolkit. The idea here is that we want to change the way in which people build repositories.
It would be nice if people could quickly create new archives for small collections, because if you only have your family photos, for example, why do you need to go and install DSpace? Or why do you need to give your photos to some online service, which means some AI tool is going to pick them up as well? You may not want to do that. The second thing is that this is not an attempt to solve everyone's problems. It will solve some people's problems, and other people may need to look for other solutions. But it's also meant to be the beginning of a space for experimentation. With everybody I work with, I start by letting them know that this is experimental; it's for us to think about repositories, and eventually you're probably going to have to submit your data to some larger collection based at your institution, or something like that. The note about this not being a solution for all problems is quite important. If I were designing a repository toolkit for millions of objects, I would take a different approach, because at that scale the software approach has to be different. This is a lesson we learned from Google a long time ago: when the size of the problem changes, the fundamental algorithms and the approach you take may need to change. So it's not one solution for all problems.

We've done some experiments in recent years; I think every point on this slide refers to work done by one of my graduate or postgraduate students. The last one, however, is what I've been doing since Open Repositories, so I'll say more about it. After you've spent a lot of time theorizing, trying to figure out what principles should be used for building software systems, and then building software systems to validate those principles, the next question is: do we need to do the same thing for every other kind of software system out there? Let's say tomorrow somebody decides that the learning management systems universities use also need principles for their design. It seems like we would be reinventing the wheel quite quickly. Then we as software engineers would stop and say: hang on, we need more generalizable patterns for design. So we've spent the last few months, in fact probably quite a bit of the last year, developing design patterns in which we can specify a standardized approach for developing software systems for low-resource environments, which can then be specialized for the development of digital repositories in those environments.

So finally, some provocative closing statements. Do we really need repository toolkits that are this complex? I have to say, it's been a long time since I've used DSpace myself. I like DSpace; it works. But I have had DSpace go really, really badly wrong once, and that has made me wonder whether it's something I really should be using. I also wonder a lot about the commercialization of repository services. The fact that my own university can't seem to operate repositories by itself, and has to go to some company somewhere else in the world, seems like the wrong solution altogether. And this is not just about repository services; it's about software systems used at institutions in general. And then, of course, there's the question of colonization of knowledge.
Well, I think this is a very difficult issue, and we could be here all day talking about it. But do you really want to put all of your data into a repository that is physically located elsewhere in the world? How do you know what's going to happen to your data? Is it going to be fed into the next great large language model produced by some AI company? And what laws will your data be subject to? In South Africa, we think about this a lot, and our universities will insist that data is stored only in certain countries if you are going to use online services. I don't know how many people think about that.

Then, I find that the way people engage with data is still evolving and really not settled in any sense at all. The digital humanities people I work with constantly come back to me, and by constantly I mean every month, with some new idea or new perspective on how the repository should work. It's not possible for me to use some stable, general, open source tool to let them explore the issues they are raising. But I can do it fairly quickly with a simple tool, because a simple tool means a few changes here and there and we can test an idea.

The next reflection is that when we think of these archives, it's very important to realize that this is not the final place where an object is going to be located. The object is definitely going to be somewhere else at some point in time; it's just a question of when somebody has the will to relocate it. In some sense, this is all a temporary location, and we need to account for that in the design of our systems. And then, of course, we are not all online.

And finally, there are a lot of compromises to make. Rescuing, I think, is very important: we must be able to rescue the data. A lot of people think scalability is the most important thing, but in many instances digital preservation is far more important than building a system that can scale to millions of objects. If you can't preserve the 50,000 most important digital objects of your country, who cares whether you can build a system to access everyone else's data? That is the perspective that applies where I work. The next point: if we can build these things quickly, and this relates to why the South African version of Europeana did not work, we could do some experiments with collaboration. We could look at how to connect these things together and build a repository infrastructure on a larger scale. But if we can't build the granular archives relatively simply, we're never going to get off the ground. In fact, one of the reasons I worked on this project, and am still working on it, is that the attempt to create a South African version of Europeana failed because we didn't have any archives to start with in the first place, and we didn't have the skills to start installing archive tools all over the place. And lastly, if you think about resource limits, I'd like to think that we'll end up with different solutions.
These different solutions may solve the problems in my part of the world, but I'm hoping that we will contribute ideas that result in better solutions for everyone, because anything that is simpler and uses fewer resources is a solution the whole world could potentially benefit from. There are some references at the end. Thank you for listening; my details are on the last slide, and I will hand back to the chair.

Thank you so much for the talk, and for bringing the perspective you have from the countries you work with up to the global scale at the end. I already have a lot of questions, but at this point, as we said, we'll put a pin in those and move on to the second speaker, and then we can have a discussion with both panelists later. As a reminder: if you have any questions for the panel, put them into the Q&A function at the bottom and we'll pick them up later.

I'm now inviting Stefano to come on screen. As a quick introduction: he joined Harvard in 2022 as the Digital Repository Architect for DRS Futures, a task force in charge of reimagining and re-engineering the university library's digital preservation system. Before his current role, Stefano worked as a software architect at the Getty Trust and as Director of Application Services and Collections at the Art Institute of Chicago. In all of these roles, Stefano has researched and promoted community-supported technology, sustainable practices, and a focus on cultural heritage and academic challenges in IT. He has also been an active participant in communities including IIIF, Fedora, Samvera and OCFL, at the technical as well as the strategic level, for over a decade. So with that introduction, Stefano, I'm handing over to you; the stage is yours.

Today I'm going to talk about the DRS Futures project at Harvard University. And it's interesting how Professor Suleman's presentation and mine were put together: we're going from talking about low-resource countries and institutions to an institution that has a lot of resources. Harvard Library has the money, it has the skilled labor, and, most importantly, we had buy-in for a major repository and digital preservation service. And, surprisingly, we still ran into a lot of constraints and had to make many hard choices. So my conclusion was that no matter how much money you have, you always run short, and you have to make hard choices.

As a preface, I have to say that we are still in an RFP negotiation process, so I can't share many of the details of that process; I'll focus instead on the principles and on our approach to many of the problems we encountered. A short word about the DRS, the Digital Repository Service: it's digital preservation and digital repository software that was first established in 2000 by Harvard Library. It was built entirely in-house, because back in 2000 there were no viable digital preservation solutions, and it also needed to address the very specific and complex needs of HUIT, which is Harvard University IT. Today the DRS holds about 10 million objects and 900 million replicated files across four different storage back ends, totalling about two petabytes of replicated data.
We have users across campus from 63 departments who use the DRS for many different types of objects and digital files. The DRS is currently actively supported; however, as I mentioned, it is aging and has many shortcomings, including growing technical debt. You can imagine how much technical debt we have accumulated in 24 years. In particular, the content model it's built upon is very inflexible: even just allowing a new file type to be ingested into a content model takes a fair amount of labor. And the UI, the user experience overall, is quite inefficient, because it's built upon very old libraries, very old software that is not fit for today's expectations and workflows.

All of this led to starting the DRS Futures project, a three-year capital research, design, and implementation project to completely rethink, redesign, and reimplement the DRS from the ground up. The project is divided into three phases. Phase one was the discovery phase, which ran from July 2022 to June 2023. The second phase is planning, from July 2023 to February 2024. And the third phase is implementation, from March 2024 to June 2025. Throughout this process, we have been open to different options for our new repository: we've been looking at commercial solutions, open source software, and even home-built solutions. For Harvard Library this is an opportunity to re-envision digital preservation as a whole, not just to implement some software based on the concept of digital preservation from 20 years ago.

The DRS Futures team is a joint team of HUIT and Harvard Library members who have different skill sets and seniority levels and come from different walks of life. We are a very heterogeneous group, and we largely interact as peers. We don't have an appointed formal team lead, so we make decisions as a group. We are connected with many other departments, especially during the first phase, when we solicited feedback from stakeholders. We have been doing a lot of outreach, and this outreach extends beyond Harvard into the community, with events such as this one. The team is very collaborative: as I mentioned, we make decisions together, but we also spin off occasional task forces for specific topics.

The problem of redesigning and rethinking the DRS was approached from two sides: one we call inductive, from the bottom up, and the other deductive, from the top down. From the bottom up, we conducted interviews with different departments, focused on those departments' needs and workflows: how they do things, how they deposit, how they retrieve objects, how they search, and so on. There were also departmental focus groups on specific topics or areas common to different departments, to find common patterns across them. In addition, we held office hours with an entirely free format: we just logged on to a Zoom session and let in anybody who wanted to talk about DRS Futures, or the legacy DRS, and how they would envision DRS Futures. On the other hand, we also had a top-down approach, where we built on our previous experience with the DRS, which allowed us to set out digital preservation foundations from a technical and strategic standpoint, and to draft a long-term vision independent of the current state of the DRS.
That also prompted us to anticipate the challenges of a more capable system, based on the future growth of the DRS and the needs of Harvard Library. During phase one, we focused on several key points: separating storage and services; separating the archive and the workspace; automating tasks; re-envisioning digital preservation; revolving feedback; and building for future scale. I'm going to go over each of these points in detail.

For the separation of storage and services, we actually achieved this in 2022, by migrating all our data into an OCFL layout. That was a major achievement that took over a year of data transfer, and we don't want to do it again when we implement DRS Futures. The goal for DRS Futures is to replace services and leave the storage fabric intact. That is pretty much the purpose of implementing OCFL, which should be seen as the permanent, persistent, most valuable part of our repository, on top of which services, which are perishable by nature, can be replaced at will. That is a pretty tough challenge, but a possible one.

Another issue we found with the current DRS is that some departments have content management systems to manage their data on a day-to-day basis and have short-term storage, while other departments don't have a CMS and are using the DRS as one. That makes their lives very hard, because the DRS, moving everything through OCFL, which creates a version every time you change something, is not a sustainable way to make short-term changes. So one of the goals of DRS Futures is to separate these two parts. We want to create a workspace: a place where users can arrange their content, work on it for a certain amount of time, keep things in a relatively safe place for the short term, and then move things to a proper archive for long-term preservation once they are settled. This approach provides users with that space and keeps OCFL focused on what it does best, which is preservation. It also allows us to look for solutions focused on each of the two areas, content management and preservation. We don't have to find a magic bullet, one perfect solution that does everything; it could be multiple products that fulfill either the archive or the workspace part.

The content model is another area that will need to expand greatly in the future. We have a functioning but, as of today, limited content model, and we want to be able to allow more content types into our repository. With the migration-less approach I mentioned earlier with OCFL, we want to first keep backwards compatibility: create a content model that encompasses what we currently have, so we don't have to move data when we launch the new system. That way we can defer major content model redesigns until after launch, staging our repository's evolution so that content model migration is not tethered to software updates.

Automated tasks are another major topic. A lot of our users spend a lot of their time doing repetitive tasks, because the UI and UX of the DRS are not up to today's challenges. We want to take away some of those actions and move them to the background, by setting up an event-driven architecture that detects changes in the system and performs actions, especially actions that don't need human judgment, in the background.
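As an illustration of that pattern, here is a generic sketch, not Harvard's design; the event names and payloads are invented. A fixity check is a typical action that needs no human judgment and can run whenever a deposit event appears:

```python
# Minimal event-driven sketch (illustrative): a fixity-check worker reacts
# to deposit events instead of requiring a user to trigger checks by hand.
# Event names and payload fields are invented for this example.
import hashlib
import queue

events = queue.Queue()

def on_file_deposited(payload):
    """Background action needing no human judgment: verify the checksum."""
    digest = hashlib.sha512(payload["content"]).hexdigest()
    status = "ok" if digest == payload["expected_sha512"] else "FIXITY FAILURE"
    print(payload["file_id"], status)

HANDLERS = {"file.deposited": [on_file_deposited]}

def dispatch():
    # Drain the queue, routing each event to its registered handlers.
    while not events.empty():
        name, payload = events.get()
        for handler in HANDLERS.get(name, []):
            handler(payload)

data = b"page scan bytes"
events.put(("file.deposited", {
    "file_id": "drs:0001",
    "content": data,
    "expected_sha512": hashlib.sha512(data).hexdigest(),
}))
dispatch()
```

In a production system the queue would be a message broker rather than an in-process object, but the division of labor is the same: the user deposits, and the checks happen behind the scenes.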
We are moving some of these actions to the background so that they become implicit: users won't have to explicitly do certain things to preserve their content, because some of it will be automated. This event-driven architecture of course increases the complexity of our system and will require some labor, but we expect that to be paid off by the volumes of data we can move through the system compared to what we move through now.

Re-envisioning digital preservation, for us, means the following. So far we've been preserving the bits and bytes of things, but a lot of the content we have in store is not very well understood. It is preserved, yes, but not very useful, because some of it is not well indexed or analyzed; we don't really know what we have for many of the objects we are storing. We want to make sure we preserve not only the content but also the semantic context that surrounds it, because one of the purposes of digital preservation is to make content available: not only to keep it from going away, but to make it available to users for the long term. We also acknowledge that archival resources are live materials that change over time. There is no final version of anything, so we account for this evolution by implementing versioning through OCFL, sketched below, which addresses the fact that any record can be updated at any time, and that we have to preserve that history. With that, we also want to facilitate the reuse and cumulative evolution of information: new knowledge about any subject, any topic, any bit of information can be added, so we want to enable research and the enhancement of knowledge about what we hold in our repository.

We also want to encourage continuous, revolving feedback. There is a quote from the OAIS specification here, which I won't read out loud in its entirety, but it outlines not so much a technical system as a people-and-processes system that allows for continuous improvement through continuous feedback. During phase one we solicited a lot of feedback; we built relationships and trust with our stakeholders. We want to maintain that trust and keep that channel of communication open, so we can keep developing processes for continuous, iterative improvement of what we will eventually make available to them. We want to support not only production workflows but also communication workflows at the same time.

And, obviously, we plan to grow. Not only will the data sets that each of our depositing departments produce grow, but there are also a lot of departments we haven't included in the DRS yet. We have a large archive of AV materials that we haven't preserved yet; we are in talks with the research departments about storing research data; and whole major schools are not using the DRS yet. Each of these three points could multiply the volume and the complexity of the DRS severalfold, and that will surely lead to unexpected usage patterns and needs emerging from the increased volumes and increased versatility of the data.
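On the OCFL versioning point sketched here: the example below shows roughly what an OCFL inventory looks like after one object has been updated once. It is a hand-written illustration following the OCFL 1.1 inventory structure, with an invented identifier and truncated digests; it is not an extract from Harvard's actual storage.

```python
# Hand-written illustration of the OCFL idea: every change to an object adds
# an immutable version whose state maps logical filenames to content digests,
# so the full history is preserved. Identifier and digests are invented.
import json

inventory = {
    "id": "urn:example:drs:0001",              # invented object identifier
    "type": "https://ocfl.io/1.1/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v2",
    "manifest": {                               # digest -> paths on disk
        "ab12...": ["v1/content/page1.tif"],
        "cd34...": ["v2/content/page1.tif"],    # corrected scan, new digest
    },
    "versions": {
        "v1": {"created": "2023-01-10T09:00:00Z",
               "state": {"ab12...": ["page1.tif"]}},
        "v2": {"created": "2024-02-02T14:30:00Z",
               "state": {"cd34...": ["page1.tif"]}},
    },
}
print(json.dumps(inventory, indent=2))
```

Because the layout is plain files plus this inventory, the storage fabric can outlive any particular service that sits on top of it, which is exactly the separation of storage and services described above.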
So, as I mentioned, phase one is complete. The key artifacts from it were: a user requirements catalog built from stakeholder input; technical foundational principles and requirements, which informed most of the technical work; and a weighted matrix of requirements using the MoSCoW notation, which stands for must, should, could, or won't, based on the need for each specific feature (a toy example of such a matrix follows below). We laid out persona profiles for future DRS users, and we are working on an abstract reference content model. Most importantly, we compiled an RFP that we distributed to potential bidders for the solution, or solutions.
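To make the weighted MoSCoW matrix concrete, here is a toy sketch of how such a matrix can score a vendor response. The weights, requirements, and support scores are all invented for illustration; this is not the project's actual matrix or scoring scheme.

```python
# Illustrative weighted requirements matrix using MoSCoW categories
# (Must / Should / Could / Won't). All values here are invented.
WEIGHTS = {"must": 10, "should": 5, "could": 2, "wont": 0}

requirements = [
    ("OCFL-native storage", "must"),
    ("Event-driven background tasks", "should"),
    ("Built-in IIIF viewer", "could"),
]

# Vendor-assessed support per requirement, 0.0 (none) to 1.0 (full).
vendor_support = {
    "OCFL-native storage": 1.0,
    "Event-driven background tasks": 0.5,
    "Built-in IIIF viewer": 0.0,
}

score = sum(WEIGHTS[category] * vendor_support[name]
            for name, category in requirements)
maximum = sum(WEIGHTS[category] for _, category in requirements)
print(f"vendor score: {score}/{maximum}")   # vendor score: 12.5/17
```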
We are now concluding phase two, and this slide is an update. It's very succinct, because I plan to expand on it with my colleagues at Open Repositories in Sweden in June, but I wanted to give a very quick update on where we are in our current phase. In phase two, we evaluated the RFP questionnaires and had follow-up Q&A with a restricted number of bidders, and we started to fit the ideal design we laid out in phase one to the solutions on our shortlist, envisioning our workflows and content scenarios within each solution. To do that, we requested sandboxes from the RFP bidders, so we could test their solutions against our prospective workflows and approaches to the repository. We conducted reference checks, and we are also shaping the cost model for the next ten years, because, again, while this is a three-year project, we also have to predict how much it is going to cost us over the next ten years in licensing, maintenance, custom development, and so on.

Throughout the process, we have been striving to maintain an unbiased position. Of course we're all human, we all have our biases, but the diversity of our group is very good at cancelling out each other's biases, and we are very open to criticism and to eventually coming to agreement on points that might otherwise be points of friction. We also hired a software engineer, a data engineer, and a change manager; the first two have actually begun doing some integration work that is independent of the chosen solution, and we will have plenty of other work to integrate whatever solution we choose with the existing architecture.

So, where we are now: we are moving toward choosing a vendor, or vendors, for the final solution for the DRS, and narrowing our design down to that choice. We are refining the content model in the same way, with updated information about the most plausible solution, and drilling into the details of the proposals we've been given, to find whatever issues or points of discussion there might be. We are also designing the replacement of some legacy components of the DRS: if the old DRS goes down at some point, which functionality will we need to replace, and which component of the new system replaces it? And if no component does, we will have to implement it ourselves. So we're going through that exercise.

A few conclusions and takeaways so far; these are my own conclusions. Allocating time and budget for the first phase, the discovery phase, paid off a great deal. We had plenty of time and discussion to really know what we were getting into, and approaching the project with an unbiased, fact-driven mindset helped a lot. As a result, some unexpected priorities and directions emerged during the discovery phase. For example, the separation of the workspace and the archive was not part of the initial project, but it emerged as a primary concern, so it was eventually embedded in the project and steered many of our technical and strategic choices. Keeping an open mind also required an open mind from our partners, from the people proposing solutions: we know it's very unlikely that any solution, whether open source or commercial, will work out of the box, so we need some flexibility from the partners we are going to work with, to make changes on either end and find the sweet spot between the two. And in any case, even with an off-the-shelf solution, we expect plenty of customization, because we have many other systems that depend on the DRS, and those systems are not going away. We expect to modernize some parts, especially aligning the interfaces, the APIs, with modern standards, and making things more predictable and better documented.

That concludes my presentation. Thank you for listening, and if you have any questions, feel free to post them in the Q&A and I'll be happy to answer.

Well, thank you very much, Stefano, for your presentation, and I would like to thank both of you for keeping perfectly to time; in fact, we're a little early. I'd like to invite you both back onto the virtual stage, and then we can transition over to questions. I have quite a few, but I see that the first question from the audience has already come in, so maybe I'll pick that up first. This is a question for you, Stefano: will your solution be on-premises or cloud-based?

I don't really know how much I can talk about this, but we are still debating which solution works best for us. We have a policy at Harvard that constrains us to keeping at least one copy on premises, so we will very likely have a hybrid solution; the cloud is of course very convenient. As part of a digital preservation strategy, we want to achieve as much diversity of storage as possible: diversity of media, of platforms, of locations, and of providers. So we will very likely have at least some, maybe mostly, cloud-based copies, and one on-premises copy.

Thank you, that was very helpful. I have a follow-on question that I've been thinking about while you spoke. One of the challenges, I guess, with anything of that scale, but also evolving as rapidly as you described, is future growth and how to model it. Could you say a little more about how you approach that scenario planning for the years ahead? And, if you can, could you also share something on how you are considering the cloud in that context? There is obviously not just the cost of storage; there is also the cost of moving data around, which in some cases can be notably higher. Are you at the point where you have some ideas on where the growth will go, and what that might mean for budgets and the whole architecture?
We have a billing model with DRS where the individual departments pay for what they store, so the Harvard IT department that hosts the solution does partial cost recovery through the departments that deposit their materials, and those departments have to pay for storage somewhere. So in that regard, the problem is spread out across the departments. Our main problem will be whether the system as a whole, the indices, the management, can handle having terabytes and terabytes, or petabytes, of research data added. If we have two petabytes now, we might have ten petabytes in another few years, and that is a complete change of scale. Of course, during our RFP process we inquired about the scalability of the solutions, and we are looking at vendors that can make guarantees and that have implemented very large repositories. Aside from that, there are also talks with the departments that are planning to use DRS as their preservation backend, schools that hold research data, or the AV project, which requires special handling and won't be implemented very quickly. They will need a lot of time, not only to transfer the material but also organizationally, and they will probably add some complexity to the content model as well.
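Stefano mentions both a ten-year cost model and storage growth that could jump from two to ten petabytes. A back-of-the-envelope sketch of that kind of projection might look like the following; every number here (rates, growth factor, department volumes, recovery fraction) is an invented assumption, not Harvard's data.

```python
# Illustrative sketch of a partial cost-recovery billing model under storage growth.
# All rates, growth factors, and department figures below are invented assumptions.

COST_PER_TB_YEAR = 100.0   # assumed storage cost per terabyte per year
RECOVERY_RATE = 0.6        # assumed fraction recovered from depositing departments
ANNUAL_GROWTH = 1.20       # assumed 20% growth per year

departments = {"Library": 800, "Archives": 600, "ResearchData": 600}  # TB stored today

for year in range(1, 11):
    total_tb = sum(departments.values()) * ANNUAL_GROWTH ** (year - 1)
    cost = total_tb * COST_PER_TB_YEAR
    recovered = cost * RECOVERY_RATE
    print(f"Year {year:>2}: {total_tb:>8.0f} TB, cost ${cost:>12,.2f}, "
          f"recovered from departments ${recovered:>12,.2f}")
```

Even this toy version makes the scale problem visible: at 20% compound growth, storage roughly quintuples over ten years before any new petabyte-scale depositors are added.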
Thank you, that's very interesting. As we already had quite a bit on preservation, maybe I have a follow-on question for both of you, and this one comes directly from me. I'm a historian by training, which means I'm used to looking at a historical record that's not complete and just accepting the fact that things disappear. Now, in preservation, what we're trying to do is not let things disappear, or at least, to come back to your point, to give ourselves a good enough chance that we can rescue most of the material. But we are obviously at a point where lots and lots of data is generated, and it brings up the question: should the aim, in a research data or collections context, really be to preserve absolutely everything? Or are there principles we should and could apply to decide, factoring in cost and other considerations, what's worth keeping and protecting to a higher level, and what material we might perhaps discard after a while, or at least not spend as much effort on preserving? And Stefano, as I've asked you twice in a row, maybe I'll pass that question over to Hussein first, and then you're welcome to respond.

Sure, thanks Torsten. I think this is something a lot of people don't think about. Many years ago my institution was trying to develop some kind of policy on the kind of digital repository service that Stefano was talking about, and they didn't really know how to think it through, so I made some suggestions. I said, well, we need to determine who the owners of data are, and a billing model for the people who own the data. Then we need to decide whether the data has value to those original owners, and whether the data has value to the institution beyond that, because there needs to be a decision point where somebody who owns the data funds the continuous management and preservation of the data up to some point, but if they are no longer able to, the institution then needs to make a decision at a higher level: do we continue to fund this, or some portion of it, and how much of it do we fund indefinitely? In fact, I wrote this up as an algorithm and gave it to them and said, here, this is what you should be implementing, because it would balance the control, management, and independence of the researchers who have the data against not losing the most important things that the institution values. And as Stefano was talking, I was wondering if you have some system like that in place for the institution to take ownership of the most important things; I'm sure you can comment on that.

Yeah, well, I was going to say that my opinion is that preservation starts at deposit, because the information you put in is what's going to make the material useful. The purpose of preservation is not to keep things but to make things useful and available, right? But after hearing you, I think it comes even earlier; it is a political and budgeting decision. The institution decides what's important, and that might be biased, but it's the best way we can achieve preservation, because everybody has finite resources. We have too much data and too little workforce to classify, categorize, and make everything findable in detail, so we have to make choices about that. I don't know if that answered your question.
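Hussein says he wrote this decision process up as an algorithm for his institution. The talk doesn't show it, so the following is only a hypothetical reconstruction of the flow he describes, where the owner funds preservation while the data has value to them and the institution then decides whether to take over.

```python
# A hypothetical reconstruction of the decision flow Hussein describes; his actual
# algorithm is not shown in the talk, so the branches and names are illustrative only.

def preservation_decision(owner_can_fund: bool,
                          valuable_to_owner: bool,
                          valuable_to_institution: bool) -> str:
    # While the owner values the data and can pay, they fund its management.
    if valuable_to_owner and owner_can_fund:
        return "owner funds continued management and preservation"
    # When the owner drops out, the institution decides at a higher level.
    if valuable_to_institution:
        return "institution funds preservation, possibly a subset, possibly indefinitely"
    return "candidate for de-prioritization or disposal"

# Example: the owner can no longer pay, but the institution still values the data.
print(preservation_decision(owner_can_fund=False,
                            valuable_to_owner=True,
                            valuable_to_institution=True))
```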
Well, thank you both. I think we can maybe pick that up again later, but I see we have a few questions coming in on a really interesting topic that's also very much on my mind, so I'll combine them; they are all on environmental impact. There is partly the question of whether, considering the growth of storage requirements, and I would also argue networking and compute requirements, the environmental impact of these large data repositories we're building is a concern; this is a question for both of you. And the follow-on question is how far either of you, in your projects and your research, is considering the environmental impact and how it could be reduced, and, if you do look into it, whether you have any recommendations that others could apply. So: concerns about the environmental impact, how far it factors into your current work, and any recommendations you have for the audience. Shall we change order, Stefano? Would you like to go first?

Yeah, okay. The environmental impact is part and parcel of our RFP questionnaires, so we actually have questions for the bidders about how they evaluate environmental impact, how much environmental concerns matter to them, and whether they have any examples of environmental concerns and how they addressed them in their system. In a way, environmental impact goes hand in hand with resource usage: more efficient software, more efficient computing, more efficient storage has a lower environmental footprint. But yes, we made it an integral part of our research process, and to a certain extent we found that there are no real specific, defined metrics to actually quantify the environmental impact of a repository. There are some efforts in this direction, but I think it's a very active area of research that still needs a lot of improvement and expansion.

Thank you, Stefano. I'll just add that I think Stefano said something that's very important for us, which is efficiency, efficient use of resources. In the case of building small repositories in low-resource environments, or any repositories in low-resource environments, we just don't have the resources, so resource efficiency is incredibly important. I know that a lot of people who start off an early repository, not on the scale of what Stefano is talking about, but many people at institutions, begin a repository project by first going out and acquiring computing hardware and hiring staff, and I think that's a very resource-intensive approach. You don't really need that amount of resources for something that is not a fairly large repository, especially when you are starting off. It should be possible to support a lot of what you want with the bare minimum of resources, and I think we have to design for efficient resource use in everything we do; this is something we haven't been doing consistently in repository design up to this point.

Thank you both. The efficient use of resources seems to me a good point from a whole host of perspectives, including cost. I'd like to take us briefly back from this perspective to the question of preservation and rescue of material. A while ago, while I was still in the UK, I was in a discussion on preservation of digital collections in the cloud, and some colleagues involved expressed concerns about the cloud, while others said, well, maybe the cloud might be safer than our local systems. I had half-jokingly, but only half-jokingly, thrown in the recommendation to just take a copy of everything on tape once a year and bury it in a colleague's back garden, as a cost-effective, fairly secure, and hopefully also environmentally friendly solution. I've been thinking about this a lot since, in particular because, as you may have seen, a few years ago there was a cyber attack against the British Library that locked up much of its digital infrastructure, and my former colleagues are now working really hard to bring systems and content back online, which I think is a classic case of probably both rescue and preservation. So what I'm wondering about in this context, for our repository infrastructure overall: how far have we really designed our repositories for rescue?
There's a lot of talk about preservation, say preserving file integrity, but I certainly have not often been involved in exercises or planning that models something a bit more drastic and then thinks about what rescue we're undertaking. Is this something we are already doing, and maybe it has just passed me by? And if not, how can we apply that kind of thinking most usefully to repositories? Maybe on this point, Hussein, if you would like to start, and then, Stefano, feel free to add any information you might have.

I wish I had really good answers to your question, but this is a topic that worries me a lot. On the starting point of burying the tapes in your backyard: there was a time when I thought that optical discs were going to be useful and that we would gradually increase the lifetime of those discs, and eventually they became something that nobody uses. So what do we have? We have magnetic tape. What's the expected lifetime of magnetic tape? Some people have told me it's something like 30 years; maybe I've got that wrong. I think I do have some 30-year-old tapes; I should try them to see if they work. And I think it's very worrying, because we don't know that the actual bits are going to be preserved, but this is not a problem I'm currently addressing; I'm hoping that people in the engineering space are going to come up with media that have a longer lifetime, so the bits can be preserved. Another question I want to ask is: if the bits have been preserved, and somebody comes across my digital collection buried in the backyard and is able to extract the bits, will a reasonable technical person be able to reconstruct my archive relatively easily, even if the operating system, the software, everything else does not work? This is the principle I use in the tools I design: somebody who knows nothing about the system should be able to look at it and see plain text. It turns out that a lot of the core is plain text, so if you can read plain text in 100 years, 200 years, 500 years, hopefully you can reconstruct this, assuming we can solve the bit preservation problem.

Thank you. Stefano? Yeah, I'd argue that burying tapes in a backyard is not very environmentally friendly; I wouldn't eat the tomatoes from that backyard. But jokes aside, I think Hussein raised a very good point: one thing is preserving the contents, and the other thing is interpreting the contents for someone who doesn't know anything about them, who comes across this trove of materials and doesn't know how to interpret it.
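As one way to picture Hussein's plain-text principle and Stefano's worry about interpretability, here is a minimal sketch of a self-describing, plain-text archive layout; the file and field names are illustrative assumptions, not the layout of any real toolkit.

```python
# Minimal sketch of a self-describing, plain-text archive in the spirit Hussein
# describes: everything a future reader needs is human-readable text.
# File names and metadata fields here are illustrative assumptions.
from pathlib import Path

root = Path("archive-demo")
(root / "items" / "item-0001").mkdir(parents=True, exist_ok=True)

# A top-level README tells a human (or a script) how to interpret the tree.
(root / "README.txt").write_text(
    "Each directory under items/ is one record.\n"
    "metadata.txt holds 'field: value' lines; content files sit alongside it.\n"
)
(root / "items" / "item-0001" / "metadata.txt").write_text(
    "title: Example pamphlet\ncreator: Unknown\ndate: 1923\n"
)
(root / "items" / "item-0001" / "fulltext.txt").write_text("Plain-text content...\n")
```

The design choice is that nothing in the archive depends on running software: a reader with only a text editor can recover both the records and the rules for interpreting them.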
Recently in the OCFL community there has been a proposal for a specification enhancement to include application profiles. This is very, very new, so I don't know where it's actually going to go, but there is a proposal to include some way, at the top of an OCFL storage root, to provide some instructions, whether human- or machine-readable, about the contents. As you might know, OCFL only specifies the folder structure, and then there is a content folder that is completely opaque to OCFL; this proposal is to provide some context about that content folder so that anybody who digs up this data can know how to map it to some meaningful content. But yes, it is a big concern, especially retrieving data. The British Library case is very much on our minds these days; there was a very recent update, on March 8 I think, and their account of the process is very interesting to read. It actually led us to some more talks with storage backend providers about finding a solution that is a disconnected copy, entirely offline. We were met with some surprise, because many of these storage vendors are geared toward the fast, cheap option, not a disconnected, slow, tape-based solution; they wanted it to be connected. But they acknowledged our point that being disconnected is sometimes safer, at least in one respect.

Well, thank you, that's very interesting. I'll now go to a question that has just come in, although I assure everyone else who has posted questions that I'll come to you too. This follows directly on from the last point: we have a question that links to the cyber attack on the British Library, and Hussein, this one is specifically for you. Are there increased or reduced security vulnerabilities when taking a minimalist approach?

Right, so I think this is tricky, because the notion of building secure systems changes over time. I would have to say that we can't give up the notion of building secure systems, but while keeping in mind the various approaches to building secure systems, what can we reduce in terms of system complexity, assuming that you've done what you need to do to build a secure system? I have to say, unfortunately, that based on all the things I know, our systems are becoming more difficult to protect over time, so that security layer is becoming more complex, and this is a problem for the people working in security to deal with. But once we can somehow manage that security layer, and maybe this is the way we need to start thinking about it, that security is a layer over the service-provision layer, rather than having security and the provision of useful services so deeply integrated that we can't separate them, then the complexity of each can be controlled independently. I'm pretty sure the security layer is just going to increase in complexity, but I'm hoping that the complexity of the repository layer decreases over time.

Thank you for this. Hussein, I don't know if you have anything to add, but Stefano, I also have a direct question for you, so feel free to respond to both or pick just one. This question is about discovery of the content: the colleague is wondering whether you will have integrations between DRS and the library's discovery services or tools, so you can have one search across multiple platforms.

The discovery layer is an entirely different project, but yes, DRS will feed most of that discovery layer, because, I wouldn't say it has everything, but it has almost everything, and it can provide a comprehensive view of all our contents. So yes, there will be integration: we have some MongoDB databases that aggregate data from DRS and from non-DRS sources and feed the discovery indices.
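A rough sketch of the aggregation pattern Stefano describes, normalizing DRS and non-DRS records into one MongoDB collection that can feed discovery indices; the connection string, schema, and source names are invented for illustration, not Harvard's actual setup.

```python
# Hypothetical sketch: aggregating records from multiple sources into one
# MongoDB collection that downstream discovery indices can read from.
# Connection details, field names, and sample records are invented assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
aggregate = client["discovery"]["records"]

def normalize(source: str, record: dict) -> dict:
    """Map a source-specific record onto a shared discovery schema."""
    return {
        "_id": f"{source}:{record['id']}",  # namespaced id avoids collisions
        "title": record.get("title", ""),
        "source": source,
    }

def ingest(source: str, records: list[dict]) -> None:
    for record in records:
        doc = normalize(source, record)
        # Upsert so repeated harvests stay idempotent.
        aggregate.update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)

ingest("drs", [{"id": "1", "title": "Annual report"}])
ingest("finding-aids", [{"id": "a7", "title": "Papers, 1920-1950"}])
```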
Thank you. Now, slightly changing tack, I have another audience question on future-gazing that I think is for both of you: what technologies should we be using in our repository infrastructure that we aren't, or, emphasizing the future-gazing aspect, what technologies might be on the horizon that could be really interesting for repositories and that we should consider? Maybe let's start with you, Hussein.

So there are a number of experiments that I have students, and future students, working on that are relevant to this question. The first is the fairly advanced data analysis tools that we see coming out of the computational disciplines and the digital humanities. What I'm getting from users of repositories is requests for services that simply do not exist in any repository. I've had a few students go off and interview a large number of researchers to find out what the expectations are, what they would like if they could have anything they wanted, and what we are getting is a combination of people who don't really know what's possible and others who do know something but are suggesting services that we don't really build into most repository tools. Especially when it comes to things like discovery, the way people think about it is a transition between digital repository tools as we know them, GIS tools as we know them, and computational history tools, all combined into one. So I think there's a lot of potential for really advanced interaction with repositories; our repositories are too simple at the moment. The second thing I'll mention is that in the last maybe 10 to 15 years, the computational scientists have started to build repositories with massive amounts of data. These are people like the astronomers, the physicists, the computational chemists, the climate scientists, and what I have noticed is that they have invented completely different ways of building repositories so that they can deal with very peculiar problems, like being able to slice and dice very large datasets with very specialized functions. I think what we should be doing in the future is learning from what they have developed and incorporating some of those ideas into things like large-scale repositories as well.

Thank you. Stefano, anything to add from your perspective?
Yeah, I mean, it's not an IT conference if you don't talk about AI, right? I preface this by saying I'm not an AI enthusiast, I'm actually an AI skeptic, but I also acknowledge that there are aspects of machine learning that can be very helpful to digital preservation. As I said before, helping with the deposit of things, and depositing meaningful things, is very critical to the fruition of those contents later, and with the growing scale of the input flow, machine learning can help categorize massive amounts of data and bring more accuracy to the material we preserve. So that's something we've been putting out there as a long-term goal for when DRS is operational, efficient, and effective, and we plan to run some pilots on tagging and categorizing things on deposit to make them more useful and meaningful for future users.

I think you're touching on a few really interesting points, and I would assume AI will be on the minds not just of people going to technical conferences at the moment. Maybe I'll ask a question that links into collections and repositories and that's on our minds, I think, here at the University of Chicago, as at probably many other universities that hold a certain amount of data and publications: we're being approached by tech companies who want to use our content for training their tools. Now, obviously some of our content already lives out on the internet, and in particular open access materials are freely available for most, if not all, use cases. What is your view: how far should we aim to make all our content as widely available as possible for AIs to train on, or should there be restrictions or safeguards that we should put up? And if you think there should be, any indications you could give us of what they might look like would be really interesting to hear.

Well, can I jump in to say I'm also an AI skeptic? I've tried having students do automatic classification, well, named entity recognition, on the contents of a historical archive, and it works so poorly that we are staying very far away from any kind of large language model solution at this point in time.
So I'm not sure where this AI movement is going, but yesterday the parent of a student went online; they were using Bing, they clicked on Copilot, and they asked it, who is Hussein Suleman? And Bing Copilot got the answer completely wrong; I think it made some major mistakes in the first sentence, and the parent was completely confused. That's the only reason I got the query, since I was not in any way involved with this particular student. And this was from relatively clean data that was available online, and the AI systems were not able to deal with it. I think we have a lot of progress still to be made in producing results that are reasonable and that meet the standards we generally expect. And given that I see the other side, I see a lot of research where people are building these language models, the work many people are doing seems to be focused on building larger language models that can solve a larger range of problems with greater accuracy, but that will not deal with the absolute requirements people have when they ask questions. So unless and until there's a shift in the kinds of machine learning research going on out there, until we start to make progress in terms of actually getting the kind of accuracy we need, I'm not convinced we should be sharing large amounts of data with these companies that come along asking for it, because all that is going to do is shift the needle very slightly in terms of accuracy; it's not even going to produce a tool you can use at your own institution. Somebody from our learning technologies group asked me this question yesterday, and I can't recommend any AI tool to the university at this point in time, because I don't think there are any tools good enough for the teaching and learning enterprise.

Well, Stefano, you said you're also an AI skeptic; I don't know if you have a different take on this question. Slightly different: my skepticism comes not so much from the technical side. I think things will improve, as in any research field; AI is very green, in a way, and there has been a lot of enthusiasm about it, maybe premature, but I don't see why it shouldn't improve. My skepticism comes from the fact that there are just a handful of companies controlling this, and that's going to concentrate control over an extremely important technology in the hands of a handful of companies, and I'm not going to open that can of worms; that's a very long discussion. But as to opening libraries' repositories to those companies, I don't really have a problem, because if libraries' mission is to make information available to everyone, including for commercial purposes, we should stick to that decision, except if these companies that use our data for commercial purposes eventually obscure and crowd out, through sheer volume, the services we offer that are unbiased and free to everyone. If that were to happen, I would have a problem. But libraries are free to everybody, and everybody means everybody.

I think that's an important point. I was also reminded of a recent discussion linked to AI, where there was a question about whether the access of AI to repository content that has Creative Commons licenses could be reduced, and I think I was the person in that discussion making the point that if you want non-commercial, unrestricted use of your content, then it's very hard to restrict that content.
But I know that discussion keeps coming up in different contexts, where, while we embrace the principle, at least, some of us look at particular edge cases, or by now not even edge cases, and ask: is this what we meant originally when we said open to everyone, or not? So I think it's an interesting tension that we will probably see come up a few more times.

If I may add one little thing about this: I think there is another issue with machine learning and AI datasets, which is that they are mostly biased toward general-purpose, commerce-driven uses. If you feed them library data, most of them won't make sense of it, or will get it wrong, as Hussein found out. There is no library-, archive-, or museum-specific data processing for AI. There is an AI4LAM group, AI for Libraries, Archives, and Museums, that is addressing this, and I think that's a very interesting development: developing language models for cultural institutions that are more meaningful to us.

Yeah, thank you, I think that's a useful reminder. We're coming to the end of the session, but I think we'll have time for one more question, taking a step back from individual repositories and looking at the wider repository landscape or ecosystem, where in the last few years there's been a lot of discussion about next-generation repositories: making repository infrastructure more interoperable and making it easier to exchange information. Bringing this back as a question to both of you, maybe starting with you, Hussein: do you think repository infrastructure where lots of individual repositories are built on your simple design approach will be easier to make interoperable, or do we then need a more complex layer for data exchange, sharing, or discovery?

Oh boy, I have lots of thoughts on this, because I started life just building repositories and then dealing with interoperability. I have to say, I think we haven't spent a lot of time thinking about interoperability of repositories in a long time, and the world has changed; I get the feeling we need to pick this topic up again. We probably need solutions at multiple levels. In the world as I see it in my head, if we have lots of small repositories, there could be a simple way of developing interoperability among small repositories, but as we scale that up and create larger systems, we may need more complex systems for interoperability at that more complex layer. So the type of interoperability we have should match the size and complexity of the kind of repository we are dealing with, in some kind of natural hierarchical system. I think that might work, because, and I've done quite a few experiments with this, I haven't yet been able to figure out how we can build single standards that work at multiple resolutions in terms of scale. So it's still a bit of an open research problem, but right now I'm thinking multiple solutions is where we want to go.

Thank you, that's a really interesting steer. And from a slightly different perspective, the same question, Stefano, for you: how much has interoperability, information exchange with other repositories, and that sort of thinking around the next-generation repository paradigm influenced the approach that you are now taking at Harvard?
Well, in my previous job at the Getty, we were very linked-data friendly; as you may know, the Getty is a provider of many vocabularies that are shared by other institutions. At Harvard we have a very different approach: the repository feeds our publishing systems, so content does get published, but there hasn't been any serious discussion about interoperability, especially with other institutions. That's a topic I really care about, but we haven't had any specific discussion about it yet. And just one note: I really appreciated what Hussein said about OAI-PMH, that developing standards for interoperability that are as simple as possible, as in the case of OAI-PMH, where a developer is able to write an implementation in a day, is a critical point for implementing standards, which in turn means being able to interoperate better.
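Since the simplicity of OAI-PMH comes up here as the benchmark for interoperability standards, a minimal harvester gives a sense of what such a quick implementation involves. This sketch uses only the protocol's documented ListRecords verb and resumptionToken paging; the endpoint URL is a placeholder, and error handling, deleted-record logic, and set support are omitted.

```python
# A minimal OAI-PMH harvester, illustrating how small the protocol's core is.
# The endpoint URL is a placeholder; any OAI-PMH base URL works the same way.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "https://example.org/oai"  # hypothetical repository endpoint

def harvest(base_url: str, metadata_prefix: str = "oai_dc"):
    """Yield <record> elements, following resumptionTokens across pages."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            yield record
        token = tree.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no more pages
        # Per the spec, a resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for rec in harvest(BASE_URL):
    header = rec.find(OAI + "header")
    print(header.findtext(OAI + "identifier"))
```

The whole client fits in a couple of dozen lines, which is arguably the property that made the protocol so widely adopted: the barrier to a first working implementation is a day's work, not a project.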