Hello, and welcome to today's presentation on reinvesting in the institutional repository: redesigning infrastructure, re-architecting the platform, and reviewing policies. I'm Cynthia Hudson-Vitale, the director of Scholars and Scholarship with the Association of Research Libraries, and I'm joined today by Dan Coughlin, head of strategic technologies for Penn State University Libraries, and Seth Erickson, research data librarian for the social sciences within Penn State University Libraries. Before we jump in, I thought it'd be helpful to do a little bit of wayfinding through this presentation. I'm going to spend a quick minute talking about university-based publishing. I'll be followed by Dan Coughlin, who is going to provide a specific case study on institutional repository redevelopment through a technical infrastructure lens. And finally, Seth Erickson is going to finish up, talking not just about additional functionality, but also about the policy restructuring that needs to happen within academic libraries as they think about reinvestment and restructuring. So, today's research libraries contribute to the creation, curation, and publication of scholarship through a variety of services. Many university presses have reporting relationships through their libraries, many libraries have their own scholarly publishing operations, and many provide services for the publication and deposit of research data and associated materials. Critical to the success of any university-based publishing system is meeting external and internal community needs, and we're going to see a fine case study in just a minute from my colleagues at Penn State University, who did just this.
At the Association of Research Libraries, we have a number of major publishing priorities, including supporting OA monographs through our Toward an Open Monograph Ecosystem (TOME) program, building community with our press and library publishing program, and supporting academic research library needs around research data services. Many libraries nationwide are now reinvesting in their local infrastructure and services to meet evolving needs, specifically around big and long-tailed data; preprints and postprints, especially in association with open access policies or transformative agreements; presentations; and OA books and book chapters. So now I'm going to turn it over to Dan Coughlin to talk us through one example of where this is happening, at Penn State University Libraries. Dan, over to you.

Hi, I'm Dan Coughlin, head of library strategic technologies at Penn State, and I'm going to talk a little bit about the life of ScholarSphere, our institutional repository. We're going to discuss the reinvestment and restructuring of our IR based on our most recent work. It's important to know a bit about where we're coming from, so my hope is to provide some additional context for the decisions we've most recently made. I'm going to group my discussion around three separate points in time: investigating and developing our IR nearly ten years ago, in 2012; maintaining, updating, migrating, running, and adding features to our IR over roughly a six-year period from 2013 to 2019; and coming up with a plan and executing on that plan from 2019 to 2020. I'll conclude with where things are at today, so maybe that's four points in time, but really the fourth point is just the conclusion. So, in January 2012, Penn State University Libraries decided to develop an institutional repository for the university's growing data management needs at the time.
The university libraries was also interested in becoming more involved in open source software community development efforts. At that point, many of the people we had spoken with had an existing IR in place; we did not, so we had a lot of freedom to choose a platform. We did not have to worry about the burden of data migration or any of the complexities that would come with that. So we considered investigating an off-the-shelf, turnkey solution such as DSpace; building on a prototype that we had just built, called Curation Architecture Prototype Services, using a microservices approach; or building on top of an existing platform. Ultimately we decided to build on top of an existing platform, using Samvera. We did not want to develop a solution in a vacuum. We thought a group with a relatively common set of problems would be helpful for problem solving and working together, and working within the community would also get our team increased exposure, perhaps helping with recruiting future potential colleagues. We were excited by the promise of working with and contributing towards a larger community. In 2012, our team had a bit of apprehension, I would say, about building this alone, so we were happy to be working with the support of the community and within their set of processes. There was a desire to get a common solution that would be easy to set up for other repositories, for various needs within the libraries. We hoped there would be an ability to plug and play various components or features. Our hope was that with the model we were choosing, building on top of a platform, we'd be able to write code that could be used in other repositories, and we could use code that other institutions had written for their repository needs and build applications more quickly. Our community was initially called Hydra because of the relationship with the mythical creature that has several heads.
So we were considering the potential of running a core storage and discovery infrastructure while developing several heads, or applications, for our various repository needs. We knew this was a bit of a lofty expectation, but we also thought it was a good design principle for us to advance. The initial rollout of ScholarSphere met many of the needs and wants that we had defined in our development requirements. The big takeaways from that initial release back in 2012, building on this community-based platform, were: we quickly met the needs for our first release; we were able to develop a solution that could be extracted for an additional repository, and we did build another repository on top of this; and we were working productively within the larger community, very thankful to have a community of people to bounce ideas off of and to help us with the problems we were running into. The next six or so years is really where the users started using it: the Penn State community started using the repository and testing out what we believed their needs were and what stakeholders had communicated those needs were. You'd start to see people pushing the limitations of the infrastructure and needs evolving in terms of functionality, and at the same time the community platform that we had built on top of was really expanding. Through this time, we had three major updates to our IR, along with a number of minor releases. One was migrating the data object store to a new version. Another was a complete overhaul of the application's user interface. Another was migrating the data model, not just the data store but the data model, to the Portland Common Data Model (PCDM) that the community had defined.
Two of those three major releases were largely community driven. In one case, migrating the data object store, we were among the initial repositories within our open source community to migrate our data storage system, and we anticipated that doing this work early would prevent us from rewriting code that relied on that data storage later. The assumption was: let's get off of this old system now, so that we don't have to write code on the old system and then write new code when we migrate to the new system. In practice, it didn't really work out for us this way; we ended up with more limitations from the migration than feature benefits. Concurrently, we were coming up against other challenges that were proving difficult to solve in a sustainable or scalable way. Large files, which for our purposes meant anything larger than a gigabyte, remained an issue for uploads and downloads that researchers seemed to be encountering more frequently. Our mechanisms for getting around some of these obstacles led us to look at an API for administrators of the system and for other applications to integrate with. For example, if the web browser upload was not working, perhaps we could physically get the file from a user and upload it to the system ourselves. Maybe we could use an API to do this more systematically, but we didn't have an API. When developing new features, we had a consistent tension that we had to balance and struggle with: should this code be contributed back to the community, or written only for our local needs? Our experience was that frequently the devil is in the details, and while several institutions might be interested in a feature based on the conversations we were having, the implementation could be much more detailed and it could be difficult to find common ground.
That complexity led to longer timelines and more difficult planning for locally developed features. In terms of adding infrastructure, because of the scalability problems I mentioned, we enhanced virtual capacity, adding more CPU and more RAM to our systems, and we tried to offload some tasks to other systems. We didn't want the system that our users were interfacing with on the web to be responsible for the heavy lifting of tasks like file characterization, indexing metadata for search, or creating thumbnails, so we created other systems to do that work. Adding these additional components and systems improved the user experience, but made our infrastructure difficult to manage, with certain "infrastructure smells," as we called them. We were continually trying to push our systems to reflect best practices, but the infrastructure smells were essentially anti-patterns, symptoms of a bigger problem. They included storage coupled closely to the application; a lack of flexibility to scale storage or to integrate, being unable to easily spin up a ScholarSphere instance; and a lack of flexibility to decouple small tasks, like the characterization and indexing I mentioned, that may require increased resources; doing that work was extremely complex. In 2019 and 2020, although we were coming up against some of those struggles and continued maintenance, ScholarSphere was considered a successful software project, one with several things we liked and likely took for granted. It was important for us to recognize what features and characteristics of ScholarSphere were a part of that list. The platform was flexible enough to support several current use cases and future needs.
It was developed with a significant amount of community input. Other development teams within our organization were also developing new applications in Ruby, so the programming language we were using continued to be relevant within our larger group, as did the Ruby on Rails framework and Solr. Some of the libraries developed with these frameworks were giving us struggles, but the languages themselves were flexible enough for us to continue our work. Toward the end of the summer of 2019, we were given a 25-gigabyte video file, or rather a set of files totaling 25 gigabytes, to upload into ScholarSphere. The parameters of that request were outside of what we could support from the web interface that a user would typically upload files with, and, like I said, we had no API for our product owner to develop against and work with the researcher to meet this request. So a single developer spent approximately one month working with the data and our system, and we successfully ingested the files into ScholarSphere. At the end of the month, after going through that process, we decided we needed to evaluate our path forward more urgently, because we could not have a single developer spending a month of their time on a single user request. We were in a much different position in the fall of 2019 than we were back in 2012 when we first started developing ScholarSphere: we understood our technology stack more, we understood the repository space more, and we understood our users' needs more thoroughly. So, in October of 2019, we decided to start from scratch, spend about two months developing a new solution as a pilot, and evaluate our path forward after that. Was starting from scratch worth the continued investment, or was it a grass-is-greener type of adventure?
We thought two months of development, building from the ground up, would give us clarity in making that decision. We had three goals to test in those two months: one, we wanted to improve stability and scalability for local needs; two, we wanted to improve our ability to get an environment up for developers more simply; and three, we wanted to be able to onboard new developers more quickly. To test the first goal, we wanted to see if we could upload the files from that 25-gigabyte request from the summer. We took that month-long manual process down to two minutes. So shortly after our prototype proved we could meet the local needs for scalability, we were able to test our second goal, getting a ScholarSphere environment set up easily. The process of setting up a development environment went from days to hours. We had now reached two of our three goals within the two months, and we were doing the development with two new developers, two people who had never worked on this platform. Their being able to contribute to meeting those two goals while starting work on the platform helped prove that we had reached all three of our goals: we were able to more easily onboard new developers. So we decided to continue down that path of new development in late 2019. After a little over a year of development, in November of 2020, a year ago, we released the new version of ScholarSphere. We met the initial development team goals, like I mentioned, and the goals for our users. We used our own internal API as we planned for data migration from our existing Fedora Commons storage system to the new one in Amazon S3. Over the past seven months, we've done nine feature releases, including collections and an enhanced API to support Penn State's open access initiative.
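As a rough illustration of why those large ingests became tractable, here is a sketch, in Ruby since that's the team's language, of the multipart pattern used by object stores like Amazon S3. None of these names come from ScholarSphere's actual API; the point is just that a large file is read in fixed-size parts, each independently checksummed, so parts can be uploaded and retried individually instead of in one fragile multi-gigabyte request.

```ruby
require "digest"

# Hypothetical sketch only, not ScholarSphere's real client code.
# A large file is split into fixed-size parts, each with its own MD5,
# mirroring the S3 multipart upload pattern in which parts are sent
# (and retried) independently and then assembled server-side.
PART_SIZE = 5 * 1024 * 1024 # 5 MiB, the S3 minimum part size

# Yields [part_number, bytes, md5] for each part of the file.
def each_part(path, part_size: PART_SIZE)
  return enum_for(:each_part, path, part_size: part_size) unless block_given?
  File.open(path, "rb") do |f|
    number = 1
    while (bytes = f.read(part_size))
      yield number, bytes, Digest::MD5.hexdigest(bytes)
      number += 1
    end
  end
end

# A real client would POST each part to the repository API, then send a
# "complete" request listing every part number and checksum. Here we
# just build that manifest.
def upload_plan(path, part_size: PART_SIZE)
  each_part(path, part_size: part_size).map do |number, bytes, md5|
    { part: number, size: bytes.bytesize, md5: md5 }
  end
end
```

Because each part carries its own checksum, a failed or corrupted part can be re-sent without restarting the whole transfer, which is what makes a 25-gigabyte ingest a minutes-long operation rather than a month-long one.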
We also have more than doubled the physical storage size of our repository since releasing it a year ago; in a little less than one year, we doubled the size of a repository that had been around for six years. In the summer of 2021, just a couple of months ago, we were able to meet a faculty member's request to upload 30 to 40 videos, 300 to 400 gigabytes in total. That's a request we would never have been able to meet with our prior solution. Many of these technical decisions were made possible by the change in the dynamics of our team, and perhaps the single biggest change was around experience and the confidence that comes with it. Selecting a platform, and an infrastructure to support that platform, can be daunting. It's particularly difficult when you have so many questions in front of you about how the system will be used, the demand it may be under, the need to scale, how to deploy new features, how to update dependencies, et cetera. Our decisions in 2019 were made with much more experience and understanding of what was required of our system, as well as what was desired by our users. And now Seth is going to discuss some of those desired features that we provided with this release in a bit more detail. Thank you.

Hi, I'm Seth Erickson. I'm a research data librarian at Penn State. In my part of the talk, I'll introduce a few of the features and policy changes implemented with the new version of ScholarSphere. The first of these concerns big data. In terms of storage size, research data sets span many orders of magnitude, from kilobytes to petabytes. This range is so big that, as a practical matter, technologies for data management and analysis tend to be oriented toward particular parts of that range, the big end or the small end. This is why big data is often understood in relationship to the tools that do or do not accommodate it.
The trouble with big data, in other words, isn't just that routine data management tasks like storage, transfer, preservation, and curation are more complicated and resource intensive. It's also that the tools, technologies, and strategies developed to work with big data don't always make much sense for all the small data that we also need to store, transfer, preserve, and curate. Big data thus introduces yet another dimension of special considerations for IR architects to navigate. So we approached the redesign as an opportunity to be more intentional and explicit about the types of data ScholarSphere is meant to accommodate. We identified deposits like the one illustrated here, which is about one terabyte with file sizes up to 100 gigabytes, as our sweet spot. To make this possible, we moved to a cloud-based storage architecture, which makes uploads and downloads significantly more reliable and scalable. The object storage paradigm, as opposed to traditional file systems, allows us to accommodate larger files as needed in the future. For now, though, we refer larger data sets to our partner repository, the Data Commons at Penn State, which uses resources provided by Penn State's high-performance computing cluster. Versioning is another important feature we've introduced; the prior iteration of ScholarSphere didn't support work versions, so if a depositor updated a project, there was no way to access the original materials. Versioning is essential for computational reproducibility, as it provides additional assurance to future users that they're looking at the work as it was published at a particular time. ScholarSphere implements versioning at the file and metadata level. A work is defined as a sequence of versions, each with potentially unique metadata and files. Versions are in one of three states: draft, published, or withdrawn. The most recently published version is presented to users as the authoritative version of the work.
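Those versioning rules can be sketched as a small state model. This is a minimal, illustrative sketch, not ScholarSphere's actual implementation; the class and method names are invented. It captures the three states and the rule that the most recently published version is the authoritative one.

```ruby
# Hypothetical sketch of the versioning model described in the talk; the
# names are invented, not ScholarSphere's real classes. A work is a
# sequence of versions, each with its own metadata and files, and each
# in one of three states: draft, published, or withdrawn.
Version = Struct.new(:number, :state, :metadata, :files) do
  def published?
    state == :published
  end
end

class Work
  attr_reader :versions

  def initialize
    @versions = []
  end

  # Every new version starts life as a draft.
  def new_version(metadata:, files:)
    version = Version.new(versions.length + 1, :draft, metadata, files)
    versions << version
    version
  end

  def publish(version)
    version.state = :published
  end

  def withdraw(version)
    version.state = :withdrawn
  end

  # What users see: the most recently published version, skipping
  # drafts and withdrawn versions.
  def authoritative_version
    versions.select(&:published?).last
  end
end
```

Under these assumed rules, publishing version 2 makes it the authoritative version, and withdrawing it falls back to version 1; earlier published states remain addressable, which is the reproducibility assurance described above.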
On this slide you can see the ScholarSphere version history UI, with three versions and the changes made for each version; all three versions here are published. These technical improvements go hand in hand with policy changes that I want to briefly describe. During the redesign of the platform, we formed a policy review task force consisting of curators, scholarly communication specialists, and subject liaisons. The task force first identified shortcomings in the existing policies, reviewed policies from similar institutional repositories, and presented recommendations. Some of the key recommendations are reflected in new policies covering curation and preservation, and you can read the full policy document at the URL on this slide. The policies describe how the libraries will curate deposits to make them findable, accessible, interoperable, and reusable. In most cases we work with depositors to improve the materials they submit; however, the policy authorizes the libraries to take some actions without permission from the depositor. These actions include corrections and enhancements to record metadata and the creation of derivative files for the purposes of preservation and access. The policy also highlights Penn State's participation in the Data Curation Network, indicating that submissions may be referred to specialists outside the university. Finally, the recent TRUST Principles for digital repositories encourage an explicit statement on minimum digital preservation timeframes. So we looked at funders' expectations on data retention and set our timeframe accordingly: we will aim to keep everything for the life of the repository, but after ten years the libraries may remove content that does not warrant continued preservation. Our full policy framework is, again, accessible at the URL here. With that, I'll wrap up this presentation on behalf of Cynthia and Dan. Thanks for your attention.
And we welcome any questions or feedback.