All righty, thank you all for joining us after lunch. It's my pleasure to present this afternoon's session on our work deploying InvenioRDM as an institutional repository platform for data, software, and publications. I'm Tom Morrell, the Research Data Specialist at Caltech Library. My background is actually in computational chemistry; I was a researcher before I joined the library seven years ago to help with institutional data and software management, and that background really helps me focus the services we provide on meeting researchers' needs. I'm going to be talking all about the work we've done bringing up our data and software repository platform. The slides are available on Zenodo, so you can grab the DOI at the bottom of the slides now, or you can just search and find the slides afterwards.

I wanted to start with a little bit of context about Caltech, because it's a bit of a weird place. It has a big scientific impact: there have been 46 Nobel Prizes associated with the Institute. We manage a lot of large facilities, including the Jet Propulsion Laboratory and the Palomar and Keck Observatories, and we co-manage LIGO. At the same time, we're really small: only 300 faculty, 1,000 undergraduates, and 1,400 graduate students. It's like a liberal arts college managing very large scientific experiments. And that makes the library a little bit unique as well. We're a small group, but we have a really big impact. We've run institutional repositories since 2001, and we've got over 100,000 items.

I wanted at this point to acknowledge some of the folks who helped make this project possible. From the digital library group, we've got Robert Doiel, who helped with all of our authentication; Tommy Keswick, who did our theming and CSS work; Mike Hucka, who helped with the GitHub integration; and our fearless leader, Stephen Davison, who's also here. And then on the InvenioRDM migration team, we have Kathy Johnson, who did a lot of the ROR mapping, which I'll talk about a little later; George Porter, who manages CaltechAUTHORS and that massive collection of content; and Tony Diaz, who helped a lot with our data cleanup and testing.

I'm going to start with what I spend most of my time on, which is CaltechDATA. CaltechDATA is our institutional repository for Caltech researchers to upload datasets and software. The reason we wanted our own data repository was that we really wanted to make it easy for researchers to deposit their content. We know there's a ton of really valuable research data sitting on researchers' laptops and on lab servers, and if we don't get that content into a format we can actually preserve and use, it's just going to get lost. We wanted to make this as easy as possible for researchers, to reduce their burden. It's also really helpful for compliance with publisher mandates, as well as funder mandates, for sharing both data and software.

We started in 2017, and it's been really successful. We've got over 26,000 records and over 10 terabytes of data in storage. The vast majority of the records have been automatically generated with our API; we're basically transferring records and files in automatically. But we also have a significant amount of software generated from our GitHub integration, as well as individual one-off deposits, where researchers come to our deposit form, upload their files, and describe them. And we've had submissions from over 6% of our campus users.
So we've gotten a pretty broad base of submissions. What types of things do we have? We have the traditional kind of data, a big sheet of numbers. We also have software. We have simulation results. We have stuff that doesn't fit in other places. We actually have a lot of early AI vision training datasets, like this dataset from the Mars Yard at JPL. We have cases where the photos themselves are the research objects, as in this set looking at some rock samples. And then we have images that are almost art; this is a map of Titan, where they pulled in different datasets and generated this really pretty map. And we've also got three-dimensional data, like this mouse femur. So we've got datasets from all sorts of disciplines and all types of data, and it's all stuff that wouldn't fit into a traditional domain repository. You really need a generalist repository environment to be able to collect this content.

So what were we looking for in a repository? The initial version of CaltechDATA was inspired by Zenodo. We worked with a hosting provider, and we said we really wanted the self-deposit and the ease of use that we saw with Zenodo. And fortunately, because Zenodo is open source software, we were able to basically grab the code and modify it for our needs. Again, the key feature was that it's easy for researchers to go in, describe, and upload their files. And philosophically, the researchers are the ones in control of their records. From the library side, we're able to make suggestions and improvements, but it's really the researchers' data and the researchers' records; they're the ones responsible for accurately describing the work they've done and the files they're uploading. All the records get DOIs. We had an integration with GitHub that could pull software in automatically, as well as an API for accessing data, and that was really critical for a lot of the integrations I'll talk about later on.

The thing was, lots of other institutions had this same idea. They said, oh, we want an institutional version of Zenodo, and because the code was open, we'll go grab it and modify it. The problem with that is you end up with 20 or 30 or 40 different forks of Zenodo, and it's basically impossible for everyone to update and contribute to the same source code base if everyone has their own little fork. We realized this was a problem, and this spawned the idea of InvenioRDM. There are currently, I think, 26 InvenioRDM partners that all have this same use case: they want a repository like Zenodo, but for their institution.

And what features does it have out of the box? It's built on the Invenio repository framework, so it's Python-based, which is really nice for development. It's, again, inspired by Zenodo, but it's designed to be customizable. Theming is really easy, and things like vocabularies are really easy to customize; I'll talk about that in a bit more detail later on. It's designed around data and software; in its bones, it's a data and software repository. But it does support all item types, and if you look at Zenodo, they actually have a large amount of traditional journal publication content. CaltechDATA was an early migration; we had a contractual reason we had to move, so we were early, particularly for a larger repository. Zenodo itself will be migrating this fall.
So what do you get out of the box with InvenioRDM? You get that user-friendly deposit form. Again, we want it to be easy for researchers to participate in this process. I'm going to talk through some of the big features of the deposit form, but really the best thing is to try it out yourself. If you go to inveniordm.web.cern.ch, that's the public test instance of InvenioRDM; you can put in whatever email address you want, and it'll let you play around with the deposit form.

There's autocomplete in a lot of places that makes things easier for researchers. Creators and contributors will auto-fill from ORCID. Affiliations will auto-fill with RORs. You can put in whatever subject vocabularies you want. You can put in award information, so if you have grants that you want researchers to tag their datasets or software with, you can put those in; funders are identified with ROR identifiers. We have drag-and-drop file upload, so it's really easy to add files into the system, and as an administrator you control the file restrictions: how many files you want to allow users to add and what size limits you want. We have automatic DOI registration, so when a record is published, a DOI is automatically minted. We spent a lot of time making sure that all the metadata in InvenioRDM maps to DataCite, so we've got a really nice platform for managing the DOIs. You can do draft records, so researchers can start filling out a record, save it to their workspace, and come back to it later.

The other nice out-of-the-box feature is community record curation. Basically, you can set up a community, say a research group or a lab or a division, and identify individuals who will act as curators for that community. Researchers can then submit their records into that community, and the curators can review them. It's basically a self-service curation environment, and it works really well for empowering researchers to figure out how they want to do their data management and data curation.

So how did our migration go? What did we need to do? Well, we needed to move all of the 20,000 or so records and files that were in the old version of CaltechDATA. We had to customize the repository for Caltech, so the theming, and putting in things like ORCIDs. Most importantly, we had to ensure that the API integrations we had put together continued to work.

So what was our migration strategy? We really relied on standard DataCite metadata, and this was something we had started on early when we were working with CaltechDATA. The original tweaked version of Zenodo had a couple of weird things in the metadata schema that we knew we weren't going to carry forward. So we built some tooling for transforming the internal schema into standard DataCite metadata, and we were able to export all the records, validate that the metadata was correct, and then put it back in. By the time we decided we were going to migrate to InvenioRDM, we had already gone through the process of validating that, yes, we could pull out all of our metadata and all of our files. So we basically did that, and we had an export of everything. We then imported the records into InvenioRDM using the API. This is not the only way to get records into the system; if you're Zenodo and have two million records, there are faster ways of doing it.
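Just to give you a flavor of what one of those import calls looks like, here's a minimal sketch of a metadata-only deposit against the InvenioRDM REST API (assuming the instance allows metadata-only records). The instance URL and token here are made up, and the metadata is trimmed way down:

```python
import requests

# Hypothetical values: swap in your instance URL and a real access token
BASE = "https://data.caltech.example"
HEADERS = {"Authorization": "Bearer MY-TOKEN"}

# One exported record, reduced to minimal InvenioRDM metadata
record = {
    "access": {"record": "public", "files": "public"},
    "files": {"enabled": False},  # metadata-only here; files can be attached separately
    "metadata": {
        "resource_type": {"id": "dataset"},
        "title": "Example migrated dataset",
        "publication_date": "2020-06-15",
        "creators": [{"person_or_org": {
            "type": "personal",
            "given_name": "Josiah",
            "family_name": "Carberry",  # made-up depositor
        }}],
    },
}

# Create a draft record, then publish it, which mints the DOI
draft = requests.post(f"{BASE}/api/records", json=record, headers=HEADERS)
draft.raise_for_status()
rec_id = draft.json()["id"]

done = requests.post(f"{BASE}/api/records/{rec_id}/draft/actions/publish",
                     headers=HEADERS)
done.raise_for_status()
print("Published:", done.json()["links"]["self_html"])
```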
This was really useful for us because it allowed us to validate that the API used for generating records would work for everything we had in the collection and was behaving as we expected. And once everything was moved over, we switched: we swapped our domain name over to the new InvenioRDM instance. As usual, it took a little while for the single sign-on configuration to get sorted out, but generally, for our users, there wasn't a significant change. All their records were as they were before, and they could easily generate new records.

During the migration process, we did some metadata enhancements. ROR identifiers didn't exist when CaltechDATA started; there wasn't really a place for them in the system. So we took the opportunity of the migration to do some enhancements. We started with an automated mapping from our affiliation strings to ROR identifiers using the ROR Retriever from Metadata Game Changers, followed by manual validation. That manual validation step, having a human in the loop, is really important, because short affiliations like JPL don't map automatically particularly well, and you'll also have lots of National Institutes of Health entries; which one is the correct one? You need a bit more context to be able to make that decision. We also mapped and split free-text affiliations. In the old version of CaltechDATA we didn't have the ability to record multiple affiliations per contributor, so in cases where people had put multiple affiliations in the same field, we were able to split those up and map them to RORs. We also mapped funders to ROR. And we did minor cleanup, like splitting subjects: even though we had the ability to enter individual subjects, people really liked putting in subjects separated by semicolons and commas, so we cleaned all that up. You can take a look at what we did; our migration scripts are up on GitHub.

We also did theming, and as I said, InvenioRDM is designed to let you theme it. So we made the front page look exactly like we wanted, with our orange everywhere. We get theming on the landing pages, which is really nice for identifying where you are, and even on things like the login page. This is actually where Tommy spent most of his time: figuring out how to direct people to our single sign-on while still allowing local logins if needed.

The other thing we were able to implement as part of our migration was our Caltech people list. This is a prior library-wide effort to identify which people are associated with Caltech and what their ORCIDs are. It's useful for a whole bunch of things; it powers our metadata feeds service, which is basically an aggregated view of the metadata in our repositories, useful for things like faculty websites and reporting. And since we already had this vocabulary, we were able to add it to InvenioRDM as a names vocabulary. What that enables is that when you're adding a creator to a record, you can start typing somebody's name, and if they're in the system, it will automatically say, oh, this is probably the person you're talking about. Once you click that person, it auto-fills their family and given names, their ORCID, and their affiliation, which has a ROR behind it. So it makes it really easy to add people to records.
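Under the hood, an entry in that names vocabulary looks roughly like this. This is a hedged sketch based on the stock vocabulary file format, with a made-up person:

```yaml
# Hypothetical entry in app_data/vocabularies/names.yaml
- given_name: Josiah
  family_name: Carberry                   # fictitious researcher
  identifiers:
    - scheme: orcid
      identifier: "0000-0002-1825-0097"   # ORCID's well-known test record
  affiliations:
    - name: California Institute of Technology
```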
I want to spend most of my time talking about what we do with APIs. The three integrations I'm going to talk about are the Cell Atlas, TCCON, and microPublication. We actually did a CNI digital briefing that covers some of this in more detail, so if you find this interesting, go check out the briefing.

The Cell Atlas is a really cool online book publication project: an open access textbook on the microbial cell. It's really neat and interactive. There are over 150 videos with text and narration; there are overlays, sliders, 3D protein structures. It's a really cool way of looking at cells. For this audience, the important thing is that the videos and other media that make up the site are stored in CaltechDATA, and all those records were created automatically with the CaltechDATA API. So when we migrated, we had to make sure all that automation still worked. The 2.4 release of the Cell Atlas was the first one we produced with InvenioRDM already running, and we didn't have a whole ton of changes to make to get it working. We had to add ROR identifiers and make a couple of other minor metadata changes, but basically, we ran the script and uploaded the content. This is a shot of the PDF version in the new version of CaltechDATA. The nice thing is it now has built-in versioning, so you can look between different versions and see what's changed in that right-hand panel.

Another API automation we have is TCCON. This is a consortium of about 29 data collection sites around the world. They're looking at solar spectra and back-calculating concentrations of small molecules. All those data files come to Caltech for curation and processing, and once they're ready, they get uploaded to CaltechDATA for public access. This is a migration we did a while back; these files were previously going up to Oak Ridge National Lab. The automation steps we need are, first, a monthly update: every month we say, okay, what data can we release to the public? We pull in the metadata, generate the files, and upload them to CaltechDATA. We also have to deal with new versions. These are netCDF files, time-series files; generally the sites are just adding new data to the end of them, but if they have to go back and reprocess data, we have to generate a completely new version. So we use the built-in versioning in InvenioRDM to say, okay, now we're going to do a new version with new files, and I'll show you what that looks like in a second. And finally, sometimes new sites join the consortium, and that's just a completely new record with new metadata. So we had these three automation processes that we needed to make sure would work.

And they do, and we've actually gotten some improvements from InvenioRDM. The first is that we now have a community landing page for TCCON. This is also available on their website, because they like to format it the way they like, but it's nice that there's a native view of this in our repository. There's also now support for automatic versioning. On the right, you can see an example of a record where there's been a new release, and the versions are now listed in that right-hand panel. There's also a warning at the top that says, hey, you're looking at an old version of this record; there's a new version, and that's really the one you should look at. We were doing this manually in the old version of CaltechDATA, so it's nice that this is now completely automated.
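Mechanically, a reprocessing release boils down to a few REST calls. Here's a hedged sketch, again with a made-up instance URL and token, of how a new version can be created and published through the InvenioRDM API:

```python
import os
from datetime import date

import requests

# Hypothetical values: swap in your instance URL and a real access token
BASE = "https://data.caltech.example"
HEADERS = {"Authorization": "Bearer MY-TOKEN"}

def publish_new_version(record_id: str, netcdf_path: str) -> str:
    """Create, populate, and publish a new version of an existing record."""
    # Start a draft for the next version of the record
    resp = requests.post(f"{BASE}/api/records/{record_id}/versions",
                         headers=HEADERS)
    resp.raise_for_status()
    draft = resp.json()
    draft_id = draft["id"]

    # Register and upload the reprocessed netCDF file
    key = os.path.basename(netcdf_path)
    requests.post(f"{BASE}/api/records/{draft_id}/draft/files",
                  json=[{"key": key}], headers=HEADERS).raise_for_status()
    with open(netcdf_path, "rb") as fp:
        requests.put(
            f"{BASE}/api/records/{draft_id}/draft/files/{key}/content",
            data=fp,
            headers={**HEADERS, "Content-Type": "application/octet-stream"},
        ).raise_for_status()
    requests.post(f"{BASE}/api/records/{draft_id}/draft/files/{key}/commit",
                  headers=HEADERS).raise_for_status()

    # A new-version draft needs a fresh publication date before publishing
    draft["metadata"]["publication_date"] = date.today().isoformat()
    requests.put(f"{BASE}/api/records/{draft_id}/draft", json=draft,
                 headers=HEADERS).raise_for_status()

    # Publishing mints the new DOI version; the old landing page
    # automatically gets the "newer version exists" banner
    requests.post(f"{BASE}/api/records/{draft_id}/draft/actions/publish",
                  headers=HEADERS).raise_for_status()
    return draft_id
```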
The final API integration is microPublication Biology. This is a really cool, innovative journal published at Caltech. They focus on single findings: not full papers, just one single piece of information, maybe one figure or one panel of a figure from a traditional publication. These can be novel findings, or they can be negative findings (something you thought was going to work and didn't), or they can be a reproduction of a finding that somebody else had. And it often lacks the overall narrative you need for a traditional publication. You can basically just say, here is this piece of data, I trust it, but I don't really know what it means in the context of a whole organism or something like that. These are peer reviewed, and they've got a really nice peer review platform that they manage. The cool thing for me as a data person is that all the data files are automatically uploaded to the appropriate repositories. They look at, is it a nematode? Is it a Xenopus? Is it a fly? And they put those data files in the right format and send them to the appropriate domain repositories. So it's a really cool, innovative publishing model.

Where CaltechDATA comes in is for supplemental files that don't map to one of those disciplinary repositories, such as software or CSV files; microPublication uses CaltechDATA for those files. This is offered as part of our library publishing services; we also offer things like DOI minting and help with indexing, and this kind of supplemental data management is just part of our normal publishing services. And it's completely automated using the CaltechDATA API. Unlike the two previous examples I talked about, the microPublication team implemented this independently. I basically sent them the API documentation, and they said, okay, we've got it, no problem. A couple of days later, they had records flowing into CaltechDATA. That indicates the general usefulness and success of these APIs: independent teams can work off the documentation alone and send in records.

And so we were able to complete our migration successfully. We moved all of our content by our contract deadline; we got everything over, and everyone was happy. All the API integrations I talked about continued to work. And by moving to InvenioRDM we got really significant improvements to the landing pages, as well as the automatic versioning I talked about. GitHub support is coming soon, hopefully in the next couple of months.

Now I want to talk about what's coming next for Caltech Library and our repository systems. I mentioned CaltechAUTHORS at the beginning. It's our large institutional repository, with over 100,000 works by Caltech authors, and these get a lot of use. We have an organic chemistry textbook that's the number two Google hit for "organic chemistry PDF"; it's had over a million downloads. Same with our textbook on chemical reaction engineering and its solutions manual. So we get a lot of use of this content, and it shows the real value of hosting it as an institution. At the same time, this content has been hosted in EPrints since 2004. That's a long time for a repository system to survive. We've done upgrades and such, but I went to the Wayback Machine to look at a 2004 version of a landing page in CaltechAUTHORS.
And you can compare it to the current version of the landing page, and it looks remarkably similar: the same little PDF logo, the same order of metadata. It hasn't changed all that much in 20 years. So the exciting thing we're working on is migrating all of this content to InvenioRDM. It's going to be a big lift. It's a lot of records. We have to capture all of the customized metadata that's accumulated over the last 20 years, and we also have to fully redirect all the old URLs, because we get a lot of traffic to the old URL paths. But it will hopefully allow us to build much more automation for record creation using APIs, and that's something we just can't easily do in EPrints.

So where are we at? We're working through this and doing lots of customization. One example is resource types. We've worked through all the resource types that come out of the box with InvenioRDM and all the resource types that are in EPrints, and we're also adding new ones. For example, I showed you how important textbooks are to CaltechAUTHORS; we've now added a custom resource type for textbooks so we can separate those out from the rest of the books.
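To give you a sense of what that customization looks like, a resource type is just another vocabulary entry. Here's a hedged sketch of a textbook entry, modeled on the stock resource_types.yaml format but trimmed down, with illustrative prop values:

```yaml
# Hypothetical entry in app_data/vocabularies/resource_types.yaml (trimmed;
# the stock entries carry additional props, e.g. for OpenAIRE mappings)
- id: publication-textbook
  icon: book
  props:
    csl: book                    # citation style mapping
    datacite_general: Text       # DataCite resourceTypeGeneral
    datacite_type: Book
    subtype: publication-textbook
    type: publication
  tags:
    - depositable
    - linkable
  title:
    en: Textbook
```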
And just as a teaser, this is where we're at. We're automatically transferring records over using the API. We're mapping a lot of the metadata; we're not complete yet, but we've got a good chunk of it in place. And as a result, we now get really nice landing pages that have a previewer and a citation you can style. We'll have identifiers for people and affiliations, though not in this record yet. It's going to be really exciting, so be on the lookout for this later this year.

It's time to wrap up. In conclusion, InvenioRDM is a powerful open source platform for institutional repositories. We successfully migrated CaltechDATA by focusing on standardized metadata. And customized resources can utilize API integrations, which really allows research groups to display and use the data the way they want to, while allowing the library to preserve and maintain the actual underlying data files. And we're in the process of migrating all of our library institutional repositories to InvenioRDM. If you have any questions, feel free to email me. It looks like we have about five minutes for questions from the audience, so please come up to the microphone and ask me whatever questions you have about our repository work.

So this is totally not fair, because I know the answer, because I work with you, Tom. But if I were sitting out here in the audience and didn't work with you, I would be wondering about the time and staffing levels that have gone into this, because that's the kind of thing I wonder about when other places talk about the work they're doing. So can you say a little bit about that for us?

Sure. Well, as I said, we have a very small library, and we've been able to do this without a ton of staffing. I'm the primary person on the CaltechDATA work, and as I showed at the beginning, we've got a digital library group that helped out with components of it. And then we had a team from across the library, particularly for the CaltechAUTHORS migration, helping with all the metadata work and cleanup. So in my opinion, you can run your own InvenioRDM repository without a ton of specialized staffing.

And there was some discussion earlier of, like, $75 million budgets for research data services, and we are running CaltechDATA on a very, very, very minimal fraction of that. And I think folks can chat with you more if they want more specific numbers. We've got a question over here.

Matt Mernick, oh, sorry about that, from the National Center for Atmospheric Research. Thanks for the talk. Could you comment on the GitHub integration a bit more, on the software archiving part? Is it similar to the Zenodo GitHub integration, I assume?

Exactly, yeah. What we had in the original version of CaltechDATA was exactly what Zenodo has: you log in with your GitHub credentials, check which repos you want to preserve, and it automatically generates the DOI for you. What we're actually going to launch, hopefully in a month or two, is a little bit different. We're going to be launching a GitHub Action, so something you configure on a per-repository basis, and GitHub triggers the push into the repository. That allows a little more control over metadata. We've also been able to pull in a really comprehensive CodeMeta and CITATION.cff mapping, so if there's metadata in the GitHub repository, we're able to pull that out and map it into the InvenioRDM schema. So it should be a really nice, clean interface for doing software preservation.
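Just to give a flavor of the CITATION.cff side of that mapping, here's a toy sketch. This is definitely not our production code, and the field choices are illustrative:

```python
import yaml  # PyYAML; CITATION.cff files are YAML

def cff_to_inveniordm(path: str = "CITATION.cff") -> dict:
    """Toy mapping from a repo's CITATION.cff to InvenioRDM-style metadata."""
    with open(path) as fp:
        cff = yaml.safe_load(fp)

    creators = []
    for author in cff.get("authors", []):
        person = {
            "type": "personal",
            "given_name": author.get("given-names", ""),
            "family_name": author.get("family-names", ""),
        }
        # CITATION.cff stores ORCIDs as full URLs; InvenioRDM wants the bare ID
        if "orcid" in author:
            person["identifiers"] = [{
                "scheme": "orcid",
                "identifier": author["orcid"].rsplit("/", 1)[-1],
            }]
        creators.append({"person_or_org": person})

    return {
        "resource_type": {"id": "software"},
        "title": cff.get("title", ""),
        "creators": creators,
        "description": cff.get("abstract", ""),
        "version": cff.get("version", ""),
    }
```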
Thank you.

Sorry, hey, Josh from AWS here. I was just curious about the size of the infrastructure you're running.

That is an excellent question, and one I didn't get into in much detail. Storage is the big challenge for us. All of our storage at the moment is on AWS S3; in InvenioRDM you have the option of local storage or S3 as the destination for the files. That obviously has costs, so the way we've managed it to date is we give researchers 500 gigabytes of free storage, and then we do a chargeback model if researchers want to share more files. We are looking at options for local hosting of much larger files; we've done a pilot with the Open Storage Network through some XSEDE/ACCESS support for the larger files. But that's something we're really actively working on: figuring out how we'll manage the terabyte-to-petabyte-scale data that's floating around campus.

Got it, thanks for that. And then in terms of the compute?

The compute, because we're a small repository, is basically set up as a single AWS EC2 instance. You can deploy InvenioRDM a whole bunch of ways; there's a Kubernetes distribution that CERN uses, and there are a couple of other ones that different sites use. But for us, because we're a smaller site, we don't have a whole ton of traffic or a whole ton of users, and a single EC2 instance has worked well for us so far.

Got it, that makes sense. Just one more real quick one: how many users would you say you have on average accessing the system, maybe per day?

Oh, that's a good question. I don't have a complete number. In terms of Caltech users, it's a handful per day, because we're a small institution. For data access and downloads, that's a much larger number, but I don't have a good number on hand. I have the numbers; I just don't have them off the top of my head.

Got it, awesome, thank you.

Thanks so much. That is the end of our time for this session. But because this is the weird half-hour session in the hour slot, I'm going to hang around here for a while. I also have InvenioRDM hex stickers, so if anyone wants InvenioRDM hex stickers, just come down and grab some. Thank you all so much for the opportunity.