Oh, there it is, 3:05. Okay. I was just gonna say we'll start at 3:05 p.m. So welcome, everyone. My name is Heidi Arbisi-Kelm. I work at the University of Iowa. I'm a USETDA board member. And so it's my privilege and honor to introduce our guests who are here to speak to you today about DOIs and your thesis: Catherine Johnson and Tom Morrell from the California Institute of Technology. I'll let them both introduce themselves and their roles. If you wanna put your questions in the Q&A, I will be monitoring that for them. And then any chat among participants, feel free to put in the chat. Take it away, Tom and Catherine. Welcome. Okay. Hello, everybody. As Heidi said, I'm Catherine Johnson. I am at Caltech and so is Tom. I am the thesis librarian and officers librarian. I get a little confused because my title changes every five minutes. Tom is our data specialist, and he is basically my savior, if you wanna call it that, when it comes to a lot of these technical things. So I'm gonna do the first part and then we're gonna switch over. Just to give you an idea of what Caltech is about: it is an independent, privately supported institution with a 124-acre campus in Pasadena, California. So Rose Bowl, Rose Parade country. And we have approximately 1,000 undergrads and 1,250 graduate students. So we sort of flip what most universities do. We get about 200 to 250 PhD theses each year. So, like I said, it is a small institution, but it's heavily STEM, both for undergraduate and graduate students. We have six academic divisions. We have biology and biological engineering, chemistry and chemical engineering, engineering and applied sciences. Yes, there's a theme here. Geology and planetary sciences, humanities and social sciences, but only social sciences produces PhD candidates. And physics, mathematics and astronomy is the last division.
So like I said, we are heavily STEM and it does affect what kind of dissertations we receive. So the bigger question is why do we use DOIs? Well, open sharing initiatives are spreading, not just at Caltech but everywhere. Open data, open science, open access, open source, all those things. And the vice provost for research at Caltech heavily promotes open science and they wanted theses to be open. And so they established a dissemination policy, and we had that in place on March 1st, 2017, to encourage open science. And you can see in our slide why we decided to do DOIs: they provide a great way to promote open sharing. You don't have to memorize these really long URLs. You just memorize the short DOI, digital object identifier, a very short string, easy to copy and paste, and you move on. And it also makes sure, for us, that the students get credit for their research, because a lot of our students actually publish ahead of time and those publications are included as chapters in their dissertations. So it promotes scientific reuse of your work. If you wanna share your work, then somebody else can give you credit and make sure that you get credit. I'm repeating myself, sorry. So one of the things that we do is add the DOI to the ORCID record, not just in our repository but also through ORCID. And I hope there are people from ORCID here today. If not, then they're getting free promotion here. We have also found that by doing this we do get more citations and people can monitor their results. One of the things that I found is that since we started using DOIs across the board with our dissertations, we get a lot of requests from people for permission to reuse some of the students' content. Or, for a while, all our dissertations were restricted to campus, and we would get requests to open them up to public use.
And like I said, it's made a big difference having the DOIs. It just does; they are more visible. Next screen. Okay, next one, I think we're repeating here. Okay, so I think we can, is it rolling by itself, Tom? I can't hear you. I stopped it. Okay, well, we're supposed to be on number three, I think. Maybe I've made it through. Keep going, one more. All right, well, that's a good place to be. So when we first started with DOIs, that was in 2016, and we went through the California Digital Library's EZID program, and at the time it was pretty straightforward. They could use all the metadata that we wanted to add as identification in the DOI records, and it was affordable, so we decided to do that. Then in, sorry, that was in 2013, but in 2018, the CDL, the California Digital Library, decided that they were only going to allow the University of California campuses to use their services. So we had to quickly find an alternative. And Tom and my former colleague were the ones that spent a lot of time trying to figure out whether Crossref or DataCite would be a better option for us. And so in 2018, after going back and forth, we decided that DataCite would really be our better bet, and I don't mind explaining why that was, because I think it's still the situation: Crossref works well for journal articles, that's what they're designed for, but for dissertations, their metadata options just weren't there. And DataCite was a lot more flexible, and we were able to include all the metadata that we had so carefully added to our, how many DOIs do we have already? We had over a thousand DOIs already residing in EZID's database. So we didn't want to lose any of the work that we had put into EZID. So when we went with DataCite, we were assured that we could keep all the metadata, and we've actually increased the kind of metadata that goes into the records since then. And Tom will talk a little bit about that. I'm gonna go to the next screen.
Catherine, before you move on, there is a question. That's okay. So, question for Catherine: you mentioned that adding DOIs to ORCID increases the visibility of the ETD and that people could view their own results. Where do your alums view their own results? It's more that they can view their results in our repository. They can see statistics that will show them how many views and downloads they've had in our repository. And Tom, you may wanna add something to that. You might, I don't know. No, I mean, I think we've got usage statistics that are built into the repository, and that's accessible to every user. Yeah, and you don't have to be a specific person. You just have to figure out what to look for to find the stats. Okay. Okay, so like I said, 2013, 2018, we switched. I'm trying to find what screen I'm supposed to be on here, guys. All right, so going back to what we had in EZID, we had 10,000, I'm sorry, 1,090 DOIs assigned to dissertations alone. And then another 97 EZID DOIs belonged to Caltech's data centers. We have several Caltech data centers, and they've increased since then, but at the time we had two. I think it's LIGO, so the Laser Interferometer Gravitational-Wave Observatory. So they essentially look for gravitational waves in space. And then the other one was the, what is the name of that one, Tom? Southern California Earthquake Center. No, the one about the air pollution. Oh, TCCON, that was after. TCCON, okay. We have lots of DOIs. Yes, we have lots of DOIs. So like I said, we decided to go with DataCite just because they were a lot more flexible and allowed more metadata. So right now we have, I'm on slide nine, Tom. Caltech Library has the ability to generate DataCite DOIs using one of several workflows. You have to understand that we built up to this point. Originally, we were using DataCite's web form and doing them manually, basically copying and pasting from the thesis metadata records to the DataCite web form.
And very quickly, we ramped up. Actually, Tom created scripts where we could essentially just give the system the record ID number, and then it would grab the metadata that it needed and create a DataCite record. And then the next step after that was to re-import the DataCite record information back into our thesis record. And the next step after that was essentially to be able to do that as a batch push, because at first we did it one by one. And now we're at the point where we can do it in batches and do automatic updates. So it's the same process whether you're creating a new DOI record or just updating the DataCite DOI record. The script doesn't care, but it will tell you if it's just updating it. So it's not like it's creating a brand new DOI every time you do an update, which is fantastic. Catherine, Heidi again. Sorry, before you move on to your next slide, there are a lot of questions about ORCID, the relationship between ORCID and DOI that you mentioned. I actually also have a question myself. So I'll read this question and then add what I, as a grad college staff person, would like to know. So the question was, do all of your students have ORCID iDs, or do you just search to see if a student has a registered account and then link the DOI to the ORCID? I would also like to know how the DOI is linked to the ORCID, and I apologize for being obtuse that I don't know that. Because to me the ORCID seems like it's the unique identifier for the student. The DOI is the unique identifier for the object. What's the link, and then how does that look? I'm gonna answer the first part and Sam can, I mean, sorry, Tom can answer the more technical part of it. Okay, so the first part of it, and, sorry, we have the mailman coming, not every grad student has a DOI, I mean, an ORCID. We find that it actually varies by division. So for example, the, right, so the ORCID, yeah.
So who has an ORCID really varies by division. Some of the divisions have gone all out and said everybody must have an ORCID. But Caltech is one of those places where you can't tell people what to do. So we certainly don't have the power to do that. So we sort of wait for them. We do have the chance to talk to grad students in various settings throughout the year. And one of the things that we do is tell them: when you create your ORCID, please make sure that your affiliation is publicly available. Because, if you've seen how ORCID records are set up, you can actually minimize the amount of information that gets shared publicly. So we ask them to make sure that their name and their affiliation are publicly available so that we can search. But what really helps us with dissertations is that in the past two years, was it one year or two years, we added an ORCID field to the author field in the thesis metadata records. You have to understand that our grad students, actually undergrads too, but our grad students in particular, create their own thesis metadata records. We don't do that for them; they do it. And we just vet the record to make sure nothing's missing and that everything is accurate. So they have to add not just their names, but also their email address and their ORCID information. The ORCID information does remain public, so a reader from outside could go into the metadata record and see the ORCID, then they could click on the ORCID link, which will take them to the ORCID record in the ORCID database, and then they could see what other things the author has there. Now, you can talk about the DOI part, Tom. Yeah, so the DOI part is actually pretty simple. Once you've gotten an ORCID in the metadata record, so basically once the student has provided that, we just include that in the metadata that goes to DataCite.
And once that's part of the metadata, then that will automatically trigger linking that back into the ORCID record. So that's assuming the student has authorized DataCite to have access to their ORCID record; ORCID has a whole process for that, and they'll send out an email and a little notification. But once they've said that, yes, DataCite is a trusted provider that can add works to their record, then once the DOI is registered, that automatically gets added to the record. So basically the key thing for us really is having a collection point in the deposit form where the students can put in that information. It's not as fancy as some of the OAuth-authenticated ORCID stuff, because our repository doesn't really support that. But even just having a form where you can have the student drop in that 16-character ORCID, we've seen that to be really helpful, both in adoption on the thesis side as well as on the data repository side. Okay. And just so you all understand, Caltech likes to go its own way a lot of times, and we actually use EPrints as our main repository for our text-based content. And the reason we have EPrints is because we have had a repository since 2001. And at that time, a lot of the more prevalent repositories just did not exist. So we went with what was out there. And one of the things that we particularly appreciated about EPrints at the time was that it was open source. So we were able to manipulate the metadata fields, basically enhance them, just add more stuff to them. And the ORCID field is one of the latest examples of what we've added. And the other repository that Tom is responsible for is called CaltechDATA. And that's where a lot of our supplemental material goes. And that is a... Tom? It's an Invenio-based repository, which is also open source. And it allows us to do a lot of fun things with the DOIs. Which we'll talk about in a bit. Right.
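The ORCID hand-off described above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the script used at Caltech: the function names are hypothetical, the check-digit routine is the ISO 7064 MOD 11-2 calculation that ORCID documents for its identifiers, and the `nameIdentifiers` layout follows the DataCite metadata schema as expressed in JSON.

```python
ORCID_URL = "https://orcid.org/"


def orcid_check_digit(base15: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 16-character form, e.g. 0000-0002-1825-0097."""
    chars = orcid.replace("-", "").upper()
    if len(chars) != 16 or any(c not in "0123456789" for c in chars[:15]):
        return False
    return orcid_check_digit(chars[:15]) == chars[15]


def creator_with_orcid(family: str, given: str, orcid: str = "") -> dict:
    """Build a DataCite-style creator; attach the ORCID only if it validates."""
    creator = {
        "name": f"{family}, {given}",
        "nameType": "Personal",
        "givenName": given,
        "familyName": family,
    }
    if orcid and is_valid_orcid(orcid):
        creator["nameIdentifiers"] = [
            {
                "nameIdentifier": ORCID_URL + orcid,
                "nameIdentifierScheme": "ORCID",
                "schemeUri": "https://orcid.org",
            }
        ]
    return creator
```

Validating the value typed into the deposit form before it is sent on is a design choice of this sketch; the talk only says that the student-supplied ORCID is included in the metadata sent to DataCite.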
Okay. So before I get into a demo, because there will be a demo shortly of how quickly we can do this: like I said before, we don't only create original DOI records in DataCite through a script, but we can also use the same script to update the DataCite DOI records when a change is needed in a Caltech thesis record. And I actually found one that needs to be done. So I will show you what kind of process I would go through to do that. You can see how quickly it's done. All right. Tom, do you wanna talk about the automation, slide 11? And then we'll do the demo after that. Yeah. So our motivation for this project was really that we have a lot of metadata about our theses in our EPrints repository. Basically it's all in XML. And we wanted to give all of that to DataCite, because once you make your metadata available, other people can do cool things with it. And we didn't wanna have folks cutting and pasting stuff. Oh, sorry about that. So the key part of this, and the hardest part in any sort of DOI automation, is this metadata transformation. You have to go from what metadata you have in your repository system to the DataCite metadata standard. And this is what we spent most of our time working on. And all of our code is available; if you wanna take a look at what we're actually doing in terms of the code, it's all up on GitHub. And we started by basically using just the upload form that DataCite provides. So if you have a DataCite XML or a DataCite JSON file, you can upload it in their web form and get a DOI. So the minimal thing one needs to do to get a DOI is to do a metadata mapping. And the question that often comes up with folks is why didn't we use a repository plugin? EPrints is an older repository system, but there is actually a DOI minting plugin. Why didn't we use it? Because we wanted to have all of our metadata mapped.
And this is particularly critical for the theses, because we have a lot of unique fields, and we wanted to control how they show up in DataCite. Our current version has 43 different DataCite metadata fields. So we map a lot of different stuff, and there's custom handling for stuff like ORCIDs and dates, majors, minors, what groups are associated with a thesis. And we have a lot of records that we care a lot about. This is Andrea Ghez's thesis; she was a 2020 Nobel Prize laureate. So we have a lot of records that are really important. We wanna do a good job of making sure that they're accurate and correct and as full as possible. And in here, you can see that we've got her ORCID, we've got a ROR identifier, we've got specific subjects, specific dates. And so all the stuff in how it maps is really custom for our use case and how our repository is configured. To give you a little more detail on what that metadata mapping looks like: we have fields in EPrints that we can then map to DataCite. So creators and contributors, those map basically directly, but we wanted to add an affiliation. We know everyone that's gotten a thesis at Caltech is affiliated with Caltech. Some of the mapping is simple, it's like renaming. We have it labeled as given in our XML, and we map it over to given name in DataCite. There's some stuff that we always add, like that we're the publisher of the thesis. We add version information, we add publication years. We map some of the dates. For some records, we know when they were approved by the grad office; for some that were digitized later, we only know the year or we only know one of the dates. So we have custom logic for handling those specific dates. We put in the thesis type, subject keywords, we put the majors and minors in there, we fill in any funding information that we've collected, rights information, related identifiers, so we can link in other DOIs.
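A rough Python sketch of the kind of field mapping described above follows. It assumes a record that has already been parsed into a dictionary; the input keys (`creators`, `family`, `given`, `date_approved`, `date`) and the function names are hypothetical stand-ins, and the choice of the DataCite date types `Accepted` and `Issued` is an assumption of this example rather than something stated in the talk.

```python
import re

# Full dates, year-month, or a bare year are all acceptable to pass on.
DATE_PATTERN = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def map_date(value, date_type):
    """Return a DataCite-style date entry, or None if the value is unusable."""
    if not value:
        return None
    value = value.strip()
    if DATE_PATTERN.fullmatch(value):
        return {"date": value, "dateType": date_type}
    return None


def map_record(record: dict, institution: str) -> dict:
    """Map a simplified repository record onto DataCite-style attributes."""
    creators = []
    for person in record.get("creators", []):
        creators.append(
            {
                "name": f"{person['family']}, {person['given']}",
                "givenName": person["given"],    # source field: "given"
                "familyName": person["family"],  # source field: "family"
                # Every thesis author is affiliated with the institution.
                "affiliation": [{"name": institution}],
            }
        )

    # Older, digitized records may carry only a year or only one date.
    candidates = [
        map_date(record.get("date_approved"), "Accepted"),
        map_date(record.get("date"), "Issued"),
    ]
    dates = [d for d in candidates if d is not None]

    mapped = {
        "titles": [{"title": record["title"]}],
        "creators": creators,
        "publisher": institution,  # always added, not read from the record
        "dates": dates,
    }
    if dates:
        mapped["publicationYear"] = dates[0]["date"][:4]
    return mapped
```

For a digitized thesis where only `"date": "1992"` is known, this yields a single `Issued` date of `1992` and a publication year of `1992`, and no approval date is invented.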
So we have a lot that have connections to existing published papers or CaltechDATA DOIs, which I'll talk about in a bit. So we really have control over what that record looks like. And this is one of the reasons why we like being able to do this as a custom metadata mapping. What's our technical approach? How do we do this? We started with an existing Python library. I code a lot in Python, so I wanted to work in Python, and this is one that helps manage translations of the DataCite metadata. And it uses a JSON schema to generate that final XML version. Okay, so what does it do? It basically takes that metadata record from EPrints XML, pulls it into Python, does all the field mapping, and then we get that DataCite XML. So that was the first step, just get the metadata in the right shape. Once we verified that that worked by manually doing everything, we added in automation. So we added an automated downloader that will automatically grab the thesis metadata. And then we added an automated upload, so it can send it to DataCite automatically. Then we added in writing that DOI information back into the EPrints repository. So we built this up in stages over time, as we tested and verified and saw that it was working. Okay, so now we're gonna do a demo, and Kathy's gonna show you how this actually works. Yeah, okay. I am going to screen share real quick. So hold on. I'm gonna show two records first. So, I gotta figure out how to do this, so hold on. Chrome tab, okay. There we go. Okay, so the first one I'm gonna share here is actually a thesis record that was very recently deposited, so 2021. You will see that our thesis records are very plain, and we've done this on purpose because we don't wanna spend a lot of time just doing the pretty bells and whistles. We want the functionality and everything else.
So the reason I'm showing you this is because I want you to see all that we can do with our records. So you'll see right here you have the title, the abstract, the type of record, the division, major options, and so on. The funders is something that we ask the students to add, but also the related URL. So you'll see a lot of these students, for example, have published content. So they will add URLs, or DOIs in this case, for articles within their dissertation. So it has the DOI for the actual article, but we also have at the very top, scrolling back up, as part of the citation, the DOI itself. And then down at the bottom we have the ORCID record right here. So if somebody were to click on this ORCID, it would take them to the ORCID record where all the student's publications are listed. The only thing that we have control over is whether or not the dissertation gets uploaded. But what I wanna show you here is how easy it is for us to actually update a record. And then, I'm gonna stop sharing because I wanna switch screens. How easy it is to update a record and then update the DOI record as well. So I found a record that needs to be updated. So I'm gonna move to that one. And that is this one right here. So you will look at this and you see that the title has a typo in the very first word. So I'm going to go down to this record. Whoops. Details, edit item. And I'm going to make the correction. UI, save and return. And what I need actually is this number right here. 6772, that's the record ID number. And now I'm gonna share, sorry, I am going to share the other screen. It's a little confusing, I know. Okay, this is where I go. It's an app on my machine. And we just needed to upload the software. So this is where I need to go. And I'm gonna type. So I'm telling the system, use Python, go to Caltech thesis with the Python extension. And then I'm going to tell it to, oops, sorry. And it doesn't matter whether the record is getting a new DOI or whether I'm just updating it.
So 6772, that was the record number. It's gonna ask me to log in to the system, hopefully. Is it, okay, there we go. So obviously I have to give it permission, okay. So it just tells me we have updated the metadata with DataCite. So DataCite now has the updated record, and this information right here, which is the DOI that DataCite originally assigned to it, goes back to the record. I'm gonna stop sharing this screen and go back to sharing, where's my Chrome tab? Where's the silly thing? All right, there we go, open up. Oh, sorry. Okay, so this DOI has not changed. The only thing that changed during this whole action was that the DataCite DOI record now has the correct title included. So that's how quick it is. And if I were to do a batch, what I could do is download a list of ID numbers, and then I could run that text file, essentially, with the numbers, in this same process. Instead of saying IDs, I would just say, go find this file and run it. So how long did it take you to, so, backtrack. One of the things that we did was update all the old records, basically back to the beginning of time of what we have in the database. And Tom did it essentially in two batches, and that was really for financial reasons more than anything. So talk about that, Tom. Yeah, so if we need to batch update everything, it just takes a little while for the script to chug through stuff. So it takes a few hours, and you just say, I would like to update everything, and it goes through DOI by DOI, sends it over to DataCite and updates everything. So, yeah, we've now added DOIs to all of our theses, and as we get new ones in, it's now just part of the standard approval process: once the theses are ready, they get a DOI and off they go. So I wanted to talk a little bit more about what we've done in the metadata. We've continually added updates to keep on top of what DataCite adds.
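The batch run and the "create or update, the script doesn't care" behaviour described above could be sketched along these lines in Python. This is not the Caltech script: the function names, the one-ID-per-line file format, and the record layout are hypothetical. What is drawn from DataCite's REST API is only that a new DOI is created with a POST to the `/dois` endpoint and an existing one is updated with a PUT to `/dois/{doi}`. The sketch plans the requests without sending anything.

```python
from pathlib import Path

API_BASE = "https://api.datacite.org/dois"


def read_ids(path) -> list:
    """Read record ID numbers from a text file, one per line."""
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            ids.append(line)
    return ids


def plan_request(record: dict) -> tuple:
    """Decide whether a record needs a new DOI or an update to its DOI."""
    doi = record.get("doi")
    if doi:
        # The record already has a DOI: update it in place, never mint anew.
        return ("PUT", f"{API_BASE}/{doi}")
    return ("POST", API_BASE)


def plan_batch(ids, lookup) -> dict:
    """Map each record ID to the request that would be sent for it.

    `lookup` is any callable that returns the repository record for an ID.
    """
    return {record_id: plan_request(lookup(record_id)) for record_id in ids}
```

Passing the record lookup in as a callable keeps the planning step testable without a repository or network connection; a real run would replace the planning step with authenticated HTTP calls.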
And we've switched over to a completely JSON-only workflow. So instead of dealing with an XML file, we actually work just in JSON and then send things to the REST API. What has changed in the metadata mapping? Well, the biggest thing that's changed for us is identifiers. So, ROR identifiers: if you haven't heard of them yet, they're a fairly new open identifier for research organizations. So these are top-level organizations, universities like Caltech. You can search for yours at ror.org, and you can think of them like ORCID. They make it very easy to distinguish different names. So for example, you can think of Institute of Health, but what Institute of Health? The one in the US, the one in Germany, the one. So the RORs are a unique, open, persistent identifier for a specific organization. Theses are a really easy use case for ROR identifiers because we know they've all graduated from Caltech. Everyone who has a thesis here is affiliated with Caltech. So that was actually a very easy metadata mapping addition. What's the interest? Tom, there's a quick question. Where do funds for minting your DOIs come from? They come from the standard library budget. So the way that DataCite works in terms of the pricing, once you get over a couple hundred DOIs, you're in a kind of pricing bucket, where, I forget the exact structure, but you get a bucket of DOIs, basically like 5,000 or 10,000 or something, for one price. And so basically we've been able to schedule when we do the DOI minting for the old theses for a point when we know we're gonna be in one specific pricing bucket. For the new theses, we have to give them a DOI as they get approved; the students want it. But for the old ones, we don't have as much time pressure. So basically we can schedule when we mint all those DOIs, so we can control the costs.
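As a hedged Python illustration of the JSON-only workflow and the ROR addition described above: the sketch stamps every creator with the same affiliation and wraps the attributes in the JSON:API envelope (`data` / `type` / `attributes`) that the DataCite REST API accepts. The function names are hypothetical, and the ROR value and DOI used in any example call should be treated as placeholders, not real identifiers.

```python
import json


def add_ror_affiliation(creators: list, name: str, ror_id: str) -> list:
    """Return copies of the creators with one ROR-identified affiliation.

    Because every thesis author graduated from the same institution, the
    same affiliation block can be added to every creator without a lookup.
    """
    affiliation = {
        "name": name,
        "affiliationIdentifier": ror_id,
        "affiliationIdentifierScheme": "ROR",
        "schemeUri": "https://ror.org",
    }
    updated = []
    for creator in creators:
        copy = dict(creator)
        copy["affiliation"] = [dict(affiliation)]
        updated.append(copy)
    return updated


def build_payload(doi: str, attributes: dict) -> str:
    """Serialize DOI attributes as a JSON:API document for the REST API."""
    body = {"data": {"type": "dois", "attributes": {"doi": doi, **attributes}}}
    return json.dumps(body, indent=2)
```

A call such as `build_payload("10.1234/example", {"creators": add_ror_affiliation(creators, "Example Institute", "https://ror.org/00example00")})` produces the request body; sending it would additionally require repository credentials and an HTTP client.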
And DataCite is more flexible about that because they do have that kind of bucket price structure. Crossref is more, you know, you pay per DOI. I think Crossref also has some way to do backfiles, but it's been reasonable for us to work it into the regular standard library budget, even though we're direct members. And there are actually options, if you're looking at a consortium membership, that are a little bit cheaper. So I think in general, the costs that we put into the DOIs are way less than a lot of the other library services that we pay for. So in my mind, it's a pretty affordable win. But why do you care? I just said, we're spending money, we're spending some of the library's budget on this. What's the impact of doing all this work and getting the metadata to show up? Well, adding the ROR identifiers, which is one of our most recent additions, allows those thesis records to be easily associated with Caltech. And when you register that metadata, cool things happen. So this is a view in DataCite Commons, which is DataCite's new search service. If you go and search for Caltech, you search for a ROR identifier, you get a nice graph of all the theses that we have. They've got a nice little data. You can say, oh, look, 2016 was a big year for theses. Or no, 2013 was a big year for theses? Yeah, so we've got all this nice stuff: we've got some graphs, we've got the work types. You can really allow people to use your data and get a view of your theses without having to do any of this work. I don't have to make this graph, somebody else did it for me. So that's a really great addition. The other thing that I wanna talk about briefly is thesis supplements. So I run CaltechDATA, which is our data repository. And DOI minting is a critical part of the functionality of the data repository. On the back end, it's JSON metadata. So it's very similar to what DataCite wants.
And we send all of our metadata when we register the DOIs. It actually uses that same Python library that we did the other automation with. So I spent a lot of time with this Python library. And the thing that's relevant for this group is we host these related supplements. And what is the workflow? Well, the benefit of splitting off the thesis supplements from the main thesis is that it allows students to upload their supplements at any point. So CaltechDATA is available to everybody on campus. So let's say a student is writing the first chapter of their thesis, right? They finish it off and they have some data files they used when they were doing their analysis. They can go to CaltechDATA, upload those, get a DOI that they can drop in their thesis text, and then they can embargo the data files. If they don't want people to see them yet, that's fine. They can say, okay, I'm gonna graduate at the end of the year, I'm gonna embargo it till then. So they can upload strategically as they go, as they're writing their thesis. And then once they get to the actual submission process, they can drop those DOIs in the thesis metadata. What we've added is an automatic linking. So every day we go and check: what are the new theses? Are there any CaltechDATA links? And if so, we'll add those in. So if somebody finds the data set, they can go and see the thesis. If they find the thesis, they can go see the data set. So it puts the content types that are the best fit in the right places. And it makes a nice workflow for the students as they're writing their theses. Another example that I like to talk about is some geology theses. So a lot of the old printed theses have the main bound thesis, and then they have cool stuff, cool maps and diagrams and stuff, in the pocket in the back. And these were partially digitized, but not necessarily consistently, and they were just kind of attached to the thesis record.
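The daily two-way linking described above can be sketched in Python as follows. This is an illustration only: the record layout, the function names, and the idea of recognising data-repository DOIs by their prefix are assumptions of this example. The `relatedIdentifiers` structure and the paired relation types `IsSupplementedBy` and `IsSupplementTo` come from the DataCite metadata schema.

```python
def add_related(metadata: dict, doi: str, relation: str) -> bool:
    """Add a related-identifier entry once; return True if it was new."""
    entry = {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation,
    }
    related = metadata.setdefault("relatedIdentifiers", [])
    if entry in related:
        return False  # already linked, so a daily re-run changes nothing
    related.append(entry)
    return True


def link_thesis_and_data(thesis: dict, datasets: dict, data_prefix: str) -> list:
    """Link a thesis to its supplements in both directions.

    thesis:   {"doi": ..., "related_dois": [...], "metadata": {...}}
    datasets: mapping of dataset DOI to that dataset's metadata dict
    Returns the dataset DOIs for which a new link was added.
    """
    changed = []
    for doi in thesis.get("related_dois", []):
        if not doi.lower().startswith(data_prefix.lower() + "/"):
            continue  # e.g. a published article, not a data supplement
        if doi not in datasets:
            continue
        new_a = add_related(thesis["metadata"], doi, "IsSupplementedBy")
        new_b = add_related(datasets[doi], thesis["doi"], "IsSupplementTo")
        if new_a or new_b:
            changed.append(doi)
    return changed
```

Making `add_related` idempotent is what lets a scheduled job run every day over the newest theses without piling up duplicate links.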
So a project that we did, we looked at, now these are the examples there. So we did some metadata enrichment. We went through all of the bound printed theses. We have some really amazing library staff that were able to do this. And they looked for things like keywords, descriptions of what the plate was as opposed to what the whole thesis was, any dates that were present on the plate, and most importantly any geographic coordinates. So for this plate, it's associated with the thesis, with these keywords. This is specifically a geological map of the Del Valley area. Different theses will have possibly different locations, different maps. We've got a date that is specific to the map, as well as the coordinates. And so we took all these and put them into CaltechDATA as thesis supplements. And then, because we make all that metadata available, all those geographic coordinates are now available, and I can pull these out and drop them on a map. So now I can take a look and say, okay, where has thesis work been done from our geology division? It can go over time. So we've got stuff from the 20s all the way to the current 2020s. And I can say, okay, well, what work has been done in the LA area? And then each of these boxes is associated with a map. And if you highlight one of those, it tells us what the title is, what the author is, what the year is. And if you click on it, it'll link via the DOI to the item in CaltechDATA. And if you look at the preview version, that's now specifically, oh, look, here's the Channel Islands with this nice cool hand-drawn map. And from there, you can then go over and look at the actual thesis record. So you can go back and forth between the actual thesis record and the supplements and where those supplements are occurring on a map. So what's the impact?
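To make the "what work has been done in the LA area?" question above concrete, here is a small Python sketch that filters records by bounding box. The `geoLocations` / `geoLocationBox` keys and the four bound names follow the DataCite metadata schema; the function names and any coordinates used with it are hypothetical, and the overlap test deliberately ignores boxes that cross the antimeridian.

```python
def boxes_overlap(a: dict, b: dict) -> bool:
    """True if two DataCite-style bounding boxes share any area or edge."""
    return (
        float(a["westBoundLongitude"]) <= float(b["eastBoundLongitude"])
        and float(a["eastBoundLongitude"]) >= float(b["westBoundLongitude"])
        and float(a["southBoundLatitude"]) <= float(b["northBoundLatitude"])
        and float(a["northBoundLatitude"]) >= float(b["southBoundLatitude"])
    )


def records_in_area(records: list, area: dict) -> list:
    """Return the records with at least one map box overlapping the area."""
    hits = []
    for record in records:
        for location in record.get("geoLocations", []):
            box = location.get("geoLocationBox")
            if box and boxes_overlap(box, area):
                hits.append(record)
                break  # one matching plate is enough for this record
    return hits
```

The `float` conversion is there because coordinate values in exported metadata may arrive as strings; the returned records still carry their DOIs, so each hit can link back to the supplement and from there to the thesis.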
With DOIs, we've been able to uniquely identify both theses as well as the supplemental data for theses. And by fully registering our metadata, we can make our content visible, either for the theses themselves or for the supplements. And that is also bringing us right up to time. I was just gonna say you two are perfect with the questions and everything. Look at this, yeah, I'm impressed. Thank you. So something that this, go ahead. I was just gonna say, if somebody wants to contact us afterwards, since we're running out of time, you are more than welcome to do so. Well, I wanna thank you both for your time and this invaluable information. You've cleared up a lot, I think, and definitely opened up a whole new world of possibilities for the things that we've been doing at Iowa. I'm sure our librarians, like I said, I'm in the graduate college, I'm sure that they are well aware of some of these opportunities, but I hadn't known that you could connect the DOIs to the ORCID iD, and just the other ways in which you have shown us your process, the automating that you've done, and the tools and systems you're using to try to economize. Thank you for all of that. Does anyone have a final last question, or do either of you have any final comments that you wanna leave us with? No, but I do hope that you came away with something useful for you. We do realize that we are sort of the odd man out when it comes to repository software and stuff like that, but we think what we do is applicable to other repositories. That's one of the reasons we use open source. Exactly, right? So if you can get metadata out of your repository in a standard fashion, then doing the types of scripts and the types of automation that we've shown here really is not that complicated. And this is not like a project that we put a ton of staff on; it's basically me and Kathy.
So this is achievable for, I think, anyone who's running a thesis repository. Well, we'll leave it there. Thank you both very much. It looks like we've gotten a note to head over to a new session, which is starting. So thank you again. Okay. Thank you, everybody. Take care. Bye, everyone. Bye-bye.