My name is Stephanie Labou. I am the data science librarian at UC San Diego, and I've brought you all here today to talk about how to make research repositories more robust, with all the big buzzwords. I do want to briefly distinguish between AI, artificial intelligence, and ML, machine learning. They are often used interchangeably, but there are differences, and for this talk I'm really going to focus on machine learning, which is a type of AI method or approach. It's a little more restricted in terms of its input data and its method structure, which means it's potentially a slightly better fit for traditional research data repositories. So we're talking about data and code, things we can actually put into a repository without necessarily needing Amazon Web Services to hold them. It's a more established use case, and we can learn lessons from working with machine learning research outputs that we can then apply to AI components. As I'm sure everyone here knows, there's really no escaping machine learning. It is more popular than ever. We have a rise of accessible and relatively cheap computing power, there's broad social interest in using ML and AI, and there are tons of government and funder directives to share research outputs, a lot of which are going to involve some aspect of AI or machine learning. So from a repository perspective, there's really no getting around the need to start thinking, if you're not already, about things like machine learning. However, we don't really know what we know and what we don't know. So before we can talk about best practices and what research data repositories can do to make adjustments, we really need to know what people who actually do ML are doing: how they're working with their data, how they're working with their code, and how they're thinking about and documenting their research.
And so: what is being shared, and where, and how do our current schemas and infrastructure help or hinder findability and reusability of these incredibly valuable research components? When I talk about research components, there's a lot going on. Machine learning is pretty complex. Again, we're mostly focusing on ML, so you can think of AI as a scaled-up version of this. In ML, we might have our training data, we might have test data, we've got code. Pictured on the screen is ML Schema, which is a proposed ontology for machine learning. It includes tasks, models, algorithms, hyperparameter settings. There's a lot going on here, and you don't need to know all of it. The takeaway is that this structure is great for documentation (there's a lot of good information here), but it does not fit with most research data repository schemas. There's a lot here that is hard to capture in existing fields. So we need to think about how we can take all this information and put it into research data repositories without blowing everything up and having to rebuild all our repositories from scratch, which nobody is going to do. To get a better idea of what's going on and how ML really fits into our existing structures, my colleagues at the UC San Diego Library and I looked at eight different repositories that contain ML-related content. We put these into buckets of specialist repositories and generalist repositories. Specialist repositories really specialize in machine learning: they're built for and by machine learning practitioners, usually for sharing data or code or both. This includes repositories like the UC Irvine Machine Learning Repository (great name, tells you upfront exactly what it's for), and then OpenML and Kaggle, which, if you haven't used either of those, are ML and data science platforms that have a repository component.
Now conversely, generalist repositories are the ones we're probably all thinking of. This is Figshare, this is Zenodo. They don't necessarily have any domain focus: any domain or discipline can deposit there, and they accept all different types of formats for deposits. So this is probably more appealing, or at least more familiar, to researchers who might not be 100% ML practitioners. These are going to be political scientists who are starting to use machine learning, or folks in other domains where machine learning is suddenly an option because they have access to more computing power. There's also a lot in these generalist repositories, obviously, that's not machine learning, so it's an interesting comparison to look at how these different research outputs fit into these different repositories. This of course isn't exhaustive; there are a ton of machine learning repositories springing up all the time. These are the ones we wanted to use because they're bigger and a little more established. We talked to some data science folks at UC San Diego, and these are the ones that came up. So we want to know what's working well in these and what probably needs to be adjusted. For shameless self-promotion, we do have a paper coming out soon in the Journal of eScience Librarianship. To summarize: there's a lot of machine learning content in these repositories, and it's only increasing, especially over the past 10 years. I'm not going to dig too deep in this talk into the actual content and file formats, but it's really interesting to see what people are sharing that they have self-described as machine learning. One of the takeaways is that the files are a lot smaller than you think. So where is all this big data going? Where is it being stored? Because it's not in here.
But what I really want to focus on for this talk is the structure and schema of the repositories, independent of what people are actually putting in there in terms of file formats, because those are things repositories can change: the infrastructure, and the kind of documentation we expect for data deposits. Our group settled on about half a dozen recommendations for areas of improvement, and a lot of this is related to what metadata fields are needed and, honestly, what should be mandatory, so we stop giving people an option to opt out. But I really want to focus on two of the main themes. They may not be unique to machine learning research outputs, but they're really necessary to consider: the importance of rich, content-specific metadata, and the absolutely crucial part played by access at scale. Okay, first things first: rich and content-specific metadata. Don't panic, I'm not going to ask for a whole overhaul of a repository, but I am asking us to really think about searching, findability, and reusability. The way people search for and think about ML-related outputs, code and data, is really different from the way generalist repositories make this data public-facing, available, and downloadable. For instance, platforms like Zenodo and Figshare emphasize access, license types, file formats, and domains. Those are all incredibly valuable; that's really crucial for reusability. You want to know that you are downloading something with a CC BY license, right? It's not going to be restricted. However, I work a lot with data scientists in a teaching and learning context, and they do a lot of machine learning. I get a lot of requests for datasets for teaching students, and for students to just get practice with. And it's never couched like this. It is along the lines of: I'm looking for data. Don't care what topic. Don't care what domain.
Don't care what discipline. It should have at least 100 variables. I would like a mix of numeric and categorical. I'd like it to be amenable to a time series analysis, and it should have at least 10,000 instances. That's really hard to get out of a generalist repository unless you're already familiar with a very specific dataset. So that findability piece is completely different. What is searchable in specialist repositories is far more focused on the content: in tabular data, for instance, how many rows, how many columns? Is this going to be used for regression? Are we going to be doing a clustering analysis? Do we want to do multivariate stats? Are we doing some kind of time series? Conversely, all those questions about domain, and regrettably all the questions about licensing, are not really surfaced in specialist repositories. The focus is much more on that kind of descriptive metadata: what is going on in the content? So again, both interfaces are useful; it's that the use cases are different, and the ways researchers search for and think about data and code are going to be different. So what are some of the small steps generalist repositories can take? I'm a big proponent of the idea that generalist repositories can learn from specialist repositories, and specialist repositories can learn a lot from generalist repositories and their long history of documentation. For generalist repositories like institutional repositories, the ideal, and it would be a big ask, is additional metadata fields that really get at that content-specific metadata. I know that's not going to be an option in most cases. So a nice workaround, and one we came across in a few of these, is really structured and standardized README templates for ML. It doesn't mean you have to add extra fields specific to one method or approach, but it does mean that people using that method or approach can record that very specific information.
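To make that search scenario concrete: here is a hypothetical sketch of the content-based filter those teaching requests imply, assuming a repository captured dataset-level structure (instances, features, variable types, suitable tasks) in structured fields. The catalog layout and every field name and value here are invented for illustration, not taken from any real repository schema.

```python
# Hypothetical sketch: searching a repository by content-level metadata
# (structure, variable types, task suitability) instead of topic or domain.
# The catalog and all field names/values below are invented for illustration.
catalog = [
    {"name": "fish_catch_stats", "n_instances": 36_500, "n_features": 120,
     "var_types": {"numeric", "categorical"}, "tasks": {"regression", "time series"}},
    {"name": "survey_sample", "n_instances": 800, "n_features": 15,
     "var_types": {"numeric"}, "tasks": {"clustering"}},
]

def find_datasets(catalog, min_instances=10_000, min_features=100,
                  need_types=frozenset({"numeric", "categorical"}),
                  task="time series"):
    """Return dataset names matching structural criteria, topic-agnostic,
    mirroring the kind of request described above."""
    return [d["name"] for d in catalog
            if d["n_instances"] >= min_instances
            and d["n_features"] >= min_features
            and need_types <= d["var_types"]
            and task in d["tasks"]]

print(find_datasets(catalog))  # only fish_catch_stats meets every criterion
```

The point of the sketch is that none of these filters mention topic or discipline at all; that is the search a generalist repository's metadata usually cannot answer.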
So: number of rows, number of columns, whether you're doing a certain type of analysis, and all the information someone else doing machine learning would need to know. There are structured templates out there. They're not necessarily common, and there's no one-size-fits-all, but it is at least a start. Again, I've been talking mostly about data in my examples, because that's what's going to be in a research data repository, but the same is true for code. For instance, say someone is looking for Python code: I'm looking for all the Python code that has used a certain package popular in machine learning, because I want to test my model using those same workflows. That's maybe not findable in a generalist repository beyond searching by file format, if that's even available. But if there is a detailed README that goes through all the different Python packages used in the project, that's at least a step forward in terms of surfacing the information and technical details needed to do machine learning. So thinking about how folks find and search for these different components is one aspect of this, right? If we have ML content in our repository that we want people to find, or if we want people doing ML research to say, I do want to put my stuff in this great repository because it's a good fit for my work, that's one use case. The other thing to think about is that there's all kinds of stuff in repositories that maybe isn't classified as machine learning but would be great for machine learning or AI. If you have an image collection that is culturally significant, that might be a great computer vision project, right? We have in our repository, for example, 100 years of fish catch statistics from the California Board of Fish and Game.
And I keep telling people about it, because no one is excited about it, because it doesn't sound interesting. But it's got a time series component, it has geographic differences, you can look by species, you can look at prices. It is tailor-made for a machine learning project, but it's under a cultural heritage subcollection that no one's going to get excited about. We want to make all that available to people, and for people to get excited about it. We want them to use what we have in our repositories, because, especially for libraries, this is good stuff. It's often publicly funded, federally funded. It's going to be well documented. We've got text and images and numeric data. And I cannot emphasize enough how important that good documentation and a commitment to preservation are, which is not going to be the case for all repositories. But here's the thing: it's not that easy to just tell someone, we've got all this cool content, you should absolutely use it, for a number of reasons. Access at scale, and how easy that is, is really, really important for people doing machine learning research and AI. If it is too hard to download in bulk, or even to web scrape or access programmatically, people will look elsewhere. They will move on. It doesn't matter how cool it is or how well documented it is; no one is going to sit there and manually download a bunch of different files. They're just not. Maybe this doesn't apply to all repositories; maybe not everyone needs to consider this. But it is worth knowing that ease of access can be a deciding factor, and that's something to think about early on, rather than 10 years from now, when you wonder: why is no one using the excellent content that I have? So here's what this looks like in practice. This is an image dataset from the UC San Diego Library digital collections. It's got multiple objects.
Each of these objects has multiple image files, and each of these has to be downloaded separately. This is another instance where the content would be fabulous for a computer vision project, but it's really not easy to download in bulk. No one is going to sit there and manually download hundreds and hundreds of image files, all of which, when downloaded, will be named something like b7x2935, right? That's not useful. That's not helpful. Conversely, on the other side of the screen is an example from the UCI Machine Learning Repository. There is one download button, and it gives you a zipped folder with labeled subfolders that contain labeled images. One download, and you get everything you need. It's far less work for the end user. So if you had a choice, you would probably choose the UCI Machine Learning Repository version. I would, and I know better, right? But I would choose it because it's easier to access. And again, there are valid reasons why our repository is organized the way it is, based on internal workflows, our discussions with data depositors, and the purpose of the deposit, which in this case wasn't to provide an image dataset for machine learning purposes. That's fine. But it does mean we're closing off future opportunities for researchers to use this image collection in things like machine learning or AI. And again, this is good content that we want people to use. When we talk about garbage in, garbage out, and why models aren't doing well: we've got a gold mine of data, but we have to make it easy for people to access. So inevitably, this comes around to APIs. I know it's not everyone's favorite topic, but an API allows programmatic access to the metadata and the data. And ideally, it also means you don't have to change your interface and re-bundle everything, right? You can push that off to the end user. Now, we actually don't have a public-facing API.
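For a sense of what such programmatic bulk access could look like, here is a hedged sketch against a purely hypothetical JSON API. The endpoint, response layout, and field names are all invented; as just noted, no such public-facing API currently exists for our repository.

```python
# Hedged sketch of bulk access through a repository API. The endpoint,
# JSON structure, and field names are all hypothetical.
import json
import urllib.request
from pathlib import Path

BASE = "https://repository.example.edu/api"  # invented endpoint

def readable_name(item_title, file_label, ext):
    """Build a human-readable filename from item metadata, rather than
    an opaque internal ID like b7x2935."""
    clean = lambda s: "_".join(s.lower().split())
    return f"{clean(item_title)}__{clean(file_label)}{ext}"

def download_collection(collection_id, dest="downloads"):
    """One programmatic call per collection: list every item, then fetch
    every file, replacing hundreds of manual downloads."""
    out_dir = Path(dest)
    out_dir.mkdir(exist_ok=True)
    with urllib.request.urlopen(f"{BASE}/collections/{collection_id}/items") as resp:
        items = json.load(resp)
    for item in items:
        for f in item["files"]:
            urllib.request.urlretrieve(
                f["url"], out_dir / readable_name(item["title"], f["label"], f["ext"]))
```

The design point is that the repository's public interface doesn't change at all; the re-bundling work (and the useful filenames) happens on the end user's side.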
If we did, someone could go in and say, I want every single item from this collection. We don't have to change our public-facing interface; the user does that extra bit of work, and in theory they get everything they need, as easily as a single download button. There are a lot of questions that remain here, right? What format do you return to people? How do you want to nest your metadata in a hierarchy? Do you give them JSON? All kinds of things like that. And it's not something where you can magically go back and ask the IT people, make me an API, which I keep asking our IT folks to do. It's not that simple to make an external-facing API. It takes time, it takes resources, it takes maintenance, and if you do it wrong, you've set yourself up for a potential security risk. So it's not nothing; I want to be clear that I recognize that. But it is something that's becoming more and more common. Even in this project, where we looked at eight repositories, all of them had APIs except UC San Diego and the UCI Machine Learning Repository. And the amount of extra time and hassle it took to get the metadata we were looking at was, quite frankly, annoying. We had to web scrape it, and we had to hire a grad student to help us write code to web scrape it. If it hadn't been one of the main ML repositories, if I hadn't needed it, I would have skipped it entirely, which I think is very telling. So if there is an API option available, I think it is something to consider. And, because I know this audience, this is very, very important: resources are limited, and use cases differ. Not every generalist or institutional repository needs to make all these changes. No one can be all things to all people. For some repositories, though, these small adjustments, whether adding fields, changing the workflow to use standardized templates, thinking about an API, or making some fields mandatory instead of optional,
this might be a good idea. And I think it is a good idea to look at the next five or so years, especially because there is money now, in theory, for AI-related upgrades. This is a little bit of both: you are doing some basic infrastructure upgrades, but it also falls under the umbrella of making our collections AI-ready, which is a really hot topic. And for other repositories, of course, the most fitting action will be no action at all, right? Again, no one is going to be the one repository for all researchers in all domains and all methods. So I think it's worth thinking deeply about the specific purpose of the repository and the return on investment for some of these changes, balancing those resource limitations against really increasing the value and reusability of the content we have put so much time into collecting, documenting, and preserving. I don't have a definitive answer, I apologize, but I think now is the time to be thinking about this and having these discussions, before less robust, less well documented, and less preservation-focused alternatives grow up organically to fill the space. They already are: the things I've seen on Kaggle will give me nightmares in terms of lack of documentation, but that's what everyone is using. Unless we can present a compelling better option, we're going to lose out on tons of content and on the chance to train researchers on best practices. So I hope that was a little bit of food for thought. I wanted to leave plenty of time for questions, and I did want to close by acknowledging funding from the Librarians Association of the University of California and our Research Data Curation Program at UC San Diego. We do have a paper coming out if you want to dive way more into the technical details of what this means in terms of crosswalking metadata fields and provenance metadata and descriptive metadata. But for now, that's where I want to stop and say thank you so much for your attention; I would love to answer any of your questions.
Hi, I'm Athena Jackson, I work at UCLA. Thank you for your talk. I was curious if you could please unpack, and I know you were talking about findability, I missed part of the premise, so I'm owning that, but you went very briefly into what teaching with the data implied for how people were searching for it. I was just curious: in your research, was there an idea of, is this pedagogically ready? Or is there something in that you're thinking about? Because I'm very interested in the intersection of the teaching and learning aspect and research, particularly at UCLA, where we're really trying to push that undergraduate research agenda. So what were you all thinking when you were thinking about the teaching and learning aspect? Yeah, great question. Part of this I'm going to tee up for David Minor, who is my boss and is giving the next talk; it's going to get a little into this. But it is something we're thinking about, because on the one hand, we have researchers submitting content; on the other hand, we have a data science undergrad program and three graduate programs at UCSD, plus a ton of grad programs that are computational in some way. Everyone wants to do ML. Everyone wants to do it pedagogically correctly. So it's something we're thinking about, and one of the steps is really surfacing the ways students are being taught to think about data, which again is more about structure and format and variable type and number of instances. It's not just "it's a spreadsheet," right? It's numeric data that's in a tabular format, it's wide, it's got these components; that tends to be the way datasets are framed for them. But it's so rapidly evolving. Data science as a discipline taught at the undergraduate level is not that old at UCSD; it's five years old. It started right when I did, and they've gone through multiple iterations of their curriculum.
So I think it's still in flux, and I'm keeping up with what they're doing without trying to overstep. But I think one place where libraries in particular can be helpful is reincorporating what we see in generalist repositories about licensing, because that is a key piece that has gotten overlooked, shall we say, in some of the other repositories. The attitude is: it's there, I found it online, I can web scrape it. Students have the technical capacity to access things they really shouldn't be accessing, for reasons that sound boring and that no one is talking about in the courses. So I think that's one area where libraries in particular can build a really good pedagogical partnership. Hi, I'm Dan Katz from the University of Illinois. I have a little blue thing on the bottom of my badge that says first-time attendee, so I may be missing something. I just wanted to comment that I think a number of the things you were saying match communities that you didn't mention, and so I was curious if there are connections into those communities, particularly in machine learning, in software, and in data. In machine learning, within the Research Data Alliance there's a group working on FAIR for machine learning, and part of what that group is doing is trying to figure out what metadata standards there are and how to unify them so we can do federated search across different repositories. In FORCE11, there was a software citation working group that created a group called SciCodes, which is a group of repositories and registries that store software; those 38, I think, currently, are all talking to each other about what the standards should be for software, what policies they should have, things like that. And in data, there's the NSF FAIROS RCN program, one of the projects of which is led by Christine Kirkpatrick at UCSD.
So I guess the fact that you weren't mentioning any of these communities made me curious whether there are connections to them that just weren't at the level of this talk, or whether there are connections we should be making to try to broaden some of these things so they have even more impact. Yes, big fan of all those communities, and aware of them; I didn't cite them in this talk mostly because our project grew out of seeing those amazing communities form and the work being done, and realizing: they're going to come up with the definitive standards, but that's going to take a couple of years, right? We needed to do something right now, locally, for our repository, because we have this huge influx of researchers bringing us machine learning research outputs, and we couldn't find anything that already existed. We're excited to see what comes out of those groups, but all of this really goes back to our repository. On our timeline, we needed something we could talk about, like, next week, as opposed to waiting. So we're making small incremental changes, and hopefully what we're doing as a back-of-the-napkin approach will line up with what the groups come out and say are the definitive best practices. For now, we're making our little local changes. Okay, I don't know that those groups are going to come up with definitive best practices, but I guess what I would hope is that your experiences can feed into those groups and help them come up with better best practices. I hope so. Okay, thank you. Alicia Hofelich Mohr, University of Minnesota. I just want to say thank you so much for your talk. I have taught statistics classes and run workshops, and I've been looking for data; I've found myself looking on the internet for data even though I curate for a repository. It is hard to find data in our repository. So this was just mind-blowing, what a good idea this is, and I've already written lots of notes to take back.
One question I have, and one challenge I think we'll face, and maybe you have insight on this: if we have a repository that's limited to a highly structured metadata space, are there good ideas for fields you could use, or ways to extend that? Because I think the searchability by time series, by data types, by categorical data is so important. I just wonder if you have a perspective on how to hack the metadata system for that. Yes. One of the ways we're hacking it is that we do have an existing technical details field, which is super broad, and that's where we're telling people: put in your requirements.txt file, all the Python packages you're using. Ideally that's also where we could give them a template and say, this is where you state the number of instances, the number of features. So it's not in a README, which would ideally be more detailed, but it is surfaced in a metadata field, and I think we're going to lean really heavily at the beginning on that technical details field. If that's not in your repository, there may be some sort of notes or other field. I think if we can give depositors a template for what that would look like in certain cases, it will at least surface the information when people are looking for it. And that's kind of a stopgap, because again, a lot of these repositories aren't going to change; our repository isn't going to add a bunch of very specific ML fields. I know that, but I'm hoping we can use the fields we do have and give people a guidebook for what to put in there. And we're at time, actually, so thank you, everyone, and I hope that was useful.