Perfect. Okay. Hi everybody. So I'm going to talk about identifying static data, finding data that has been shared that is associated with researchers at my academic institution. I realized when I was looking at these slides just now that I did not actually identify the institution, but I am the manager of research and instruction at Lane Medical Library, which is part of Stanford, specifically the School of Medicine. Okay, so who cares? Why am I interested in research data underlying publications? Well, there are lots of reasons to be interested in this kind of data: reproducibility, where you're trying to verify the results; reuse, where you maybe throw some algorithms at it and see what happens, test new methods, or build new hypotheses; or you could just be trying to learn, navigating through the data to figure out what the researchers have done. And if you've ever tried to go out and find this data, you've probably noticed that it is not particularly straightforward. So this slide here is me having a little bit of fun with some of the responses that you might get if you go look for data underlying a scholarly publication. Anything from "here's the DOI or the accession number, here's exactly where you can go download it, it's publicly available, go get it from here," to "we can maybe send it to you, but it's really big," or "it's on hard drives that are in another country," or "it's sensitive information, but we might email it to you kind of on the down-low," all the way up to "no, it's not available at all." So it's a big challenge. The reason I really care about this, in addition to these issues of reproducibility and reuse, is that one of the hats I wear at the library is helping researchers manage and share their data.
So just to define my terms here a little bit: what I mean by data in this context is literally anything that is required to evaluate, reproduce, or build upon whatever analyses or conclusions they're talking about in a paper. Of course, there's lots of research data that's not described in papers, but I'm focusing today on publication-related data. And then for management and sharing: data management is a very jargon-heavy term, as John knows and as I've talked to him about, but it's activities related to data storage, data organization, documentation, and communication. Basically, not only can you find this data, can you actually use this data, is it organized, that kind of thing. And then data sharing in this context is the release of data for use by others. Very frequently, and often at this conference, we're talking about open data sharing, where data is made available to everyone. But in the context of an academic medical center, data sharing is often a lot more restricted. So it's sharing under certain conditions, or to certain people, or for certain reasons. And as I'm helping researchers with this, I often have lots of questions that range from the very practical and often urgent to the existential. So practical questions like: who is actually sharing their data? What kind of data are they sharing? How are they sharing it? And then getting into the more existential questions, I guess: do they know that I exist? Do they know that I am here as a resource for them? And I think anyone who's ever worked in a setting like the one I work in might have come across the idea that researchers have lots of resources available to them. One of the major obstacles is getting them to actually know that those resources are available and to take advantage of them. So it's a big universe out there.
So I promised my four-year-old that I would put some of his artwork in this presentation. And so it's a big universe out there, and as a data librarian I'm really trying to find needles in very large haystacks. One way I've tried to answer the questions that I posed earlier is by looking beyond Stanford University, at PubMed Central. Stanford has lots and lots of information about what its researchers are doing, publication-related information. But information about research data is a lot more ambiguous. If it's captured at all, it's scattered across multiple databases in a way that makes it hard for me to bring together and easily access. So for those who aren't familiar, PubMed Central is a repository containing about 7 million full-text research articles from the biomedical and life sciences. It is distinct from PubMed, the search engine that any of the medical librarians who are tuning in probably visit multiple times a day, and many of the biomedical researchers as well. Why it's useful to me for this purpose is that it contains articles covered by the NIH public access policy, so if your work is funded by the NIH, you have to deposit articles arising from that work in this repository, as well as the contents of participating journals, so a lot of open access journals have their articles archived in this database. And what makes it most useful is that it has filters to locate articles with associated data sets, including those with data availability statements. A data availability statement is a simple statement that tells readers where the data associated with the work can be found. It can contain direct links to publicly available data sets or have some description of conditions for accessing more restricted data sets.
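As a rough illustration of the kind of filtered PubMed Central search described here, the sketch below builds a query URL for the NCBI E-utilities `esearch` endpoint. The talk does not give the actual search string, so the term is an assumption: the `[Affiliation]` field tag and the `"has data avail"[filter]` filter are my best guess at how such a search could be expressed, and the function name is hypothetical. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pmc_search_url(affiliation: str, year: int, retmax: int = 100) -> str:
    """Build an esearch URL for PMC articles with data availability statements.

    Assumption: the search term below is illustrative only. The field tag
    [Affiliation] and the filter "has data avail"[filter] are not taken from
    the talk and should be checked against the PMC search help.
    """
    term = f'{affiliation}[Affiliation] AND "has data avail"[filter]'
    params = {
        "db": "pmc",           # search PubMed Central rather than PubMed
        "term": term,
        "datetype": "pdat",    # restrict by publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmode": "json",
        "retmax": str(retmax),
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # One URL per publication year, mirroring the idea of one search per year.
    for year in (2020, 2021):
        print(build_pmc_search_url("Stanford", year))
```

Keeping one query per publication year makes it easy to rerun each search on a schedule and compare result counts over time.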
As I will show with examples, this information can sometimes be very useful, and sometimes it is so ambiguous that it is not useful at all. But it is a good, or at least visible, source of information about what kind of data is available, and what kind of data is available through non-repository means. Okay, so these are some lightly edited examples of data availability statements. I should say that at this point I'm not trying to single out any individual researcher for statements that they may or may not have included in their papers, because I think there's a lot of confusion about how to do these properly. This is just to give you an idea of what's out there. So sometimes you'll have a sentence that says all the data is in this repository, here's the link. You'll see things that say the data used is available upon reasonable request, whatever that means, but that's the statement that's in there. Or even things that just say we can't make the data publicly available, to protect patient confidentiality. Or the most straightforward statement that I've ever seen, which is: this manuscript has no associated data. But they can get more complicated. So the first statement on this slide essentially says data is not available but is available. Data cannot be made available because of privacy reasons, a very common reason cited in statements coming from medical centers, but is available from the corresponding author upon request. So it's not available, but if you ask us for it we'll email it to you, and I try not to think too much about that. Sometimes the data is scattered across many different repositories and many different mechanisms of access. So you'll have data in one repository, code on GitHub, and other information that is available on request. And then I've highlighted in red some very lightly edited statements where it is really ambiguous what is going on.
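To make the variety of these statements concrete, here is a minimal sketch of how statements like the ones above could be tagged with access mechanisms. It is not the coding scheme used in the talk, where the coding is done by hand. The keyword rules, the label names, and the example statements (paraphrased from the slide descriptions) are all assumptions for illustration. Because one statement can name several mechanisms, each statement gets a set of labels, and the per-label shares can sum to more than 100 percent.

```python
import re
from collections import Counter

# Hypothetical keyword rules, one per access mechanism. Real statements are
# far messier than these patterns can handle.
RULES = {
    "repository": re.compile(r"repositor|github|accession|deposited", re.I),
    "request": re.compile(r"(up)?on (reasonable )?request|from the corresponding author", re.I),
    "in_article": re.compile(r"supplement|in the (manuscript|article|paper)|source data", re.I),
    "restricted": re.compile(r"cannot be made|can't make|confidentiality|privacy", re.I),
    "no_data": re.compile(r"no associated data", re.I),
}


def code_statement(statement: str) -> set:
    """Return every mechanism label whose pattern matches the statement."""
    labels = {label for label, pattern in RULES.items() if pattern.search(statement)}
    return labels or {"unclear"}


def label_shares(statements: list) -> dict:
    """Share of statements carrying each label (multi-label, so shares overlap)."""
    counts = Counter()
    for statement in statements:
        counts.update(code_statement(statement))
    return {label: count / len(statements) for label, count in counts.items()}


# Paraphrased examples, not verbatim quotes from any paper.
EXAMPLES = [
    "All the data is in this repository.",
    "The data used is available upon reasonable request.",
    "We can't make the data publicly available to protect patient confidentiality.",
    "This manuscript has no associated data.",
    "Data cannot be made available because of privacy reasons but is available "
    "from the corresponding author upon request.",
    "Data are in one repository, code is on GitHub, and other information is "
    "available on request.",
]

if __name__ == "__main__":
    for statement in EXAMPLES:
        print(sorted(code_statement(statement)), "<-", statement)
    print(label_shares(EXAMPLES))
```

A statement that matches no rule falls back to `unclear`, which is where the genuinely ambiguous statements would end up for manual review.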
So "most of the raw data are presented in the manuscript or supplement," which begs the question, what isn't? And then "some study data is available." This actually is the only one that is a verbatim copy from an article, and I'm really just curious how they are defining every single word in that sentence. Okay, so that's the data source, and you can see that it's messy. We are relying on self-report from researchers about where their data is. There are also some other limits. There's incomplete coverage here: there are about 10 times as many abstracts of papers from Stanford authors in PubMed as there are papers from Stanford authors in PubMed Central with data availability statements. So we're looking at a small, hopefully representative sample, a very small segment of a broader group of papers. I'm certainly not systematically looking at data across every single paper coming out of my university. The statements are unstructured and may not contain all relevant information. I was having a conversation last week and someone said, wouldn't it be easier just to look at data citations? You could create algorithms to crawl that and find that information. And that would indeed be easier, but we get a lot of information from these unstructured statements as well. And, just based on my experience teaching data management and doing research about data management, even the data that is stated to be available in these statements may not actually be available if you go and look for it in the repository, or if it is there, it might not actually be useful. So lots and lots of caveats to this method that I'm about to describe, but here it is.
So I have two regularly updated PubMed Central searches running in the background all of the time, and every week I get new results, which I will describe how I code in just a second. There are 1422 results as of Monday from a search looking at papers published in 2020, and 540 published in 2021 so far. And so this is the workflow from there. For the 2020 papers, which I will mostly be focusing on, I extract the data availability statements from the search, code the affiliation of Stanford corresponding authors, code for how the data is being made available, and code where it's being made available if it's in a repository. For the 2021 papers I'm going to do that same thing, but currently what I'm doing is coding the department, and then if they're a person affiliated with the School of Medicine, I send them an email, which I will show right here, which is my way of answering the "do they know that I exist" kind of question. So I'm sending out emails to people who have made data available, or at least data that is not unavailable, through this data availability statement. It is just an introduction to me, the services that my library offers, and that kind of thing. I send these out every Monday. Sometimes it's crickets; sometimes, as of this week, I get some immediate responses saying "I didn't know that you were here, didn't know that we had access to these things," that kind of stuff. Okay, so for the 2020 corpus of papers, I have gone through and coded 1200 of them so far, and for about 92 percent the data is not unavailable, so there is some mechanism, according to the data availability statement, for accessing the data. Often it is researchers making their own data available, but I have also gone through and coded where they have stated that the data is public data that they're reusing. And then, of those 1094 papers where the data is not unavailable, where I can't
figure it, or where I can figure out that there is some mechanism for getting the data, this is how it breaks down. You'll notice that these numbers do not add up: there are a lot of data availability statements that say some of the data is available through this mechanism, some is available through that mechanism, and for the rest of it, just ask us. So a little under half include statements saying that requests can be made to the authors. Some other people say that requests can be made to other parties, usually the IRB or some sort of data access committee. About a third say that in some way data is being made available through the article itself, either as supplementary material, as source data underlying figures, or just a statement that says data is available in the article, and the ambiguity around that is something that I'm wrestling with how to deal with. But about half of these statements say that some of the data is available in a repository. And just to speak on request to author and supplement: this is an hour workshop that I give about data sharing condensed into a single slide, and this is also in the text of that email that I showed. Data sharing upon request is not always a great option, because contact information can change as researchers move, and so getting hold of a person to actually give you the data becomes more challenging over time. There's some pretty good research on this. And sharing data as supplementary material is better than not sharing it at all, I guess, but those materials can be easily separated from the article, and it's hard to get citation information and track information about the data sets. So generally we encourage folks to put things in a repository. And this is the breakdown of repositories where things are. Because of the long tail here, I have included only repositories that have more than one data set in them from my sample. And so a couple of
notable things. GitHub and websites are very much represented, probably because I'm including software related to data in my definition of data, but there are also lots of people who say our data set is available on GitHub. And "website" literally ranges from somebody's personal website to a university website to a collaboration website, basically any kind of website or portal where I can't figure out if there's preservation or a repository underneath it. So just a couple of things to pull out of this. We have a good representation of general purpose repositories; each of them on the left side is about five percent of the data, versus a little bit less. But far and away the most common data repository is an NCBI repository, Gene Expression Omnibus. So I think maybe a fifth to a fourth of the total research data shared out of Stanford, as detected through this mechanism, is through one discipline-specific or data-type-specific repository, which is interesting. So what is the conclusion of all this? The universe is a little bit clearer for me, so this is a slightly less messy universe drawing. I have a somewhat clear idea of where folks are sharing data. I have a list of faculty that I have emailed and a list of faculty who are sharing data through certain mechanisms that I can reach out to. We now have a data source for making internal decisions about supporting different kinds of repositories and spinning up training related to different kinds of repositories. And it's just a really interesting way of also looking at data sharing over time. So we're looking at two years. I will continue doing this, and I have it on the agenda to look at previous years, but so far it's already yielded some very interesting results in terms of how people are sharing, what they are sharing, and when they are sharing. And some of the single use repositories
are things like, you know, Google Drive, so things that, as a data librarian, I'm really eager to reach out and tell people to stop doing as well. Okay, so that is my presentation. Here's my contact information if you have questions. Of course, I will be on Slack afterwards, and I think there is time for questions. Yeah, thank you, John. Many people enjoyed the graphic of the excuses and also enjoyed August's drawing, so my family. So one question was really about the stage at which you're doing the analysis, whether or not you could go further upstream. Are there places in the process, like preprints or something else, where you could get a little bit further upstream to help researchers? Yeah, so this is definitely a lagging indicator, right? This is way past the time that I would really want to be involved. So when I do outreach to people, it's like, please keep me in mind for next time, or let's talk about your next project. Preprints, you know, we've started to get some preprints into this data set through the NIH preprint project. They're mostly COVID related, and that's been a little bit helpful, but I still think that's too late; they've already analyzed the data. I think a completely different approach is needed to get in earlier, and that's more traditional librarian outreach, like, come let me talk to your lab about data management and sharing throughout the whole process, rather than here at the end. At this point it's mostly just information gathering about what folks are doing or have done.