This is really a wonderful opportunity, and I want to thank the organizers and you folks for attending. Consider this very much a companion talk, a follow-on to the talk we just heard from Victoria. I'm going to be focusing on data. We're both very interested in sharing, and in the nature of science and how it gets done currently.

My background is in statistics, so I guess I was a data guy. My job was to design experiments, help scientists organize their thoughts about them, and be involved in the information systems used to collect data. I analyzed data and provided it to others. So today we're going to be talking about improving research data sharing, especially in the context of repositories. In my current work as the VIVO Project Director, I've come more and more in contact with the library community, the way they think, and their role in helping with the problems of data sharing and reuse.

We're going to talk a little bit about where data comes from and what the data processes of science are. What do we mean by sharing, or at least what do I mean? I hope to generate some conversation. What is this data sharing we're talking about? Is the sharing important? Why don't scientists share their data? There I'll have a series of ideas that have all come from my personal experience working with scientists and being one myself: what were we thinking when someone said "data sharing" and we said no? Then the context of data, that is, what it would take to produce data that could be reused. And finally, what could be done to improve the sharing of data; for that last part it's pretty much an open conversation, and we should leave sufficient time to have it.

Let's start with: where do scientific data come from? Most of my career was spent around rooms like this one. For me, this is a data production facility. You might have thought of it as a surgical suite, and you might have been the person on the table there, not particularly thinking about data. But this is where data comes from. There's some kind of setting in which data is the byproduct of what's going on in this room, and whether that's imaging, a designed experiment, medications, a surgical procedure, details about the procedure, or the people who are involved, it's all data production.

As a very junior biostatistician, I was involved in a kidney transplant registry, analyzing the success of kidney grafts and the survival of the patient. I had done my sophisticated analysis as a junior scientist and showed it to the principal investigator. He looked at what I had in my model and said: if you want to know whether the patient is going to live or die, you have to know who did the work. There are good surgeons and not-so-good surgeons. So there's a lot of context about the procedure, the work, and the science that might have to travel with what we thought was the data in order to be successful. That's one example of a context for the production of data.

Here's another one. This is the Large Hadron Collider. It produces data, a lot of data, very quickly, and a very, very different kind of data. These folks are pretty good at cyberinfrastructure, and they're capturing a lot of it.
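Going back to that kidney registry for a moment: here is a minimal sketch, with entirely hypothetical numbers and field names, of the PI's point. The contextual variable (who did the work) sits alongside the clinical variables, and if it doesn't travel with the data, a real source of variation is invisible to any reuser.

```python
import pandas as pd

# Hypothetical registry extract: an outcome, a clinical variable, and the
# contextual variable the PI insisted on: who did the work.
grafts = pd.DataFrame({
    "graft_survived": [1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "recipient_age":  [34, 51, 62, 45, 70, 29, 66, 48, 55, 71],
    "surgeon_id":     ["S1", "S1", "S2", "S1", "S2",
                       "S1", "S2", "S1", "S1", "S2"],
})

# Outcome by operator: drop the surgeon_id column and this pattern
# cannot be recovered from the "data" at all.
print(grafts.groupby("surgeon_id")["graft_survived"].mean())
```

Nothing in a published finding would tell a reuser that the operator mattered unless that context is carried along with the data set.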
Sometimes, when I see a machine like the Large Hadron Collider, I'm thinking about the data that's coming off of it, its volume characteristics, the context of that data.

There's another kind of data I had some experience with over the course of my career: agriculture. Those of us in statistics owe a lot to the agricultural people in terms of designed experiments, hypothesis testing, and the concept of data. I always liked working with agriculturalists. They knew they had to plan their experiment in order to get a good result, and that planning was well worth it, because it was going to take some time to grow the experimental results. So they were very interested in planning, and the best among them were the forestry people; it was going to take a long time to get that data, so they were good at thinking ahead. But their context, and the way they thought about data, was different. Now, of course, they have their own molecular side as well, and they're heading over into environmental science. Over there we have a context that is again quite different, with sensor nets, imagery, orbital imagery, and other kinds of world-scale data coming at us from a lot of different directions.

And a final source of data is us. We're all producing data, apparently every moment of every day now. The social networks, the Google Flu tracker we heard about: examples of data coming from each of us through our personal devices and generating some kind of analysis.

But all of those data sources, those processes, those domains, have some underlying scientific data processes and some underlying scientific concepts. What might they be? Here's a figure that helps me think about what data production and reuse looks like, at least what I hope it looks like. Down at the bottom, you have scientists creating data, in all the ways we just heard about. They're creating data for their own purposes, in some kind of isolation, maybe in a team-science setup, maybe in a large-scale team-science setup like the Large Hadron Collider, but at various scales of isolation or team science. And they're doing that with some kind of limited or local machine-readable semantics; they may not have any of that, they may not know what that is. Some kind of irregular preservation, sufficient for getting the next grant. And in general, a highly variable means of production. Highly variable. They're producing data in a very wide variety of ways.

And now we're asking them to share their data. Apparently that is a thing: there's the White House, and there are papers and conferences. We're supposed to share our data. That seems like a simple idea. But as we just heard for code, and the concepts are very similar for data, there are issues of disclosure and discovery and data use agreements, and maybe some curation requirements on the way to the repository, and maybe some formatting requirements on the way to the repository. And if you're going to share, who are you sharing with? Maybe a library, maybe a research institute, maybe corporations, maybe agencies, federal or state or local agencies of some kind. You're putting the data somewhere that isn't your place. And on the way to that place, things are going to happen. So we want to think about those processes.
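Before we do, it's worth grounding one phrase from that figure: "machine-readable semantics." Here is a minimal sketch, with illustrative field names and a placeholder identifier, of the kind of record a data set could carry on its way into a repository. This is not any particular archive's schema, just the shape of the idea.

```python
import json

# A hypothetical deposit record. Field names are illustrative, not any
# particular repository's schema; the ORCID below is a placeholder.
record = {
    "title": "Kidney graft outcomes registry extract, 2010-2015",
    "creator": "https://orcid.org/0000-0000-0000-0000",  # placeholder
    "license": "CC-BY-4.0",
    "variables": [
        {"name": "graft_survived", "type": "boolean",
         "definition": "graft functioning at 12 months"},
        {"name": "recipient_age", "type": "integer", "unit": "years"},
        {"name": "surgeon_id", "type": "string",
         "definition": "pseudonymous operator identifier"},
    ],
    "provenance": {"protocol": "local SOP v3 (unpublished)",
                   "collected": "2010-01/2015-12"},
}

print(json.dumps(record, indent=2))
```

Even this much, variable definitions, units, license, provenance, is more than a lot of deposited data carries today.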
What are those things that happen? I have the data, I'm using it, it's in my flow, I'm working with it every day. Now we're going to share it, and something is going to happen. What is that something that happens on the way out of the scientific context and into some kind of preservation mechanism that isn't mine as a scientist? That's data sharing. What is all that stuff? On the figure, that's the purple arrows: something is happening between the scientist and the archive.

Now, at the archive, those agencies, libraries, institutes, and corporations have their own worldview, their own incentives, their own way of doing things. They may have some ideas about semantics or the way data should be structured. But in general, they tend to have very strong concepts of preservation. That's what they're there for; they're archivists. They believe in preservation. Whatever you give them, it should stick around. The question is what to give them, and if we gave it to them, how would we get it back out the other end if we wanted to reuse it? The archives in the middle there have very variable access policies and very different ways of doing things, very different procedures. If you're a scientist, you're dealing with maybe dozens, depending on the nature of your research maybe hundreds, of different archives, each with its own formatting, preservation instincts, and policy, and how you would share with one is not the same as how you would share with another. And if you're in the NIH realm, maybe there's a mandatory sharing requirement with some specific repository named in the RFA, and you're going to have to master that in order to do the work, or to be able to accept the grant.

And if you did that, if you produced data and shared it and the archive had it, then you would want to reuse it. You would want to reuse not only your own data but all your colleagues' data. You want to go get it, combine it, and produce some kind of new science with it. You may be a domain scientist or a data scientist or even a citizen scientist looking to get data and put it together for some kind of reuse. And to put it together, you're going to have to get it out of those archives, align it in some way, and make it into a data set. There are real issues there, with access control and file format and semantic alignment and all kinds of other things, before you could eventually reuse the data.

So this is a framework that I hope can help us think about the nature of these processes: the creation of data, the sharing of data, preservation, the preparation for reuse, which unfortunately is still a major, major problem, and then actual reuse.

When we think about the production of data, well, I've been around a while. It used to be kind of simple. You design an experiment, you tell them what to collect, they collect it, they put it in a computer, you add it up, you publish a paper, and it's sort of okay. I was involved in the Clinical and Translational Science Institute at my university and helped conceive of our data practices. And this is sort of what it looks like now.
Now we have a lot of data capabilities. If we're thinking about what it looks like at the present time, data production is coming from a wide variety of sources: the electronic health record, the laboratory, patient-reported outcomes, case report forms in clinical trials. Maybe you bought data from somebody, and then you have to put it all together. So there's data assembly, with all this data coming from different directions: identity matching, ontology alignment (is this the same variable they were talking about over there?), provenance tracking of the data. All kinds of assembly issues.

Then there are all kinds of data validation issues. Is the data right? You're going to look at various patterns in the data and carry out various kinds of data management practices, monitoring, and process management, if you've got flows set up. Those data validation things are quite significant. And all of that implies some kind of data infrastructure, some kind of cyberinfrastructure that enables all of this: your own repositories and registries and archives, indexing processes, data preservation, people working with you, database administration, all of those things.

Then there's data enrichment, trying to fix the data and make it better in some way, usually through aligning it, phenotyping, or deriving variables of various kinds. Phenotyping is where you look at all these kinds of things and decide: this person does have leukemia and that person does not. That turns out not to be an easy problem, so guidelines for phenotyping, algorithms for phenotyping, and validation of phenotyping itself are a very big issue, along with NLP, natural language processing, for processing the data.

Then there's data provisioning. You do have to provide your data to somebody sometimes, and that might be reports, summaries, aggregations of the things you're putting together for your papers, or visualizations. Sometimes there are published data sets, and sometimes there's data sharing. And you have data analytics of various kinds, the traditional role of statisticians, much of which has now been co-opted under the term machine learning.

Cutting across all of these are things your institution may be involved with, reaching in and dealing with data governance, process and policy, security, privacy, and compliance, things that have to be in place in order to handle data, particularly protected health information. Data representation standards, where your research group works with research groups around the world to make sure that what you're talking about and what they're talking about are actually the same thing. And software tools of various kinds, which we just heard a whole talk on: identification, development, assembly, integration, validation, and access control to the tools. So the modern data environment is quite complex. And in there, somewhere, is data sharing.

So what is sharing? I'm going to take a very simple view of it at the start, and then we can drill down. I'm going to say it's providing scientific data that others can use. Because providing scientific data isn't good enough; it has to be provided so that others can use it. And when we think about what we mean when we say that others can use it, therein lie the problems.
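As a concrete footnote to the validation capability above, here is a minimal sketch, with hypothetical fields, of the kind of checks that step implies. Real pipelines do far more, but the shape is the same: range checks, missingness checks, and the duplicate detection that feeds identity matching.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """A minimal validation pass: return human-readable problems found."""
    problems = []
    if df["recipient_age"].isna().any():
        problems.append("missing recipient_age values")
    out_of_range = df["recipient_age"].notna() & ~df["recipient_age"].between(0, 120)
    if out_of_range.any():
        problems.append(f"{out_of_range.sum()} recipient_age value(s) out of range")
    if df.duplicated(subset="patient_id").any():
        problems.append("duplicate patient_id rows (identity matching needed)")
    return problems

records = pd.DataFrame({
    "patient_id":    ["p1", "p2", "p2", "p3"],
    "recipient_age": [34, None, 51, 140],
})
print(validate(records))
# flags the missing value, the out-of-range age, and the duplicate id
```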
So: providing scientific data that others can use. Is it important? We probably wouldn't be here if you didn't think it was important, but it behooves us to re-examine why it's important, so that we can constantly make the case. I have several arguments we'll go through.

First, there's a scientific argument around linking. Science is reductionist by nature. Different groups work on different parts of a problem, and then we need to combine those results across the different parts so we can learn new things. Seems pretty clear. So here's a reuse scenario; it involves BRCA1. The question: find all the faculty members whose genetic work is implicated in breast cancer. That's hard to do, because the people who know about molecular science and which molecules are involved in breast cancer are not the people who are looking at the papers. You have a faculty member; they work with a particular gene; that gene is implicated in breast cancer. Those two facts, that the faculty member works with a particular substance and that the substance is involved in a particular disease, come from two different directions, curated by different people, and they need to be associated in one place in order to answer that kind of query. It's a data linking scenario (a minimal sketch of this kind of join appears at the end of this section).

Here's another one. Data linking is a serious bottleneck for lots of different kinds of work in the pharmaceutical and biotechnology domains. This is the result of a project done in the EU to assemble twenty complete data sources (gene, protein, interaction, pathway, target, disease, patient) into approximately five billion facts that could then be analyzed to determine what actually is implicated in which pathways, which mechanisms, and which diseases. They called it the Large Knowledge Collider. I kind of like that idea. We should have such a thing.

Another scientific argument for sharing is pooling. We want to be able to combine, to pool, data from multiple experiments, so that we can perform what's called a mega-analysis, not a meta-analysis. In a meta-analysis, you look at the findings of various papers and do a particular kind of statistical reasoning over those findings. Here, we want to actually have the data from the different studies, pool that data, and do one large analysis, to increase the power to detect effects and to study things that might have been out of reach of individual researchers. Pooling is not currently common. It's extraordinarily difficult, for all the barriers involved in data sharing, but it's also very difficult, even if you had the data, to determine whether it could be pooled from a scientific or statistical point of view. So sharing sufficient for pooling is extraordinarily rare.

There's another reason for sharing: negative findings. If you're in science, you know negative findings are difficult to publish. That's unfortunate, because they're findings. We want to know everything that happened, not only the things that were positive. This inherent bias in the literature, the negative-findings bias, can be counterbalanced by publishing all data from well-conducted studies. We're only interested in whether you did the work well.
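And to make that BRCA1 linking scenario concrete, here is a minimal sketch in plain Python, with hypothetical names and toy data. In practice this is the kind of query that linked-data infrastructure is built for, but the shape of the problem is just this join:

```python
# Two hypothetical curated sources: one records which genes a faculty
# member works with, the other records which genes are implicated in
# which diseases. The query spans both.
faculty_gene = [
    ("Dr. Alvarez", "BRCA1"),
    ("Dr. Chen",    "TP53"),
    ("Dr. Okafor",  "BRCA1"),
]
gene_disease = [
    ("BRCA1", "breast cancer"),
    ("TP53",  "Li-Fraumeni syndrome"),
]

# Find faculty whose genetic work is implicated in breast cancer.
implicated = {gene for gene, disease in gene_disease
              if disease == "breast cancer"}
answer = sorted({person for person, gene in faculty_gene
                 if gene in implicated})
print(answer)  # ['Dr. Alvarez', 'Dr. Okafor']
```

The point is that neither source alone can answer the question; the join is only possible if both sets of curated facts end up somewhere they can be aligned.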
And to be clear about negative findings: we're not interested in whether you got a significant p-value or not. That's not what's interesting. What's interesting is that you did work and you produced data. Let's have it. We're interested in all data from well-conducted studies, regardless of the findings. There has been some good work done on this; it's the rationale behind the NCI cancer study registry and ClinicalTrials.gov, the US clinical trials registry. Now, whether you could get the data is another issue, but at least you know there was a study going on, and you would know what the findings might be, and that helps you track some things down.

Next, the ethical argument for sharing. As we heard, without sharing of knowledge there would be no advance in science. But then, is it true that the data has to be shared, or just the findings? So there's still an open edge on that argument. And then the public-funding argument for sharing: the taxpayers paid for the products of the research; the data do not belong to the researcher, they belong to the sponsor. Well, that's very unpopular with many of the people who are doing the work. They put their lives into collecting that data, and they believe they exert some ownership over it. Ownership is a tricky, tricky concept. But there is a public-funding argument.

So why don't scientists share their data? I was at this my entire career, and I will say, everything we're about to hear is nothing new. These are arguments that scientists used on me over the course of the work.

First, competitive advantage. It often takes the form of: I'm not finished with the data yet; I'm writing more papers; that's what they pay me to do. If I share now, my competitors may publish or submit grant proposals using my data, so I just can't. I'm a grant-funded researcher: if I don't get the grants, I'm not here. That's competitive advantage. A variation on it is the prisoner's-dilemma argument: if you share and I don't, I have both of our data sets, and if I don't share, you only have your data. It "proves" that sharing is dumb if you're in a competitive situation. We really have to work hard to get past these kinds of things.

Next: it's too difficult. The corollary here is, I'm not paid to share my data. I wrote a grant budget. I spent the money on rats. That's a good thing. And if you want me to spend money on ontologies or data managers or cyberinfrastructure or whatever it is you're talking about, I will spend less money on rats. That's important to keep in mind. Sharing comes with costs, real costs. So when there's a mandate that says, here's your budget, but by the way, share your data, it doesn't make any sense to a researcher. There's real work involved in sharing, and it's going to come from somewhere. Try going back to your dean and saying you need another 20% on your grant so you can share your data; see how that goes.

Then: I don't know how to share my data. That may well be true. The processes for sharing data can be quite complex; those advanced data capabilities are not abnormal. The world is complicated, and it may be genuinely unknown to the scientist how they're supposed to go about this. They may be coming up against requirements they really don't know how to handle.
At least in my world, that's why there was a CTSI: to help researchers with things they might have to do that they didn't know how to do.

Then: no reward for sharing. They told me I have to do it. I checked the box; next grant. This isn't helping me or my career. My career is built on my papers. I don't get anything for sharing my data. It's not on my vita. It's not in my promotion packet. My sponsor does not penalize me for not sharing my data. I've spent some time on the concept of research data management plans required for grants, and look: it doesn't matter whether the RFA says "research data management plan." It matters whether the research data management plan is in the scoring criteria. If it's not in the scoring criteria, it doesn't matter what you wrote, right? I was involved in a $10 million center that had a data sharing requirement. It wasn't in the scoring criteria. Our data management plan said we will do whatever you tell us to do; I think it was approximately three sentences. And we were funded. So sure: it has to be in the scoring criteria, otherwise it's useless.

Then: my sponsor or employer won't let me share. That's true. Depending on the nature of the research, the sponsor, the employer, or the scientist may be prevented from sharing. This has to do with de-identification, personally identifiable information, personal health information, and intellectual property. It can be time-consuming. Some sponsors or employers just say the heck with it: we're not involved with that, it's not important to us, so it's not important to you.

And finally, the argument that is, for me, the most pernicious. All of that seemed pretty serious; here's the one that's the most serious. The scientist comes to me and says: the shared data cannot be reused. Cannot be reused. The context of the shared data is missing. Only the scientists who originally produced the data understand it sufficiently well for meaningful reuse. This is not a new idea, and it is current. I've had principal investigators of major, major groups come to me and tell me exactly this. Cannot be reused. Only we understand the data, and by "we," we mean the people in my lab. Well, you wrote a paper. Yeah, yeah, whatever; the paper is advertising for whatever it is we did. We have a finding, we produced it, I'm glad they like it, we got it published. But if you tried to get our data, even if we provided you with the data, you would be lost. You would have to join our group in order to understand the data. And that's pretty discouraging, because we'd like to share data, and we'd like to reuse it, and here are major scientists telling you things like that.

So what is this context of the data that they're talking about? It is true that the paper can contain some of the context, but as we heard in the case of software, it's very minimal. There's a whole iceberg of information about what that scientific investigation was like, and you're getting 8,000 words, or whatever you're getting. And people understand. Perhaps we outgrew the paper; the context became complex, whether it's software or laboratory procedure. There are some pretty damaging findings around this in the scientific literature, around nomenclature in particular. In the molecular sciences, nomenclature became complex. Biological molecules: you can't just call one methane. It's much more complicated than that.
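A minimal illustration of the identifier point that follows: the records below are hypothetical, though the InChI string shown is the standard one for ethanol. Names drift over time and across communities; a persistent identifier lets a reader outside the discipline tell that two papers are talking about the same substance.

```python
# Two hypothetical papers using different period nomenclature for the
# same small molecule. The InChI string is the standard InChI for ethanol.
paper_a = {"substance": "ethyl alcohol",
           "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"}
paper_b = {"substance": "ethanol",
           "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"}

# Comparing names fails; comparing identifiers succeeds.
print(paper_a["substance"] == paper_b["substance"])  # False
print(paper_a["inchi"] == paper_b["inchi"])          # True
```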
If a molecule has a molecular weight of 34,000, you'd better have an InChI on it. Otherwise, we don't know what molecule you're talking about. And it turns out that within two years, 40% of the molecular literature can't be read, because a person outside the discipline can't tell what molecule is being investigated; the nomenclature in the paper is out of date. There has to be an identifier. If there's no identifier, there's no way to know what they did, no way to know even what molecule they were working on. So the context is incredibly important. And we know that a lot of scientific literature is moving very rapidly, and we're only interested in papers that were published very recently, because that's where the interesting findings are, but they're also the only papers we can read.

The recent concern about the reproducibility of science, which Victoria is the expert on, demonstrates some of the limitations of the paper for providing adequate context. People who have attempted to read papers and reproduce the work find there just isn't enough in the paper to tell them how to go about it.

If you work in clinical science, or maybe laboratory science, maybe there's a protocol, a laboratory protocol or a clinical protocol, and it contains some of the context. Maybe it's supposed to contain all the context, but there was probably a bunch of assumptions in there about what was actually being done that were shorthanded in the protocol; certainly shorthanded, in my experience, in laboratory protocols. Clinical protocols are a little more elaborate. But the protocol may not be complete, and protocols often use local conventions that are sufficient for directing the work that's going to be done but may not provide enough context to others. Just think about your protocol being attempted in a different country: is everything the person needs to know in that protocol? Protocols may lack the specificity required for reuse. And obviously, protocols are not often published, so what we got to see about the work was a small piece of it, not what was actually done. The methods section is very short; the protocol is long, and even it might not be sufficient.

And then we could go into a bit of cyberinfrastructure envisioning and say: even if the protocol were published, it's not machine-readable, so I couldn't actually use it in any reasonable way. I would have to recreate it in some kind of structure that would let me do something with it at scale. And if it were machine-readable, would it actually be linkable back to the collected data? One could imagine a world in which the actual collected data is fully specified by linkages back up into a protocol description that is completely sufficient for representing what was done. We're a long way from there. So the context problem is an elaborate and deep problem, I believe.

And with that, we have plenty of time for conversation and questions. The overarching question is: what can be done to improve scientific data sharing? We have these important notions of sharing data across scientific contexts. What can we do to improve our ability to share data so others can use it? I want to leave us with this figure as a prompt for our discussion. Thank you.