Welcome to session two of our Open Science Symposium. My name is Hannah Gunderman. I'm a research data management consultant with the CMU Libraries, and I'm very excited to introduce our speakers. Our first talk is from Dr. Lenny Teytelman, the CEO and co-founder of protocols.io, and his talk is about how sharing more openly and reproducibly does not have to be a burden.

Thank you, Hannah. And is there a timer? I know there's a two-minute warning. Okay, 15 minutes. Thank you very much, CMU; I'm enjoying the day and absolutely love the speaker lineup. I'll start as I usually start when I talk about protocols.io, but then I'll veer into more of a discussion around this: when we talk about reproducibility, when we talk about saving time for researchers, there are two sides. One side is the people reading your paper and the results of your work, and the other side is the people doing the sharing.

So let's start with the people who are reading; that's how I usually start the talk on protocols. Here's a postdoc at UC Riverside who tweeted: I'm looking for a protocol in a '97 paper, "as described in '96"; finds the '96 paper, "as described in '87"; finds the '87 paper, and even UC Riverside doesn't have a subscription to it, it's too old. So it is a common and rather frustrating experience. For those of you who do research: how many in the past year have read a paper that had a version of something like this? Just raise your hand. Yeah, so pretty much all hands go up when I ask this. And it's not just in biology. Here's a physicist who says he keeps reading "devices were fabricated as previously described," previously described, previously described, until the original reference just says "devices were fabricated with conventional methods," right? I actually bookmark these as I come across them every couple of weeks. I could spend 30 minutes just going over all of those, so I won't. But obviously it's a very common thing, and we started the method repository protocols.io to make this better; we had the idea in 2012, launched it in 2014, and it's open access. And obviously this is something that is much needed, given how many hands just went up. So we do want to improve this.

But what I want to focus on for most of this talk is that this is not enough, right? It's a good mission: we want the papers that we're reading to have a link to the details if we're following up on the work, and obviously not just the methods but data, code, reagents. That is helpful for us as readers. But I think a big part of the problem, and we already talked about the incentives in the previous session in the panel discussion, is what happens to the person who is sharing, right? While we all want to read more reproducible papers, it actually does take time to be more rigorous, more reproducible, and to share well. And it's not necessarily an insignificant amount of time. Just throwing code up on GitHub without comments is not necessarily code that is helpful to other people, right? If you want to share well and enable others, it will take time, and time is something that is in short supply in research.

So in 2015, there was a paper, first on bioRxiv, appropriately, and then in PNAS, called Accelerating Scientific Publication in Biology.
This is from Ron Vale at UCSF, a professor who's a huge advocate for preprints and their adoption in the biomedical sciences. He took a look at how big our papers are now compared to the 1980s, how complicated they are, and how long it takes UCSF students to publish their first paper. In the abstract, he highlights that more experimental data are now required for publication, and the average time required for graduate students to publish their first paper has increased and is approaching the desirable duration of PhD training; because publication is generally a requirement for career progression, schemes to reduce the time of graduate students in postdoctoral training may be difficult to implement without also considering new mechanisms for accelerating communication of their work.

It's a really nice paper; just a few of the figures. This is a comparison for Cell, Nature, and JBC between the 1980s and 2012-2014. In red, you can see the increase in Cell papers in the number of data points. And it's not just, oh, we now have genomics and genomics means more data. No, it's really different experiments that are being counted. The number of panels in the figures has grown, the figures are becoming more complicated, and the number of experiments in the figures has increased dramatically. And this is not just Cell and Nature; it's the same thing at JBC. Then, looking across the different journals, there's the question of how long it takes to publish. This is, for UCSF graduate students, the time to the first-author paper. In the 1980s, it was under five years, I think 4.7 years on average. In 2012 to 2014, it's now on average six years for the first-author paper to appear, which is the upper boundary of how long UCSF wants a PhD to take, right? So over this time period, the average time to your first publication has increased by 1.3 years. The rest of the paper is an argument for adopting bioRxiv, which I think is a great thing to do, but I want us to keep this paper in mind as we talk about reproducibility incentives, how we expect people to share and adopt better practices, and who is doing the adopting.

Here is a more recent paper in PLOS Biology that I reviewed, called Open Science Challenges, Benefits and Tips in Early Career and Beyond. The abstract starts with: the move towards open science is a consequence of seemingly pervasive failures to replicate previous research. This transition comes with great benefits, but also significant challenges that are likely to affect those who carry out the research, usually early career researchers. So if we think about better sharing, I think this is spot on. It's usually not the professor who is fiddling around with SRA submissions, right? It's usually not the professor who's putting the code on GitHub or typing the methods into protocols.io; it's the students and postdocs most of the time. And if we think back to the PNAS paper, the increasing time to publish, the increasing number of things we're asking trainees to do, we have to be really mindful: not only are we asking them to publish with more things, we're also asking them to do a better job of sharing when they publish. And if we are not careful, if we only lean on requirements, compliance, and checklists from publication or funder mandates, then I think we risk asking those students and postdocs to do even more, right?
And I'm not saying that it's not important. I think journals, exactly as the Public Library of Science does with their data sharing requirements, should be requiring data. I'm not arguing for lower quality control; I don't think we should necessarily relax the expectations. But we also need to think: how do we help the scientists? How do we help the trainees, and not just keep piling more and more burden on them? And it's important to do this in parallel with all the efforts of funders, not just the NIH but many others, who are thinking about what to expect from grantees, particularly because the grants are signed by faculty, right? They agreed to all the data sharing policies, but it's, again, the students and postdocs who have to do that sharing.

And this is a challenge for us at protocols.io too. There are now over 500 journals that have added us to their author guidelines, which is fantastic and helps a lot with recognition and adoption of protocols.io. But as we think about it, this is too late, right? If you have your protocol in a Microsoft Word document and now you're submitting a paper, you have to do yet another thing at submission, and submission already takes hours for a typical journal, right? Not only have you now taken five, six years to do the research, and you're taking a really long time preparing it for submission; now we're asking you to do more things. Take it from Microsoft Word, register on protocols.io, import the document from Word to protocols.io, and then add a link to it from the paper. It might take you only 15 minutes or an hour, but it's still yet another requirement. And I think that's a challenge not just for protocols.io but for all of the tools, all of the resources, all of the repositories: how we're tracking the metadata, how easy we make it to submit the data sets, right?

So for years now, we've been focused internally not just on the mission to improve reproducibility, but on the scientist. What guides us at protocols.io is that from the moment you sign up, long before you're thinking about publishing, we have to be a tool that actually makes your life easier. So we keep thinking about the interface, we keep thinking about the functionality, we've built mobile apps, iOS and Android, and sort of made a Dropbox for protocols, right? A place where you can organize all of your lab's research protocols and collaborate in your lab, all privately, long before you're sharing publicly. And that's the side that we hope for, and we already see it: people who use protocols.io use it for years before they're ready to publish, right? We have almost 20,000 private protocols and 6,000 public ones, and every month we get about 1,000 new protocols; of those, two or three hundred will be public and the other 700-800 are private. We've been sort of obsessed with this idea of how to make it useful right away.

And what I love about many different initiatives in this space is that there are lots of good examples of efforts aimed at improving reproducibility and encouraging open science that are not additional burdens on the scientists, right? So I have some shout-outs; I realized last night as I was preparing this that some of the people I was going to shout out to are actually talking here. So, for example, at the bottom: Jupyter notebooks, right?
You'll hear more about them today, but that's not something we're asking you to do at publication, five years later, right? That's something that helps you as you're doing the work, as you're developing the code and writing your scripts and doing analysis and collaborating, and then it makes it easier to share if you combine it with GitHub or Binder, right? So that's a benefit to the researcher, and that's why these things get adopted.

And it's not just about tools. The Public Library of Science recently announced a new initiative on opening up peer review: you opt in as the author to whether you want your reviews to be public or not, and then reviewers decide whether they will sign them or not. This is an example where we're doing the reviews anyway; we're not asking for extra work, right? But we're now giving an option of making those open, which helps with open science. There are lots of reasons why we think open peer review could be a good thing, and it's done in a smart way: if you're not comfortable with it, if you want it to be anonymous, you can do it that way. But it's yet another example.

The Carpentries; I think Carly's talking after me. Fred Hutch is one of the supporting members of the Carpentries, and Carnegie Mellon is one of the supporting members too. That's an effort that teaches people how to do better data science and better coding if you're a biologist and don't know computer science or bioinformatics. That's empowering, that saves your time, that teaches you to be a better scientist, right? And in quick workshops, without waiting for that publication moment to ask you to learn more statistics or do more rigorous work.

Another great example: Addgene. Actually, I didn't know they're sponsoring this; there are Addgene stickers outside. That's a repository: you share your plasmids there, and it's not just good for everybody who needs your reagents. When people request them, they're not requesting from you, so you're not working as a mini UPS center if your plasmid is really needed by many people; Addgene handles it for you. So you as a scientist benefit. Those are the types of things. You'll hear a lot more about PREreview: again, we're all doing journal clubs anyway, and what Daniela is doing, she'll tell you about, but it's not extra work, it's making that work more constructive. These, I think, are wonderful examples.

And in the last minute, I also wanna do a shout-out to a volunteer effort that I helped to co-organize together with eLife Ambassadors, Addgene, and Code Ocean. We launched reproducibility workshops at many conferences, 90-minute workshops, and we call it Reproducibility for Everyone, meaning not just people who are publishing but also people who are reading, so including you. What we always stress in the tools and resources we highlight there is: you are your most likely collaborator in six months. So the tools we like to highlight are the ones that will save you time as the author, as the researcher. All of that is open; the handout link is there, and this will be shared on OSF.

And in the last 10 seconds, I just wanna encourage people who build tools and resources to keep thinking about the interface, keep thinking about the increasing time to publish and the increasing burdens on scientists, and what we can do as resource providers and tool makers to make the scientists' jobs easier. Funders: I see more funders starting to play a role. It's important to invest in infrastructure support and also in training.
So the Data Carpentries just got a grant, announced I think yesterday, from the Moore Foundation and the Chan Zuckerberg Initiative. Universities and librarians: Carnegie Mellon gets an extra shout-out as the first university to sign up for protocols.io, so you can have unlimited private use if you're at CMU, and you have amazing librarians here who usually don't get the credit they deserve for the workshops, the training, and the support of researchers. But we need more of it from all of the players, and not just an expectation that the scientists should do better. And with that, I'll wrap up. I didn't do any demo of the platform itself; I'm here tomorrow from one to four, so if you have questions about protocols.io or want a demo, I'll be in this building, so drop by. And the reproducibility workshop here? Hopefully. Right? We hope. So stay tuned.

Okay, I'd like to now introduce Dr. Carly Strasser. She's the Director of Alliances and Data Strategy at the Fred Hutchinson Cancer Research Center in Seattle, Washington, and she is going to be talking about where open science is in cancer research.

We're keeping this on time. So thanks, everybody, for being here. I heard a lot of good things about this symposium last year, so it's really nice to be here this year. If you haven't heard of Fred Hutch, you're probably not alone, but the Fred Hutch Cancer Research Center is located in beautiful Seattle. Look, it's one of those two sunny days that we get. It's on South Lake Union, so it's a really good little spot with lots of nice views. And just to give you an idea of what we do there: we're an independent research institution, part of a larger cancer care consortium in the Seattle area. We have about 2,000 people who work there and 238 faculty members, and we're most famous for pioneering bone marrow transplants. Although we are near the University of Washington and have affiliations with them, we are separate. And we actually don't have a clinic; we have a lot of researchers who work in the cancer space, and we partner with the Seattle Cancer Care Alliance, which does a lot of the clinical work that happens in the area.

So this is what I know: I got a PhD in oceanography, and I know about clams and copepods and whales and things. It's a pretty long haul from there to the cancer space, and it was a very winding journey. It started with an organization called DataONE, a National Science Foundation project around sharing ecological and environmental data. I also worked briefly at CDL, the California Digital Library, developing software and tools and doing outreach around open science for researchers in the UC system. And then I worked at the Gordon and Betty Moore Foundation, advocating for data science at academic institutions. All of these things led me to the Hutch, where I was hired by the chief data officer, who started the Hutch Data Commonwealth. This organization is an internal part of the Hutch with about 50 people, and we build data-intensive research capabilities through software and data engineering, training, and partnering. Alongside the Hutch Data Commonwealth, there are a couple of other groups at the Hutch that work with us, trying to figure out how to help researchers with their data-intensive research. There's a newly minted Translational Data Science Integrated Research Center.
And then there's also a Bioinformatics and Data Science Cooperative, which is more of a community group that tries to pull together folks who are working on similar projects or trying to use similar types of data.

So I like to explain what I think about, and what these groups think about, in terms of data, or at least, in this case, made-up data. This is a graph with the number of researchers against data skills and experience on the bottom, and you can imagine this is kind of a normal distribution right now. Somewhere in the middle is your average researcher, who might have some skill sets but isn't necessarily going to be a super rock star in the space. You could imagine, say, Casey Greene being all the way over on the right, and that really old crusty clinician who was trained back in the 60s is going to be way over here on the left. What we really want to do is push, right? We want this curve to move over to the other side. And the way I think about doing that is through training, consulting, and then collaborations and partnerships. You can also think about this in terms of the number of projects that use data and that focus on data-intensive research to really make interesting discoveries. This is also kind of a normal distribution, and I'd like to think we're gonna keep pushing it over to the right: bigger data sets, more data, being able to integrate across lots of different types of data. And that's really gonna require things like public databases, data sharing, and, again, collaborations and partnerships. So that's where I see the role of open science in the field I work in. I think of open science as a bit of a panacea, and I think data science and open science have a huge amount of overlap. Being able to really effectively use data means that you need to learn from your colleagues how they did their methods, which means they need to do things in the open; they need to be able to share their methods with you, and they need to be able to share the code and the data behind the publications they're creating. So they all interact quite a bit in this space.

So at the Fred Hutch, when I arrived a little over a year ago now, I was curious: okay, this is a cancer research center, and I am totally new to the biomedical space. Where's the open access policy? The data sharing policies? Do we have a repository for data? Are there other open science tools that we advocate for at this organization? And the answer is no, we don't have any of those things. This is not something that really seems to be a major part of the culture at the Hutch, and I would suspect it's similar at other cancer centers.

So, to talk a little bit about what this problem looks like in practice. For sharing publications: I was doing some research and I found this really interesting article, "Open access takes root at the NCI," but alas, I could not access it. You gotta love the irony of that. A lot of people in the cancer space in particular think that the NIH requirements are sufficient. If we just use NIH as our bar, then we know that the NIH public access policy mandates that things be archived in PubMed Central no later than 12 months after publication. This allows people to publish in journals like Cell that don't let you share results openly until 12 months after publication in their journal.
And so when I got to Fred Hutch, I started looking into this, and I actually got an intern from the iSchool over at the University of Washington to help me look through what was going on in the publication space at the Hutch. If you weren't aware, there are 71 NCI-designated cancer centers in the US. Of those, 50 are considered comprehensive cancer centers, which means they have both research and clinical work. Twelve of those 50 have open access policies, primarily through academic affiliations: if they're associated with a university, the university has a policy, and the faculty are therefore held to that same policy at the cancer center. Eight of those 50 are independent cancer centers, and there are no OA policies among those eight. And this just stunned me; I was really surprised that there wasn't more of an effort being made in this space.

This is some of the data that the intern, Dana West, collected. These are the five divisions we have at the Hutch: public health, clinical research, vaccine and infectious disease, human biology, and basic sciences. You can see that there are lots of things published in non-OA venues. In fact, 47% of all Fred Hutch publications for fiscal year 2018 were not open for 12 months. Some people might say that's not that big of a deal; maybe 12 months is fine. Maybe some of these were in preprints somewhere, although we looked and we couldn't find preprints for that 47%. But think about it in terms of the fact that our president, Gary Gilliland, has declared that we're going to beat cancer by 2025, which is a really grand goal, and it's a really good one. That's only 1,882 days away. So 12 months of not having access to publications is a big deal. It makes a huge difference in terms of what we can really expect from researchers who maybe aren't working at Fred Hutch and don't have access to publications, or who, like me, are working off campus and can't get to the publication because the logins aren't set up correctly.

So that's publications; let's talk about data sharing. I think we can all agree that sharing cancer data can save lives; there's lots of good evidence for this. The GA4GH group is talking a lot about worldwide sharing of cancer data, recognizing that the disease knows no borders and we're not going to be able to do this unless we all work together. But there are lots and lots of barriers to sharing data, and we've heard some of these talked about today already. Here's a really great article in the Lancet, "Challenges of data sharing: valuable but costly?" I wouldn't know, because I can't access it. So, yet more evidence that we have this problem of open access in this space.

Another argument I hear is that the NIH requirements are sufficient. Looking through the NIH requirements: the NIH will expect investigators supported by its funding to make their research available for subsequent analyses, a data sharing plan is required for grants over $500,000, and certain institutes have requirements around sharing certain types of data. I'm not gonna go too much into this, because approximately two hours ago Lisa spoke about the NIH requirements around data sharing, and she's way more of an expert on that than I am. But I would argue that maybe those aren't sufficient, and that we should really be thinking bigger, especially when we start thinking about trying to cure cancer.
And in general, reproducibility requires all of the data. Back in my ecology days, I did a study of what types of data were being shared by different organizations and individuals, and what I found was that none of the data was being shared except for genetic sequences, because that was the one type that had a repository and the one that had a requirement; all the other data associated with these publications could not be found anywhere. I think the same is true for cancer data: there are things that get analyzed alongside the data that goes into the more specific repositories mandated by NIH, but we don't know, because we don't really have a lot of information about that.

There are also lots of interoperability challenges. I'm not gonna belabor this point: if anybody's ever tried to combine two data sets from two different organizations or individuals, you know that it's virtually impossible. I have hired a person who tries to do this, and she has spent the last two months trying to combine two data sets on ovarian cancer. It's a really hard thing to do.

There are lots of silos; we've heard a little about some of them today. In the Harvard Business Review, there was a statement that we need massive data sets for precision medicine to actually be able to cure cancer, and that no single cancer center has enough data for its researchers to gain the insights they need. We saw a really good example of Casey making use of lots of data from lots of different groups, but that's a pretty rare instance of being able to use all those different types of data to make analyses work. And these silos are not just institutional silos; they're also silos around data types and around disease types.

There's also this issue of privacy versus data sharing, which is unique to biomedical data. There was a really good piece about fighting cancer by sharing data at the South Korean National Cancer Center, which is putting data sharing at the very front, and one of the quotes from that article was that one of the biggest challenges is South Korea's data protection legislation: it's one of the most restrictive, driven by an individual's right to privacy, but it prevents the spread of information that could ultimately save lives. So this is a real tension, and it's something that we can't overlook or pretend doesn't exist, but it's also really, really hard to solve.

Potential profits are another issue. These are articles from GeekWire, which is our local tech-nerdy Seattle publication. On the top here, you've got Adaptive Biotechnologies, which just had their IPO and did quite well, and then Juno Therapeutics down here on the right, which was sold for $9 billion recently. Both of these companies are spinoffs from the Hutch, and as a researcher working in this space, you see these colleagues, right? This guy here, holding his baby: he was the head of the computational biology group at the Hutch. He's now, of course, moved on; he doesn't work at the Hutch anymore. But you see a colleague who's becoming a billionaire, and you have to imagine: oh, well, maybe if I keep my research slightly secret, if I think about how I wanna share it in a way that I can actually profit off of it potentially. That does create problems and restrictions. Not to say we shouldn't find ways for researchers to take advantage of the intellectual property they're creating, but it does make it difficult.
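To give a flavor of Carly's interoperability point, here is a toy sketch in Python with pandas. Everything in it, the column names, codings, and values, is invented for illustration; real harmonization spans hundreds of fields and far messier mismatches:

```python
# Toy illustration of why combining two organizations' data sets is hard:
# the same concepts arrive under different names, units, and codings.
import pandas as pd

site_a = pd.DataFrame({
    "patient_id": ["A-001", "A-002"],
    "age": [61, 58],                 # age in years
    "stage": ["IIIC", "IV"],         # stage coded as Roman numerals
})

site_b = pd.DataFrame({
    "subj": ["b17", "b18"],
    "age_months": [744, 812],        # same concept, different unit
    "tumor_stage": ["3c", "4"],      # same concept, different coding
})

# Every field needs an explicit, hand-built mapping before rows can be pooled.
harmonized_b = pd.DataFrame({
    "patient_id": site_b["subj"].str.upper(),
    "age": (site_b["age_months"] / 12).round().astype(int),
    "stage": site_b["tumor_stage"].map({"3c": "IIIC", "4": "IV"}),
})

pooled = pd.concat([site_a, harmonized_b], ignore_index=True)
print(pooled)
```

Multiply that mapping work across every column of two real ovarian cancer data sets, and two months of effort stops sounding surprising.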
Lack of training for researchers. I had a meeting with our president, and he asked me: why aren't people learning about coding in grad school? It's a great question. I don't know why they're not learning coding in grad school; I think more and more people are, but it's taking a while. So the Carpentries come into play here. We do support the Carpentries at the Hutch, we have an instructor on campus, and we think a lot about how to bring up the level of researchers working at the Hutch. We also have a dedicated training arm called fredhutch.io, run by the person who leads our Carpentries work, and she thinks a lot about how to build the skills that researchers need to make these data-intensive discoveries.

There's also a lack of training among clinicians. Data and analytics came to the med school curriculum around 2015; figuring out how to access and interpret all the data is not a skill that most physicians learned in medical school, and in many places it's still not being taught. This is something we've seen: we have a clinical research data arm, and they are notoriously bad with their Excel spreadsheets; they don't know how to handle their data. They are very focused on the patients, and there's nothing wrong with that, but it does come at the cost of not having data we can analyze effectively.

And then, finally, the wrong incentives. Journals are still the major incentive for most cancer researchers. We just had a lovely discussion from the panel earlier where we talked about all the different ways you could try to tackle this: data citation, tenure and promotion, thinking about new metrics, and having funders and institutions really change the conversation around incentives. But that's gonna be really hard, and it's an uphill battle.

There are signs of progress. We did adopt protocols.io at the Hutch. We have Nextflow, a workflow system, being implemented, and JupyterHub and Jupyter notebooks are being promoted. So there are lots of signs that things are looking good. The NIH Cancer Moonshot now requires that grantees make their papers immediately free, which was also very exciting news. And generally, things like pediatric cancers are really a beacon of hope in this space; we heard Casey talk about this quite a bit, but they really are leading the charge on data sharing, and Alex's Lemonade Stand was mentioned in this article, so I highlighted that. And the NIH is getting involved, right? Lisa Federer, two hours ago, talked about all kinds of great things the NIH is thinking about as well. So there is hope moving forward, and with that I will wrap up.

Any questions for Dr. Strasser?

So, if you got a $10 million grant for the next five years to use as you see fit, what are some of the things you would do in addition to everything you're doing now?

That's a really good question. $10 million. It sounds so silly, but I feel like at the Hutch in particular, everything comes down to communications and marketing. It comes down to selling the idea to the researchers and educating them around this space. So, giving researchers grants: for instance, Julia Lowndes at NCEAS is doing a lot of really cool work around helping researchers move their workflows into the open science space, and she's doing that through a lot of collaborative work.
I think giving people grants to really push their labs into the open science space could be a really interesting way to use that money. Finding faculty internally who are willing to advocate for it, and developing really good materials for promoting open science and advocating for it within the Hutch, I think, would be huge.

Any other lingering questions before we move on? Okay. All right. Thank you, Dr. Strasser.

Okay, so for our final presentation, we wanna welcome Dr. Lynn Schriml. She is an associate professor in the Department of Epidemiology and Public Health at the University of Maryland School of Medicine and the Institute for Genome Sciences in Baltimore, Maryland, and she's gonna be talking about open genomic data: it takes a village.

I wanna thank the organizers for inviting me to talk. I'm gonna tell you a story about a group that I've been involved with for about 15 years. So walk back a little bit in your mind: 15 years ago, we were just on the cusp of a data revolution. Soil and water genomes were being sequenced in ever-growing numbers. Then walk forward to today and the amount of data we've been talking about here. As this preponderance of data was coming in, however, we had a common problem, and that's where the village comes in. All of us producing the data, the data repositories, the strain archives, we all had a similar issue: we didn't have a way to compare our data. We could compare our sequence to another sequence, and that was great, still is great. But we didn't have any metadata standards, or any way to compare data sampled today with data sampled ten years ago, across five different studies. We just absolutely didn't have a method to do that. So I'm gonna share with you today the story of how our community came together and has evolved over time to put standards in place, so we could compare our data and continue to do so.

What I'm gonna talk about is the framework that has evolved over time in genomic science, and it has many facets. One of the biggest is a community of researchers. You need stakeholders in the room willing to spend their time, willing to spend their postdocs' time as well, and their funds, to put these things together. That's a very big commitment by many people. And in order to do that, what we had to do, and continue to do, is look at the different environments we were working in and the different types of studies we'd be facing. We knew the Human Microbiome Project was a few years in the future as well, so we were thinking of things like that: modeling what our environment is and how we would put that into a system we could all reuse. Another facet is literally spending the hours hammering out the standards: what could we identify that was common across all of our groups? We literally spent a couple of years meeting twice a year, for a week or so, with meetings in between, to figure out the core data elements that we could all compare: the elements that, no matter what study you were doing, you could fill in. It's not great to go look in GenBank or the SRA or BioSample and find blank fields. You can't compare blank fields, right? Null is the worst thing we could probably have, other than free text, maybe. Anyway, we needed to be able to pull this data together, come to consensus, and be able to grow as a community. The third facet is the data repositories. These are vital partners in this work.
So maybe you build a standard for the product in your lab, and that is fantastic. However, if others outside your research community are going to use it, you have to find a mechanism for them to get to it, and then you have to have a mechanism for them to query that data and do data comparisons. I'm sure you've all been to conferences where the Human Microbiome Project has been the standard, the data set that everything is compared against, right? In order to be able to do that, you need standardized data collected throughout the study, put into the systems, and able to be queried back out. So these are three very important areas, I think, that we have to cover, and we have to work together as a team. That's why, again, the village. And these are a couple of photos of colleagues I work with.

I'll talk a little bit more about our Genomic Standards Consortium; this is the group that is the back end of this work. Here is our website, where you can find information about our group. We're an open community; we have about 500 researchers now, across the world, who work with us. We developed the standards. Initially it was a lot of the work we were doing in our own community, like I was saying, soil and water sampling, but it's evolved quite a bit over the years. And again, it's been about building the stakeholder community, having the coders in the room along with the biologists, so we knew what we were talking about, and expanding along as the science expands. One example is single-cell genomes. That area of work has only really come to the fore in the last couple of years, but they again needed a way to compare the data they were collecting and to make it rigorous: not just against the samples they're collecting that day, but against data sets collected years before and, hopefully, into the future. So that's been driving us for quite a while. One thing I can tell you, and I didn't put the slide in for today: with the standards we've produced over the years, and the ones that continue to evolve, we now have over 600,000 records annotated to these standards in the BioSample repository. We're always working to have more of them. But again, you have to get the community to decide this is worthwhile for their effort. You have to make it reasonable for them to get the data in, for it to actually grow and have legs. And fortunately, it's grown over time.

Just to give you a little insight into what we've been doing: we now have over 20 of what we call packages. When we started, we had to figure out those 10 or 12 core terms, say, the location of the sample you're collecting, or the type of sample: was it soil or water or blood, something like that. Those are key elements we had to get people to agree to, but we had to make it modularized, so that if someone was then gonna study the Great Barrier Reef, or run the Human Microbiome Project, these modularized sets could grow for them. Say you wanted to conduct a study of the people in this room, their gut microbiomes, but then also look at the room itself, the built environment, and compare those microbiomes. If you don't have core elements to connect them, it's not so easy, right? So that was the idea, and that's how we've continued to move forward. One of the core aspects of our community: I said that in our early days we met twice a year, and we literally sat in rooms and coded and figured out the standards.
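To make the core-plus-packages idea concrete, here is a minimal sketch in Python. It is an illustration only, not the GSC's actual tooling, and the field names are simplified stand-ins for real MIxS checklist terms:

```python
# Illustrative MIxS-style check: a small core every study must fill in,
# plus modular "packages" that extend it for a given environment.

CORE_TERMS = {"sample_name", "collection_date", "geo_loc_name", "env_medium"}

PACKAGES = {
    "water": {"depth", "temperature"},
    "human-gut": {"host_subject_id", "host_age"},
    "built-environment": {"building_setting", "surface_material"},
}

def missing_terms(record: dict, package: str) -> set:
    """Return required terms that are absent or blank; blank fields are
    exactly the un-comparable nulls the talk warns about."""
    required = CORE_TERMS | PACKAGES[package]
    return {t for t in required if not str(record.get(t, "")).strip()}

sample = {
    "sample_name": "reef_site_03",
    "collection_date": "2019-06-14",
    "geo_loc_name": "Australia: Great Barrier Reef",
    "env_medium": "sea water",   # ideally a controlled ontology term (e.g. ENVO)
    "depth": "5 m",
    "license": "CC0",            # a clear usage license aids reuse
}

print(missing_terms(sample, "water"))   # -> {'temperature'}
```

Because the gut-microbiome and built-environment packages share the same core, a study of the people in this room and of the room itself could be connected on those common elements.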
But we continue to work on a monthly basis with a smaller working group, because we need to reach out to communities and make sure the standards are built in a rigorous way. This is a critical part of what we're doing, because the standards we develop and put into these systems are part of the INSDC, so ENA and NCBI. They have to have backwards compatibility between versions, and we have to be able to grow over the years as people have new terms they wanna include. So this is a group that continues to work. One of our board members, Ramona Walls, leads this group. We have open meetings every month, the last Monday of the month, and we're open to anyone joining us; that's why I put it in the slides, and it's also on our website. So please, if you have an area of work you want to develop, we're very happy to have you join us.

And just to give you some insight into how we evolved: outreach, again, is a very critical part of what we're doing, and publishing is one way we do that. As we work with different groups to develop their standards, they come together as a community. You can see it in the number of authors on this bottom paper on the left-hand side: that's the uncultivated virus community, which came together in the last couple of years and wanted a way to capture the data that was most important to them. They came together, had lots of conference calls, figured out from all the different corners of the community what would be important to collect, and then published it. This is a way for us to reach groups we've never talked to, getting out of our own silos, our own little groups that love to see each other and work together when we meet. We want to make this available to a larger audience, and I think it has proved to be a fairly fruitful way of doing it, because it does reach that larger community and gives people ideas: maybe this would work for them, maybe it saves them time developing a standard, or maybe there's something new we've not worked on that they can help us grow.

Another critical member of this framework is the journals. And not just the journals themselves, but all of us. You've submitted papers, you've been reviewers, you're editors on journals. These are critical components of this story: the policies journals put in place are a team effort with the communities, but as editors and reviewers, if those policies are in place, we have to enforce them. We have to actually ask: is the data availability statement in place? And if it is, does it actually link to an active record in the SRA or in GenBank? And if it doesn't, push back; make them submit the data in a proper way. If they've submitted it into a database at their own institution, that's great, but they also have to put it in a place that all of us can access. That's truly open science. And again, we continue to push on these things. What I'm showing you here is a number of the journals that I and members of my board have worked with to get these policies in place for the genomic sciences, and we continue to grow this area. It helps to have some big groups, though, like BioMed Central with their suite of journals, as well as the Springer Nature community: they have so many journals underneath them, and they are putting these policies in place.
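Lynn's point that editors should verify a data availability statement links to an active record can even be partly automated. Here is a minimal sketch, assuming NCBI's public E-utilities endpoint; the accession used is just the well-known example run from the SRA documentation:

```python
# Minimal sketch: does an SRA accession cited in a data availability
# statement resolve to at least one live record? (Uses NCBI E-utilities.)
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def sra_accession_exists(accession: str) -> bool:
    query = urllib.parse.urlencode({"db": "sra", "term": accession})
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{query}"
    with urllib.request.urlopen(url) as resp:
        count = ET.parse(resp).getroot().findtext("Count")
    return int(count or 0) > 0

print(sra_accession_exists("SRR390728"))  # a public example run -> True
```

A reviewer or editor still has to judge whether the deposited data actually backs the paper, but a check like this catches the dead links that leave readers completely frustrated.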
Having those big publishers also put up sites explaining why the standards are important and which databases we consider open repositories, that kind of PR spreads the word, right? We need to spread the word to larger groups. I mentioned the data availability statement earlier; this has been a critical element, I think, in keeping genomic science open. We push on this. At the journals I work on, we won't publish if it isn't filled in, and when it is, we check the data to make sure it really does become available. Because it's not great if you say it's available, a person goes and clicks on it, and they're completely frustrated because they can't get to the data. So we're really pushing on this.

One other thing I want to point out is FAIRsharing.org. If you haven't visited the site, I encourage you to do so. This is a community that has come together to look at standards across the board, along with protocols and policies, putting them out there so you can find them and find out what other resources they're linked to. It's another way for us to get beyond our own communities and see what else could be used.

Now, licensing is one other thing I want to point out. In genomic science, yes, we have to put our data into the SRA when we sequence, right? That is a mandate. However, in our community, we don't have a policy in place for what the licensing on that sequence data is. This is an article a number of us put together in Science last year; it's a position piece by the community, because we've come across this area where you can have data that's required to be open, right? It's in the database, granted, but it's not free: license restrictions are something people can put on their data, even data that is paid for by our tax dollars. So that has to come from the community, right? The NIH and the funders can put in requirements and suggestions; however, our community itself has to speak out and say that open licensing is important and critical to us for doing our analysis. And again, this is getting pushed; I think it's important.

We do come across a paradox, though; it was brought up earlier. You have a large-scale data set, and you're funded for it, so you're gonna put it into the SRA, right? But you have postdocs, and you want to be able to publish on this over the next five years. What do you do so they don't get scooped? One thing we're working on right now with the journal Scientific Data is a new type of publication. It will outline: I have this new grant, I'm gonna be producing this massive amount of data, and this is what we're going to do with it. What this solves is that you're gonna have a citation for those SRA sequences. That's the problem right now: you have to put the data into the SRA, but there's no way for someone to cite it. The DOE is struggling with this right now; the DOE put in a new policy this last year that sequencing data has to be released immediately, because, well, it's publicly funded data, right? So what this new publication type allows us to do is have a paper that can be cited from the initial stage. You can state exactly what you're going to do with the data and your plan for doing it, and then if someone else reuses it, they can't scoop you, right? That's always the concern. But what I find with these large-scale analyses, and again the idea of "data parasites" comes up here, is that maybe someone wants to look at all the viruses in the world that come from a certain environment, right?
They're not necessarily trying to do the same study the submitters were doing, but these papers put a citation in place: we have a way to cite the data, and you don't have to have the submitters as co-authors. Under the DOE policy right now, if you wanna reuse data from one of those studies, you have to contact every single submitter of that data, and that's not tenable. So I think this is actually a good solution going forward.

I don't have to touch on FAIRness too much; thankfully, it's been brought up a number of times today. I think it's a great guideline for each of us to think about as we're doing our work. How can we make our work more FAIR? How can we make data sharing work for our community? This is really, for each of us, thinking about how it could work for us, how we can make our data more FAIR; I like that aspect of it. For the genomic sciences, there are many very practical ways you can do it: you can use a standard, use a biomedical ontology, put the data in open repositories, put a clear usage license on it. There are many small steps we can all take that make the data open and reusable, and these are, I think, quite feasible for all of us.

So again, coming back to my point at the beginning: it is a village that does this work, 500 people over 15 years putting standards together. And I did borrow a little from the Three Musketeers: all for one, one for all. It is very much the case that we are in the same battle. What I'm showing you on this slide is the groups we're working with right now to put their standards in place. The agriculture microbiome community: we've almost got that one finished, and they've put out a position paper where they got the community to give them feedback. The parasite microbiome paper was just published in PLOS Pathogens. Again, these are various communities that have large-scale data sets but don't necessarily have an expert in the room for how to put the metadata together. What our role as the Genomic Standards Consortium has evolved into, in some ways, over the years is helping these communities get through this trouble: the difficulty of what terms can I use for my standard, how can I reuse terms from other standards so I can compare data across them, and how do I get it into the INSDC without creating a new pipeline? Thankfully, these are in place, and we're always happy to work with new communities to help them put their standards in place.

And just one last slide: we have our annual meetings for the Genomic Standards Consortium, and our focus this year will be on precision medicine, agriculture, comparative genomics, and metabolomics. The reason we work on these is that we want the communities to talk to each other; we want the standards to evolve. So please join us, if you're interested, in Thailand next July. It's open to anyone who wants to come, and registration opens on our website in January. Thank you very much.

Thank you all; thanks for a fascinating set of talks. One common theme that seems to keep popping up is the necessity of engaging the broader community to invest in these types of efforts. So I'm wondering if all three of you have identified strategies that seem to work, or that seem to help foster community engagement among researchers all over the world.

So, one way for engagement:
We have to make it so that they feel they're gonna get something positive out of it, that their biology is reflected in the standards, and that they're their standards. The viral community is a great example: they developed the standard. We helped them, but they developed a standard that was important to them, and if it captures the data that you need to capture, that's where you get the buy-in. And the number of references already: it was only published in January, and we're getting a large number of citations already, which means the community is engaged with it and finds it useful. You're also setting an example: if the majority of key players, say in the uncultivated virus community, are all publishing with this standard, then when people go look for what's being used, you're leading by example. So I think that's a critical component. Getting people in the room to talk to each other is essential to all of this, I think.

I will say that it was a lot easier when I was a funder, because I could just make people do it. But unfortunately, that doesn't work as well in my current situation. I think that peer pressure is a really valuable component of all of this. If communities really buy in, and they start talking to their colleagues and encourage their colleagues in a positive way to participate, then I think that's really the gold standard in terms of getting people involved. And I'm really seeking out faculty at the Hutch now that I can tap for this, because I feel like, yeah, they're not gonna listen to me, but they will listen to their colleagues.

At protocols.io, because of the challenges we talked about and the lack of time, there are many, many things we've been thinking about, trying hard on the user interface. There can be simple things. We realized in the beginning that most people who had heard about protocols.io heard about it from me talking about it, so they had seen a demo and they knew that you can ask questions about a protocol and the question goes to everybody who's using it. Then we realized, as it grows, most people come to protocols.io from Google now, and they haven't heard me present, so they engage with it like a research article, like it's a PDF. And a simple interface change, putting it in your face that you can click here to ask a question that will go to the author, to everybody who has bookmarked it, and to the whole community the protocol is part of, made a huge difference in participation and conversations and community engagement. So there can be small, tiny things we can do just by making it easier to engage in the conversations. But for us, there's no silver bullet. We have an ambassador program: on five continents we have students and postdocs that we get on monthly calls with. They're the early adopters; they start communities on protocols.io and bring people in. When universities like the Hutch and CMU sign up, every researcher from the university can send us a protocol and we put it in for them. So the onboarding, making it easy. And always, for us, and I will stress this again, it's being obsessed with: you're busy, so how do we make it easier for you to engage and be part of the community? And it's really hard.
Hi, I noticed that all three of your talks touched on the challenges for early career researchers, and how our open science efforts can sometimes put a greater burden on them; there are already so many more pressures on them today than there were 30 years ago, and these sort of compound that. So my question is, one, do you have any additional thoughts on that, and two, are you aware of any efforts to address this challenge? That could be things like additional training, changes to doctoral curricula, changes to postdoc service or requirements or training, collaborations with organizations like the Council of Graduate Schools, anything like that.

Yeah, it's definitely a burden, but I also think of it as similar to the conversations around diversity, equity, and inclusivity, where the burden often falls on the younger researchers to bring this to the forefront of their organizations. I would argue that, yes, it's a burden, but it's also how we should have been doing it all along. So I think we have to work on the education piece, particularly for the senior researchers, and explain to them how critical it is for these processes to be in place and for researchers to be using them. What we find a lot is that the faculty members who run the labs are not involved, as we talked about, in some of these more labor-intensive open science practices, but their graduate students and postdocs are really excited about them and really want to implement these systems and make this happen, the same way as with DEI work, and they just need to be given the space to do that. And I think that's really where we have to think about incentives, and also about educating the people in leadership positions so that people get credit for these open science activities.

Just one other point. I'm at a sequencing center, right? So one other thing to think about is: how can you put the standards, the open science aspects of your work, into the infrastructure, so it becomes easier? In the early days, I was the metadata person; it was one person in a small group scraping the data. We've evolved quite a lot: it's part of our LIMS now, so the standard pieces we need are captured as the sequencing is being done, and it's also part of the infrastructure for the analysis, the workflows. So on the analysis side, they don't have to do it by hand anymore, and I think this is actually a big win. There's a researcher, Rob Knight, who does microbiome data, and he's got a tool, QIIME, that keeps evolving, but he's put it inside the infrastructure, so people don't have to play with it and figure out what it is anymore; it's become part of the system. Getting the open science aspects that we need into our systems alleviates some of the burden on the students. No one has to scrape PDFs anymore, for this part of it anyway, as it gets into the infrastructure. I would encourage that.

We'll hear about Binder today. The same way that Microsoft Word replaced the typewriter when you were typing up your thesis, technology should enable, and it should save time for those students and postdocs. And when we talk about metadata, it's so important and it's so hard, right?
When you're depositing data, if you have 50 fields to fill in, but you're already keeping all of that in your electronic lab notebook, what about API-based integrations where you click a button and those fields get populated automatically, instead of you getting a Microsoft Excel spreadsheet or having to spend two hours depositing? So technology, I think, is part of the solution; it should be saving time rather than just becoming an extra burden. The other part that makes me optimistic: there are signs of progress, like Carly said. Think about the Alex's Lemonade Stand Foundation, what Casey was talking about: they're looking for collaborative people, right? They're changing incentives. A couple of weeks ago, a new neurodegeneration funder, the Aligning Science Across Parkinson's (ASAP) initiative, went live, and their first RFA, the pre-proposal, actually asks you to talk about your collaborative history. Not just a vanilla statement that you will collaborate, but who have you been collaborating with before you submit a proposal? How open are you? Do you publish open science, open access? They will score you on that and then invite people to submit the actual proposals, right? And I see more and more funders and foundations trying to encourage that, the Chan Zuckerberg Initiative as well. So I think there are reasons to be positive; it's not all just hands up in the air, like this is impossible.

I think I'm next. When people adopt practices like those taught by the Carpentries or protocols.io, or take on metadata schemas that are shared, it seems, from what we're hearing today, to have an impact at two different scales. It can make the lab more efficient and make it easier to collaborate with future you, as Lenny said, and it can also enable these meta-analyses that we think of as data science, big data, reusing the data in a way and at a scale that wasn't anticipated. Is either of those efficiencies a stronger sales pitch to the senior people we need to incentivize to adopt these things? Are they both part of the story, or is one a stronger argument for getting buy-in?

I'll take a stab. In my experience, in the biomedical space that I work in, which is limited, the more senior the scientist, the further to the left they are on that graph, and the idea of big data or data science... I mean, I have had to try to describe what data science is to senior people in our organization more times than I care to think about. My boss and I both started calling it data-intensive research, because it seemed like "data science" was just too much for them. So it's really hard to sell that aspect. They know it's important, they know they're supposed to care about it, but, and I know this is being recorded, though I'm pretty sure no one at the Hutch will watch it, I have a theory that it makes them feel not smart, right? When you start talking about big data, when a presentation like Casey's gets shown to a lead investigator, that's so far from the areas they're comfortable working in that I think it ends up being a bit of a tough sell. I have heard researchers, including program officers at the Moore Foundation, tell me: I got through my education and training just fine, I'm now a tenured Harvard professor, and I never did any of that. So who cares?
And so there's a problem with focusing on that as the only way to try to convince those senior folks. I do think those senior folks are going to need grassroots pressure from their labs, people insisting, and I think they're also going to need mandates.

I think we have an opportunity with large, not consortia, but conferences. The American Society for Microbiology has for a number of years held town hall meetings discussing the issue of how we get the various projects working on microbiomes, metagenomes, standards, and data sharing to federate into a larger system. Over the last two years that's actually evolved, and at town hall meetings just like this, people are willing to put their hands up and say: this is my problem, this is my issue. It's gotten to the point where we now have national funding, it's called the National Microbiome Data Collaborative, for the next two to three years to actually reach out to communities. But I think it's a groundswell. Like you said, it's got to be grassroots; people have to have buy-in. You've got to get the people in the room who are doing the funding and signing the checks, and also the people running the labs. That's what that community really worked hard at for a number of years: getting leaders in the room as well as the people who are going to do the work, and then getting the incentives. So DOE gets it, thankfully, and there's funding for that. I think NSF is coming on board as well; they're very positive about it. And we even have some NIH people involved. So I think we have to keep talking. Talk to your program officers, make the point to them. Every time you give a talk somewhere, talk about data sharing and open science. Make it part of what you're doing and show the positive side of what you get out of it, and what they will get out of it. I think that's how we have to PR ourselves a little bit and do our own marketing.

My perspective may be a little specific to protocols.io. We don't have as hard a time convincing faculty or younger researchers, because it's not just about reproducibility; it's also about, oh, people graduate, they leave my lab, where are the methods, right? So the faculty get it and want people in their lab to use it, the same way the students get excited. But what we've noticed on protocols.io is that adoption spreads much better within a lab if it comes from a student or postdoc. So when the faculty get excited and say, oh, I'm going to tell my lab to use it, we actually tell them: they're probably going to ignore you. There are some labs where the PI says this is how it is and runs it like a biotech, but that's a minority. We have a lot of instances where the faculty say you should use it, and the reaction to something coming from the PI is: what does he know? He's writing grants, he's not in the lab. Who is he to tell me what to do day to day? He's not actually here. So what works better for us is when we tell the faculty: at your next lab meeting, let us present for five minutes via Skype and answer questions. Then, instead of a PI telling the postdocs and students to do it, the students go, oh, this is great, I should use it, right? So we are sensitive to those considerations, and for us specifically it tends to be the students and postdocs who are the adopters and who spread it.
Yeah, we got buy-in the same way: the people producing the data started the conversation, we showed how much we needed it, then we proselytized, did our own PR, our own marketing, got it into our system, and then pushed it out at scale.

Yeah, and I think that does work. I need to mention, first, that somebody explained to the buildings people that there were people here today who are not from Scotland, and therefore they turned the heating on; that's the noise. Lenny made the point in his remarks about the role of the libraries and the library faculty, who are not necessarily librarians but are experts in this field, as people who can draw together a lot of the institutional infrastructure. I'm conscious that in some universities that hasn't quite kicked in, but as they realize the books are going away and they need to find something else to do, that trajectory will change. I think one of the benefits of our approach has been our ability, as independent neutral brokers, to evangelize. And certainly the early career researchers are the ones demanding this, because they have grown up in an environment where their lives have largely been open, and it's normal that their professional existence should take the same course. Conversely, we know that a lot of senior PIs just don't get it. What we have done, and I'm going to look at Anna and hope it's been helpful, is work with my fellow deans and with the vice president for research and others to come at it from the top down as well. Certainly the grassroots movement is critical, but bypassing the obstacles and going to the people who have resources to throw at things, and who also have the power to convene in their colleges, is quite important. I'm not expecting responses; I know there was another question, but I just wanted to make that point. I think it's a dual approach; it has to be worked at many layers.

Yeah, there's one more question. Maybe now, yeah.

So, in response: there have been a couple of remarks about scientists who were trained longer ago having less of this knowledge, but a lot of scientists of my age have the same gap. I started grad school about seven years ago and finished a year ago, and as a neuroscientist and life scientist I was never really taught, in grad school, in my master's, or in college, how to do data analysis in a reproducible way. So there are projects like Dr. Julia Stewart Lowndes's. She's an ecologist and oceanographer, and she was in the Mozilla program with me; her entire project at Mozilla was building a training program for other researchers in oceanography, to help them learn how to use GitHub, like creating GitHub repositories and onboarding protocols for researchers coming in. And she saw incredible results just from having this kind of cohort-based, collaborative training. And yet the project struggled to get funding. So I always hear how software is everything, and I agree with Lenny that software really does and should enable, but we also need to encourage these projects that help other scientists, not shame them, and actually get them on board. And her champions, as you call them, were not from the younger generation we were talking about; they were, like, in their 40s.
And so I think that in open science discourse, remarks about how the older generation can't do this should be accompanied by thinking about how we can actually help them. This was just in response to what you all were talking about.

I love Julia's work, and we've had her visit; I'm trying really hard to get her to develop that kind of system for biomedical research and bring it to the Hutch. That would be super awesome. But yeah, we have fredhutch.io; this training program has Intro to GitHub, Intro to Python, Intro to R. They're all four to ten hours of coursework spread out over a couple of weeks, and there's always a waitlist for these courses. And we hear from a not insignificant number of graduate students and postdocs that their advisors don't want them wasting time on those courses. So yeah, I agree: what we need is more training and more advocacy. Those researchers in those labs have got to push back, and we've got to find ways to advocate for them to get that experience. It's not that everybody from an older generation doesn't get it; we just have to make sure they're seeing the value and pushing their students, their postdocs, and their staff scientists toward it.

And you mentioned neuroscience, right? You would think something like preprints, considering the increasing length of time to publish that I showed, would be universally welcomed by senior and junior folks alike. But I see variability, and it's not just in the biomedical fields: in physics, the adoption of arXiv, which started in 1991, is not the same across subfields. And I think there's also variability not just between generations but between fields. The bioinformatics, model organisms, genomics community that I come from as a yeast geneticist has, I think, embraced bioRxiv. Neuroscience, I think, is a little more hesitant. At the kickoff for the neurodegeneration grantees of the Chan Zuckerberg Initiative about eight months ago, they had an open conversation around preprints, and there was a lot of hesitation, like fear of getting scooped, and CZI was leading these lunch tables and addressing it, teaching people about bioRxiv, that it's not a threat and that it's good for you. So this training and support from universities and funders, it's really simple, common-sense stuff, but getting people on board is work; it requires attention and winning hearts and minds.

Yeah, one last comment on that. Community building is so hard, and Julia can talk about that, I can talk about that. Part of this is community building: getting people credit for community building, giving people help, and recognizing that it's extremely hard work that you can get fatigued by very easily. I think that's also part of the conversation we're having.

All right, let's thank our presenters one last time. Thank you.