Hey folks, this week I'm in Washington, DC at the American Society for Microbiology's annual Microbe conference. Earlier today I was lucky enough to receive the inaugural ASM Microbiome Data Prize, and I was so proud to receive it. The prize was also sponsored by the National Microbiome Data Collaborative, the NMDC. They do a lot of great work to encourage folks to release their data and make it publicly accessible, but more than that, they put a lot of effort into helping people think about metadata and how we can add more value to the sequence data and other types of data that we generate in microbiome research, and more generally. So I hope you enjoy this talk, and again, thanks for all the well wishes I've received. I really appreciate the support for our efforts to engage in open research.

Thank you very much to the organizers, the NMDC, and ASM. It is a great honor to receive this award. I hope I'm not reading too much into it, but it's a particular honor to be the first recipient. Those of you who do open science sometimes wonder if you're fighting a battle that's not worth fighting, so receiving awards like this really does make it worth it, and it is a good indicator to our trainees that this stuff is important. So thank you very much.

Here's a question I want to ask you all to motivate the rest of my talk. I don't know if you're familiar with a guy named Simon Sinek, but he has a book, and probably TED talks out the ears, that basically asks you to start with why. So why do you do science? I don't know, but if you're like me, when I answer this question, I do science because I am on a search for truth. I want to know how the world is put together and how things work. I think the world is beautiful, and I want to understand how that beauty translates into ways that we can help each other and help the environment. But to do that, I can't do it on my own.
I have to build upon the work of others. I have to stand on their shoulders so I can see further, and I know then that I have a responsibility: others have to be able to stand on my shoulders. So I need to make broad shoulders; I need to try to be tall, so to speak. Maybe we can push this metaphor a little too far, but we all build upon each other. If you're doing closed science, if you're making everything proprietary and saying the data are mine, they're not yours, then you're really narrowing your shoulders. You're making it harder for others to stand on your shoulders and to help us see further.

We have big problems. I don't know if you've noticed over the last couple of years, but we have big problems, and science, I think, is at the heart of solving them. So I've got a number of reasons, and this is a very partial list, for why I think open science is important, and we'll spend a few slides talking generally about each of these points.

I think open science is important for the rapid release of information. I'll start with the picture here on the right. Three years ago, when I think we were in San Francisco, I was walking around with a huge weight on my shoulders because I had just been diagnosed with Hodgkin's lymphoma. I hadn't told anybody, and it would be a few more months until I told anybody. It was just a huge burden on me, because what would it do to my family? The next week my wife and I went to see our oncologist, and she brought in this conference abstract that had not been peer reviewed and had this single figure on it. I looked at it and thought, okay, that's a survival curve. I don't know what all those abbreviations mean, but the red line and the black line, which you can see kind of tracing each other: the red line is radiation alone and the black line is radiation plus chemotherapy. And she said, based on this 20-year study that was just presented at a conference without peer review, I
don't think you need chemotherapy for the type of cancer I had. So this impacted my life dramatically. It meant that I could go into the pandemic without having to recover from chemo, and it meant that my family wouldn't be so burdened. Releasing data affects people. And if you haven't noticed, there is a pandemic going around, and the availability of preprints and open access publications is instrumental in getting information out to our colleagues. Imagine where we'd be with a vaccine or therapies if all these scientists hadn't been releasing their genome sequences. We would be in a horrible position. I'm not saying that preprints have solved all the problems, and they've probably caused some problems, but I think they certainly help to move science forward and allow us to build upon each other. We also need to think about this idea of developing community resources, like sequence data and the great microbiome data that we've been working with.

I also think open science is important for reproducibility and replicability. When we do a project in my lab, we always start with the last project as our starting point. I think that's what we all do in a way, but I want to see that the people in my lab and I can recreate the last result we had, because then we want to build upon it. If we develop a new tool, we want to see what people did previously, and we want to validate it using their data, because we don't want to do a bait and switch and make their program look worse with a different data set. We want to be able to reproduce what they've done. This table comes from a paper that I published in mBio a few years ago, talking about reproducibility, replicability, robustness, and generalizability. So we can think about the same methods, and, going across, we can think about different data, so
same data or different data. So I can look at the biomarkers for colon cancer in my cohort, and that would be reproducibility. Then I could look at other cohorts, and that would help with replicability. And as we use different methods, different angles of looking at the problem, we can think about robustness and generalizability. But that all starts with getting the data. To expect me to generate data for multiple cohorts' worth of samples is just a waste. It's just ridiculous. And it would be tremendously selfish for people not to share the data, because that slows our progress forward. And of course none of us has all the good ideas. You all are smart, you're here after all, but none of us has all the good ideas.

Ways that we've thought about this in my lab include, as I've mentioned, benchmarking new methods, and the ability to benchmark those methods with diverse data sets. I'll talk more about that in a few moments. We're constantly coming up with new analytical methods, so being able to get access to previous data sets or previous samples is really important. It's something I've depended on: using samples from biomarker studies where the PIs had no idea what the human microbiome was, but because we had a new strategy, we could go after those. And then of course we all have new questions, questions that the original investigators could not have foreseen.

I think open data is also really important because huge investments have been made for every paper that gets published. We talk about out-of-control article publication costs, a few thousand dollars just to publish a paper. Well, if you look at NIH funding and the number of papers that come out of each NIH grant, and consider things like indirect costs, it's about a quarter of a million dollars for a paper to get published. When you consider personnel, money, reagents,
getting samples, sequencing, and publishing, that is not cheap. Beyond the financial costs, there are also intangible costs. Again, if I go out and recruit a cohort of people with and without colon cancer, I'm going to go up to people who have just found out they have colon cancer. How many times do you think we want to do that to those people? We want to minimize that as much as possible. We're also working in environments that I would consider scientifically sacred. Whenever I see that somebody has gone off to the beautiful Galapagos Islands, I wonder, do the Galapagos Islands need another person traipsing around screwing up that environment? Probably not. So we need to be efficient with the data that we use from these scientifically sacred sites.

I think what this also comes back to is, what does a published paper represent? Do you see the papers that you publish as the final answer on whatever you are studying? If you think that, I don't know, I think maybe you need to change your perspective a little bit, because really we're all on a step towards something next. Even Nobel Prize-winning science is a step to something more. What our papers represent, I think, is an advertisement: an advertisement for our ideas, for our data, for our methods, for our reagents, for our tools. And if we think of our paper as an advertisement, well, I want to sell my ideas. I want people to read my papers. I want people to use my data. So we really need to get out of this proprietary framework of "I've published this paper, you don't have rights or access to the data or methods in it, it's mine."

Needless to say, not everyone agrees with this, including the editors of the New England Journal of Medicine, who a few years back wrote this editorial on data sharing. This is one of the things that Twitter just loves to go nuts about. In it they write that there is a concern
among some frontline researchers that the system will be taken over by what some researchers have characterized as "research parasites." I know people use words loosely, but I consider myself a microbial ecologist, and a parasite is an organism that benefits at the expense of something else. It's a positive-negative relationship. So if I make my data available to you, does that hurt me? No. That helps me. It helps me with citations, and it helps me get my ideas out. It's a mutualistic interaction, not parasitism. Needless to say, this kind of blew up on them. I don't know how much they've retracted, but it did give rise to another type of award, the Parasite Awards, celebrating rigorous secondary data analysis, which I think is just the appropriate response to claims of data parasitism.

One of the awardees is Julie Dunning Hotopp, an outstanding microbiologist at the University of Maryland, and she has a series of work that I absolutely love where she basically takes throwaway data and reanalyzes it. In this study, published in PLOS Computational Biology, she took cancer genome sequence data, and instead of throwing away the bacterial DNA as a contaminant, she included it in her assembly. Lo and behold, the bacterial DNA assembled with the human DNA, primarily in the chromosomal DNA. If you haven't looked at her studies, they're fascinating. There's another where they find entire chromosomes of Wolbachia in Drosophila genomes, and then they go back and do further studies, not just sequence analysis, to show biologically that this is the case and to think about mechanisms. So the data they're borrowing is again a case where the original PIs and authors had no concept of this because of their expertise. But Julie is interested in bacterial genomics, horizontal gene transfer, evolution, and host-microbe interactions, so she sees the world differently than the oncologists do. So this is one example of
many. Of course, within our own microbiome field, there's what I think is a fairly popular editorial at this point: "available upon request" is not good enough. This editorial is basically a takedown of a European study that published data from thousands of samples, which they then kept behind a paywall, and they would make PIs jump through tons of hoops to use it. I've never used this data set because I can't get access to it. And if I can't get access to that data set, I'm not going to cite their work. What others have found is that when authors say "available upon request", and it's almost like you don't need a study to show this, because we've probably all emailed an author for data, 93% of the authors who say data are available upon request do not provide the data to the person asking for it. So this is just a non-starter.

So what are we talking about in terms of scientific data? The NIH has instituted a policy that for any proposal submitted after January of 2023, so seven months or so, PIs have to have a plan for data management and sharing. And they define data as this: "the recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings, regardless of whether the data are used to support scholarly publications." I think this is a really ambitious definition. As someone who has received this prize now, perhaps I shouldn't admit this, but that last clause, "regardless of whether the data are used to support scholarly publications", I'm not there yet. I need to improve. I need to get to the point where we're submitting our data even if it doesn't go into a publication. We all have the best of plans to publish something; perhaps it turns out to be a negative result, and those get pushed down a little bit further in terms of our data release. But we need to do a better job of getting all the data accessible. And I'm proud to say that within ASM's journals,
Microbiology Resource Announcements is very happy to publish data so that people can get citations for it, even if it's a negative result. In fact, you're not even allowed to describe results in a Microbiology Resource Announcements paper. Within ASM journals we have a data release policy where we say that it's expected that the data will be released to the public no later than the publication date of the final article. I would prefer to see this at the time of submission, because then we can put the enforcement on the reviewers and editors rather than the editorial staff, who don't have the domain-level expertise to go through and make sure that your accession numbers link up and that all the data are actually there.

In my closing moments I want to talk about the efforts that my lab has taken to both promote open science and benefit from it. A number of years ago, the SRA was a royal pain in the rear for uploading data to, and I think it probably still is, but I don't notice it, because we put a really cool function within our mothur software package called make.sra that basically streamlines the uploading of data using the MIMARKS data package standards. We did this in collaboration with the folks at SRA, and they were only too happy to help us. Since then we've also created tools in mothur so you can directly download and use SRA-formatted sequence data, and that's a tool that many have benefited from.

Also, because I recognize we're bringing a lot of people from other fields into the microbiome world, there's a lot of training that needs to happen. So over the years I've created a set of resources at riffomonas.org. I have a long tutorial series on reproducible research that has a heavy emphasis on data accessibility and data availability, and if you look at the numerous YouTube videos that I've been making about
reproducible data analysis, you'll see that we're also using a lot of open data sets to promote reuse of data.

I also find that data being available is tremendously useful for benchmarking new methods. This is a figure from a preprint that will hopefully be published later this year, looking at the effect of removing rare sequences from your community analysis and what effect that has on downstream processing. In these simulations we used data from 12 different types of communities and imposed an effect size, and basically what I was able to show was that as you remove singletons, doubletons, and rare sequences, you lose statistical power to detect differences between treatment groups. But what you can see is that across 12 different data sets it's not always a clear-cut picture. By looking across a large number of data types, we're able to get a richer, more complete story. And going back to my original bioinformatics tools, developing a tool called DOTUR before mothur, we've always relied on the generosity of others to make their data open so that we can do these types of analyses.

Over the years we've also engaged in doing meta-analyses. One such example is looking at the association between microbial community structure and obesity status in humans. This was an observation originally reported by Peter Turnbaugh when he was a student in Jeff Gordon's lab, and it's been a bit controversial in the field. So what Marc Sze, a former postdoc in my lab, did was to collect data from 10 different studies. Most of these studies had nothing to do with studying obesity, but because they made height, weight, BMI, and obesity status publicly available in their metadata, Marc was able to go through these 10 studies and study them separately for this question of the association between obesity status and diversity, to
show that, well, in two of these 10 studies we did see a significant effect. As you can see, the effect sizes are minuscule. And then, statistically, as in a meta-analysis, he pooled the effect sizes to show on this bottom row that overall, yeah, there is a difference between lean and obese individuals. Is that biologically significant? Eh, I doubt it. And then when we look at things like the ratio of Bacteroidetes to Firmicutes, we see nothing.

Marc then went another step. Like I mentioned before, making data available also allows us to do new types of analyses. So what Marc did was take these 10 data sets and separately generate 10 different random forest machine learning models to predict obesity status. For this first one, the first column, Baxter, this black circle is the accuracy of the model based on the Baxter data set. He then took that model and ran the nine other data sets through it to assess its accuracy in predicting obesity status. What you can see is that it's very noisy, and we do a really bad job of predicting obesity status based on the types of organisms in these microbial communities. Again, this would not have been possible without these PIs making their data and their metadata accessible. And there were studies that we did not include because the data and metadata were not accessible, or the data were available but no metadata were available, not even the data that would be necessary to reproduce the original work in the studies. They basically checked off the box saying they made the data available, even though it was completely worthless.

So, as I mentioned, providing the metadata is just critical. If you're submitting sequence data but not providing any other information about it, it's worthless. You're just wasting storage space. One effort, proposed about 10 years ago, is MIMARKS, the minimum
information about a marker gene sequence. For each different type of environment, what is the minimum information that we should be providing to add context to our data? More recently, within the human microbiome field, the STORMS checklist was created. Again, it's a checklist, a tool that we can use as researchers to enhance the utility of our data. And of course the NMDC has great resources as well for other ways of improving the utility of our data.

What I want to close with is a point of encouragement, a way that we can look forward. One of the big concerns people have about making their data publicly accessible is that it will allow our competitors to scoop us. Well, first of all, I think you have a five- to ten-year head start on your competitors, and if you can't publish your work in that time span, then maybe you deserve to get scooped a little bit. I would say that if you release your data, you're going to slow down your competitors, because now they need to analyze their data as well as your data to give more context to theirs. This is something that we're not doing a great job of in the microbiome field. There are numerous colon cancer data sets out there, so if I publish a new colon cancer data set, I should get in the practice of analyzing it in the context of these other data sets as well. I think that will further help move things along, but it's only going to be possible if we release our data and release our metadata.

Finally, whenever awards like this are given, sure, it's given to me, it's my name, but it's really the people who have been in my lab and have believed in this mission of open science throughout the years. I remember when we started going down this path, I think people were a little bit uneasy. Am I giving away my career? Are they giving away their career by releasing their data? And I think awards like this, the tremendous citations we've gotten on our work,
and the prestige in which people hold all my trainees are a testament to the fact that it does pay, and that it moves science forward. So thank you all.

So I hope you enjoyed the talk that I gave today, and we'll see you next time for another episode of Code Club.
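
The pooling step in the obesity meta-analysis described in the talk, where per-study effect sizes are combined into one overall estimate, can be sketched roughly as follows. This is a minimal illustration in Python using inverse-variance, fixed-effect weighting, which is an assumption: the talk does not say which pooling model was used, and a random-effects model is a common choice when studies differ from one another. The function name `pool_fixed_effect` and all of the numbers are hypothetical and are not taken from the studies discussed.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Pool per-study effect sizes with inverse-variance (fixed-effect) weights.

    effects:    one effect size per study (hypothetical values below)
    std_errors: the standard error of each effect size, in the same order
    Returns (pooled_effect, pooled_standard_error).
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect size")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # More precise studies (smaller standard error) get larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)

    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    # Hypothetical effect sizes and standard errors for four studies.
    study_effects = [0.10, -0.02, 0.05, 0.08]
    study_std_errors = [0.10, 0.08, 0.12, 0.09]

    pooled, pooled_se = pool_fixed_effect(study_effects, study_std_errors)
    # Approximate 95% confidence interval under a normal approximation.
    low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
    print(f"pooled effect = {pooled:.3f} (95% CI {low:.3f} to {high:.3f})")
```

The point the sketch is meant to illustrate is the one made in the talk: a pooled estimate can be statistically distinguishable from zero because many studies are combined, while the effect itself stays very small. None of this is possible unless each study's data and metadata are available to compute the per-study effects in the first place.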