everybody for coming. So let's get started with session one. Our first speaker is Al Presto. Al Presto is an associate research professor in the Department of Mechanical Engineering at Carnegie Mellon University and a member of the Center for Atmospheric Particle Studies. His research focuses on pollution emissions from energy extraction and consumption, and the subsequent atmospheric transformations that these emissions undergo. He also works closely with community groups to share air quality data directly with the public. So Al, feel free to share your slides. And I wanted to give a little overview of the format. We have three speakers in this session. They will each talk for about 15 minutes, and then we will invite them back for a panel discussion. So at the end of each talk, I invite the audience to ask questions. In fact, I encourage you to send your questions through chat during the talks, and more in-depth questions we can save for the panel discussion. So I can see your slides. Yeah, and you can see them? Yeah. Okay. And I always need to check: you can see the slides and not the presenter view, right? Yes. Great. Okay. Thanks. This morning I'm going to talk about how we approach open data in this center that we have, called the Center for Air, Climate, and Energy Solutions, or CACES. CACES is basically a big project funded by the U.S. EPA. And it is important to be able to share data within CACES because CACES itself is a big collaborative project. It's a multi-institution, international project. The leads right now sit at Carnegie Mellon, which is where I am, and the University of Washington, but we also have collaborators at all of the places listed on here: BYU, Texas, Minnesota, Virginia Tech, Imperial College London, and Health Canada.
This is a big group of people working on this, and we need to be able to communicate both internally and externally in a way that is efficient. And what we do in CACES, because I figure most of you are not familiar with it, is again basically a big project that has to do with air pollution health and air pollution policy. We have four major goals. One is to measure air pollutant concentrations in different places across the country, to understand how pollution impacts various communities and vulnerable populations. That's the part of the work I'm most involved in. A second goal is to develop tools for scientists, policymakers, and citizens to understand air pollution from the national level down to the local level, and to be able to put costs to that: to assess social costs and future policy impacts. Some of those future policy impacts we're evaluating within CACES, right? You can imagine different scenarios for how we treat vehicles or how we treat electricity generation, and we're evaluating some of those policies within CACES. One of the big focus areas is actually food production and how that impacts air pollution. And then lastly, we want to get at health. We have epidemiologists on the team, so we're asking what the current health effects are and what the potential health effects would be under these future policy scenarios. To meet those four outcomes, we're organized into five different projects, and I'm not going to go into great detail on the projects. But basically two of the projects, project one and project three, are both about model building, and so we're building different types of air quality models. We're connecting those models to project two, which is the field measurements. So we're actually measuring air pollutant concentrations in different cities and tying those back to the models.
Those models feed into project four, which is the policy scenario application, and they also feed into project five, which is the epidemiology. One way that we often show this is this diagram that we call the Starship Enterprise. All of the arrows here are showing the connections between the projects, and the point is that there's a lot of interconnectivity. Project two, the measurements, really has to communicate with projects one and three. Projects one and three in turn have to talk to projects four and five. And all of that conversation has to happen within CMU and across all of these different institutions. So it's important to be able to do that communication in a way that is beneficial, so that we're not bogged down. We're also, in CACES, building a pretty diverse toolkit, and out of each of these projects we have tools that we want other people to be able to use. For example, in project one we're building mechanistic models. These are models that represent all of the physics and chemistry happening in the atmosphere. We have those at a high level of detail, and also what are called reduced complexity models, at a lower level of detail. We want to be able to use those within CACES, but we also want people external to CACES to be able to use those models. Similarly, we want people to be able to use the models from project three, and we want people to have access to the measurements that we're doing in project two. So when we think about how we need to share data and results, we really have three audiences, right? One is within CACES; we need to talk across all of those arrows that I showed a few slides ago. Another is that we need to communicate and share our results with academic and government colleagues. This is funded by the EPA, and the EPA is interested in the results.
And there are researchers at EPA who want to be able to use the tools we're generating. Similarly, people at other schools want to use the outputs, maybe without getting into the weeds. And then there's the broader public, and I'll show some examples from both CACES and my own work about linking and sharing data more broadly with the lay public. There are a couple of drivers for how we're approaching open data. One is that there is pressure from both government and journals to make data more available. It's no longer really acceptable to just put at the end of a journal article, "email us if you want the data," right? Maybe the appropriate people will still let you slide, but journals and funders want to see the data published somewhere more accessible than that. And a lot of lay people do want access to data. People are buying little air quality monitors, and they want to be able to access that data; I'll show you some examples of that. So one example is something called the RAMP. RAMP stands for Real-time Affordable Multi-Pollutant sensor package, and that's what's shown in the photograph here. The RAMP is a low-cost air pollutant sensor package that we've developed at CMU in collaboration with a company called SenSevere that spun out of CMU. What we've done since 2016 is maintain a network of these RAMPs. The map shows Pittsburgh; at our height we had about 50 RAMPs around Pittsburgh and some of the suburbs. It's shrunk a little bit, but we've maintained this network for the better part of four and a half years now. A lot of these RAMPs are located at people's homes. You may be able to see that this is actually a RAMP on someone's porch, right? That red beam is a support beam for someone's porch. Some of them are at schools. Some are at businesses.
Getting these out is very much a retail sort of enterprise. And often the people who host them want the data; they're hosting one because they're interested in this information. So we share the RAMP data in a few different ways. One is a report that we've come up with. The report has a set of figures in it that we've homed in on by working with people in Social and Decision Sciences at CMU, and also by talking with the various hosts to get a sense of what they want to see. And today or tomorrow we're going to roll out, publicly, a whole-network report that we will update monthly. So that's one data product we share. Another is that for a while we've been running a real-time map that, rather than showing a map with a bunch of dots on it, interpolates between all of the locations where the RAMPs are, in near real time. If you see the yellow box here, you can click around the map and get a sense of what the pollutant concentrations are. They're indicated by the color scale here; darker reds are higher concentrations. As you click to different places, this speedometer at the top moves to the right position, so you have multiple ways of interpreting the data. This is good for someone who maybe just wants a sense of what the air pollution in their neighborhood is right now: is it good versus bad? This is one way to get at that. And this is definitely something EPA was really interested in: how do you do this sort of communication with communities? This is one resource we've built toward that end. For the more hardcore users, we make the raw data available as well. The CREATE Lab at CMU, down here at this URL, has something called the Environmental Sensor Data Repository, or ESDR. ESDR mirrors a bunch of freely available air pollutant data sources, and we include the RAMPs in that. And that allows you to zoom in.
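The interpolated real-time map described above can be approximated with simple inverse-distance weighting between monitor sites. This is a hedged sketch under that assumption: the talk doesn't say which interpolation scheme the RAMP map actually uses, and the site coordinates and PM2.5 values below are invented for illustration.

```python
# Sketch: inverse-distance-weighted (IDW) interpolation of pollutant
# readings between fixed monitor sites. The real RAMP map may use a
# different scheme; the sites and PM2.5 values below are illustrative.

def idw(query, sites, values, power=2.0):
    """Estimate concentration at `query` = (x, y) from monitor readings."""
    num, den = 0.0, 0.0
    for (x, y), v in zip(sites, values):
        d2 = (query[0] - x) ** 2 + (query[1] - y) ** 2
        if d2 == 0:
            return v  # exactly on a monitor: return its reading
        w = 1.0 / d2 ** (power / 2.0)
        num += w * v
        den += w
    return num / den

sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical monitor x/y
pm25 = [8.0, 12.0, 10.0]                        # hypothetical ug/m3
print(round(idw((0.5, 0.5), sites, pm25), 2))
```

The estimate at any point is always bounded by the nearby readings, which matches the "is my neighborhood good or bad right now" use case better than extrapolation would.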
So if you go to this URL and type "RAMP" into the search box in the top left, you get everywhere the RAMPs are or have ever been, and then you can click around and see what data are available. I've clicked on this one in downtown Pittsburgh. It shows the various pollutants that are available, and the one with the check mark is then plotted, up to the most recent 15-minute period; we update this data set every 15 minutes. So when I pulled this data down in July, it showed up to that 15-minute period, and you can get a sense of what the data are in near real time. You can also download the data from here. There's an export data button, so people can download the data and then look at it on their own, offline, if they like. I would say I am probably the number one user of this, because if I want to quickly look at data from the RAMPs, I will go here instead of emailing my technician to pull it off of our server. But I think a lot of citizens are interested in having the data available at their fingertips, even if they go to this and maybe find it a little bit intimidating. It is something we make available if people want to use it. So those are examples of sharing with a lay public. For sharing within the scientific community, we've shared a few data sets with KiltHub, which is the open data repository available through the CMU libraries, and the libraries have been a really huge help in getting this up and running. Here's just one example. This is the raw measurement data that we collected as part of CACES. KiltHub is great because there's a DOI associated with this, so in the paper we can just put: here's the DOI with the data. We can do updates, so you can see here that this is revision one of this data set. And it really helps me, because rather than someone emailing me five years after the paper is published to say, hey, where is that data?
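Once exported, a time series like this is easy to summarize offline. A minimal sketch, assuming the export is a CSV with an epoch-seconds `time` column and one column per pollutant; the actual ESDR export format may differ, and the readings below are made up.

```python
# Sketch: summarizing an exported monitor time series. Assumes a CSV
# export with an epoch-seconds "time" column and pollutant columns;
# the actual ESDR export format may differ. Values are illustrative.
import csv
import io
from datetime import datetime, timezone

sample = """time,pm25
1625097600,9.4
1625098500,11.1
1625099400,10.2
"""

def daily_mean(csv_text, column):
    """Return (ISO date of first reading, mean of the given column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    vals = [float(r[column]) for r in rows]
    start = datetime.fromtimestamp(int(rows[0]["time"]), tz=timezone.utc)
    return start.date().isoformat(), sum(vals) / len(vals)

day, mean_pm25 = daily_mean(sample, "pm25")
print(day, round(mean_pm25, 2))
```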
This data set was put up by Peishi Gu, whose name is right there. Before he left, when he finished his thesis, I said, hey, I need you to get this data set together so we can put it up online. And then down the road it's archived. I don't have to worry about someone coming along in a few years and then me hunting him down, or hoping that I can find the files on my computer, which is probably a hopeless cause. So we've done that. We've also published data analysis tutorials. These are for the low-cost sensor data; we've written a few papers on how to treat data from the sensors and how to calibrate them. Carl Malings, who was a postdoc with us here at CMU, put these tutorials up on Zenodo, which is an open-access data repository. He did a really great job: there are videos that go with them, there are data sets that people can download, and this is a resource that we can reuse. Going forward, we have a project where we're going to be doing sensor deployment and training in Africa, and we'll use this resource. Then lastly, I talked about CACES and the various models we're building. If you go to the CACES website, which is caces.us, you can actually download the outputs from those models. The reduced-form models and the statistical models that have been built as part of CACES are available, and we have a lot of detail there. You can download the model outputs at the state level, the national level, or even the census block level, and you can do it for different years. All we really ask for is people's email addresses, so that we have some sense of whether people are from academia, government, or the private sector. To date we've had about 1,500 unique downloads, which I think is pretty successful. So then, just lastly and real quick, my closing thoughts.
I do want to point out that I'd be interested to hear, through the course of this workshop, ways to make getting data online a little more streamlined. I'm really glad that we've done it, but it does take a concerted effort: hunting my student down and making sure we get things up on KiltHub before he graduates; doing the metadata takes some time. And obviously we're in a pretty niche market here. I'm not expecting 100,000 visitors a day; we're putting in this effort for a pretty small population, but I think there's a lot of value in doing it nonetheless. So I'm there, and I guess I should unshare at this point. Oh, that's after, you can, yeah, right. I guess while we have questions coming in, we can have the next speaker set up. So audience, do you have any questions for Al? I guess I'll start with a question. Al, it's really fascinating work, seeing you pull all these data sources together and share with all these different stakeholders; a huge amount of work. From your talk it seems like you're sharing your resources in different places: your dashboard, KiltHub, and Zenodo. Do you have a consolidated view of how much the data is used overall, across all the different platforms? No, I don't. The only place where we're really tracking use is on the caces.us website. Okay. Everywhere else we don't have active tracking going on, so it's more word of mouth, people telling me that they happened to download it. Okay, yeah. The reason I ask is that I often see the same problem with researchers sharing their resources without a really good way of tracking the usage; I think a lot of people doing the same work might have the same issue. So, I think KiltHub logs how many views and downloads there are, but I'm not doing anything to keep track of that. Unless I happen to log in and check, you know, I'm not cataloging that anywhere. All right, sounds good, fair enough.
So are there any questions from the audience? We have a manageable-sized audience, so if you would like to speak up, feel free to raise your hand and unmute yourself. All right, so, sorry, that was, um, yeah. So I guess now, thank you, Al. We'll invite you back for the panel discussion. Right now let's move on to the next speaker. Our next speaker is Varsha Khodiyar; sorry if I'm not pronouncing the last name correctly, Dr. Varsha. All right, thank you. She has nearly two decades of experience as a research data curator, having worked as a curator for the Human Genome Project and the Gene Ontology project. Varsha's publishing career began with a short stint at F1000Research, followed by over five years as an editor for the Nature Research journal Scientific Data. Varsha leads the team of data experts delivering the Springer Nature research data support service, and also contributes to the design, development, and delivery of the Nature Research Academies data training workshops. She curates and maintains the Springer Nature recommended repository list, and is an executive advisor of FAIRsharing.org, a member of the CODATA International Data Policy Committee, and program chair for the Better Research Through Better Data conference series. Varsha, good to see you here; you have the floor. Thank you very much. Can I just check, is my screen being shared? Yes, you can see my slides? Yes. Fantastic, okay. Thank you so much for inviting me to present. I'm really pleased I could finally make it, although not to Pittsburgh in person, but I'm really pleased to be here. So I just wanted to tell you a little bit about the kinds of things that we're doing with regard to data stewardship and data curation as part of the publication process. So let me just see if I can make my slides move on. There we go.
So this is a graphic that, hands up, I didn't produce myself; it's a very useful graphic produced by my colleagues at Springer Nature, and it really encompasses why we think open science is important and why sharing research data is important. We try to think of it in three different buckets, if you like. First of all, let's take the benefits for the individual researcher. If you share your research data, you're building your reputation as someone who understands what they're doing. Because you're sharing your data, people can look at your data sets and have those proper, robust scientific interactions with you, and you're basically saying you have faith in what you're producing as a scientist, because you're open to sharing your data and asking people to assess the work. That's exactly what we do in science: we want to make sure that we replicate each other's work, so that the sum of what we know to be true can be agreed by more than one lab. That's the ideal. We know, for example, from a recent study that came out earlier this year, that research papers with clear links to the underlying data sets shared in a repository seem to have an uplift of about 25% in terms of citations. They looked at, I think, just over half a million papers; this was a really big study, and it's quite a significant result, showing that by sharing your data you are making your work easier to use, and presumably that's why we see the citations.
The next group we think about is the research community as a whole. If you're able to interrogate each other's work, check that it's valid, and replicate results, that's really useful, but also think about reducing the need to rerun experiments that have already been carried out by your colleagues or by your collaborators elsewhere. As we saw with Al's excellent talk just previously, sharing that data and those resources then enables training of undergraduates, or of researchers in other countries; I think that's a really important thing to bear in mind. And then we think about the third bucket, which is the benefit to wider society. Bearing in mind that most researchers are either publicly funded or charitably funded, we've got the lay public putting money in towards research, and so we would expect that research to have some benefit for the public. And in this very difficult situation in many places around the world, where fact and opinion are quite often used interchangeably, we need to make sure that we are sharing our research data so that the citizens of the world can actually interrogate it, to help make their own minds up, start separating fact from opinion, and understand the benefit of the research process and the scientific process. We want research data to inform public policy, but we also know that there are economic benefits of sharing research data more widely. I think a figure that's quoted for the data shared after the Human Genome Project is something like 15 billion over so many years; we've seen an economic benefit from having those data sets shared. So, in order to understand researchers' perspectives on data, we at Springer Nature carry out surveys; we have carried out annual surveys over the last few years. I should say all of these data sets, and all of the white papers that have been produced, are openly available
on the figshare repository. And just to also let you know, there'll be another State of Open Data report coming out, the State of Open Data 2020, which will be released in, I think, the third week of November, so please do look out for that. But let's delve into what we see here. Looking back over the last couple of years of data that we have, 2019 and 2018, we asked researchers what circumstances would motivate them to share their data, and this is what we found. The biggest reason that people report as a motivation for sharing data is increasing the impact and visibility of their research. The second-highest response was public benefit, and thirdly, the most important thing for researchers was getting proper credit for sharing their data. As Al mentioned just previously, the fourth is journal and publisher requirements, and funder requirements we can see further down the list. Absolutely, journals, publishers, and funders are increasingly asking for research data to be shared, for all of the reasons I explained previously. We've also asked researchers what makes them not want to share data, or what problems and concerns they have with sharing data sets. What we see is that the biggest concern researchers have had, for both of these years running, is concern about misuse of data. Second to that is people being unsure about copyright and licensing; they don't necessarily know how to share their data or what license to apply to the data set. And you can see here another really big concern: not receiving appropriate credit or acknowledgement for having shared their data. Again, as Al just mentioned, it's an effort to share data properly and to make sure it's understandable, and there is, for very understandable reasons, a barrier when people think, well, if I'm putting all this effort in, what's going to be the benefit for me? Am I even going to be recognised for this? And I think there's still work to be
done, from institutions for example, on recognising when researchers share their data. But what can we as a publisher do? I'm speaking from my viewpoint at Springer Nature, one of the largest publishers working on scientific journals: what can we do in terms of credit and visibility? I'm sure many of you are aware that there is such a thing as a data publication, a data journal, or a data paper, and I'd like to talk about two such journals today. The first is Scientific Data, which is a Nature Research journal. The data papers published in Scientific Data are called data descriptors; that's the name of the format. Scientific Data's data descriptors are sent for peer review, and we expect peer reviewers to look at the data as well; I'll talk more about that in a couple of minutes. The scope of Scientific Data is sound science: the research must be carried out well, but there's no requirement for novelty or perceived impact at the point of publication. It's basically based around sound science. The important thing at Scientific Data is the emphasis on data reuse, and again I'll talk about that in a few minutes, but the data descriptor should make it easy for people to reuse the data set being described. The second journal I'd like to talk about today is BMC Research Notes, which has a data paper format called the data note. This is intended and designed as a short format; again, the focus is sound science, not perceived novelty or impact, and the emphasis is on data sharing, making it quick and easy for people to share their data and then, for both of these journals, to get credit in the form of a citable, peer-reviewed paper. So, just thinking about peer review at Scientific Data, what is it that we're looking for? What are peer reviewers asked to do? We can think of this in three distinct categories. The first is experimental rigour and technical data quality. Reviewers are asked: are the data
produced in a sound manner? Does it make sense in terms of the experiments that were conducted? Do the data look like you would expect? What is the quality of the data like; for example, have they provided appropriate statistical analyses? And what is the experimental rigour: is there appropriate depth of coverage, and have they used the right kind of controls that you would expect for the kind of experiments they have described? Secondly, we ask reviewers to consider the completeness of the description. Have the authors provided enough detail to allow others to reproduce the steps or to reuse the data? Bear in mind reviewers are looking not just at the information in the manuscript; they'll also be looking at the information that has been shared at the data repository where the data are held. We ask reviewers to comment on whether the way the data have been shared is consistent with the minimum reporting standards relevant for that type of data set. In some fields there are very good minimal reporting standards, and we expect data sets to adhere to those where available. Thirdly, we ask reviewers to consider the integrity of the data files and the repository records. Do the data files appear complete and match the manuscript description? So if the manuscript mentions that this data set covers six patients, can you see, at the repository, data relating to those six patients? Are there data from six people, or are there data from five or seven? Does it all match; does it make sense? And importantly, we also ask peer reviewers to consider whether data have been archived to the most appropriate repository for the type of data shared. Again, in some fields there are very clear community mandates and community norms which insist that data should be shared at a particular repository; an example is that genetic sequence data should always be shared at an INSDC (International Nucleotide Sequence Database Collaboration)
repository. Just excuse me one minute. So, what else can we do as a publisher? As was mentioned in my introduction, we do provide some training and advice to researchers. One of the things we do is run the research data helpdesk; the email address is there on the screen. This is run by myself and my colleagues in the research data team at Springer Nature, and between us we have expertise in data curation, management, archiving, digital preservation, copyright and licensing, and publishing. We can provide guidance and help with writing data availability statements, and we have contextual examples across multiple disciplines. We also maintain a recommended list of repositories. The idea is that if you want to share your data and you're not sure where you can share those data, you can go to that repository list and get an indication of which repositories you might want to start looking at. The other important thing to bear in mind is that we support the use of community and discipline-specific repositories over and above generalist repositories. We want data to be in a repository, but ideally data should sit with like data: we want genetic sequence data with other genetic sequence data, because it just makes it easier for data to be discovered and used. We have developed a series of research data training workshops and modules which we deliver as part of the Nature Research Academies. We can do this in person, but obviously this year all of our training has been over the web; we have developed a series of webinars and in-person training that we can deliver to institutions and funders, to help develop researchers' data sharing skills. And then we're also looking at ways in which we can help researchers to share their data at the point at which they're publishing their articles, and this is our service of research data support.
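The third category of review checks described a moment ago, whether the deposited files match the manuscript's claims (six patients should mean six subjects' records at the repository), lends itself to a simple consistency script. This is an illustrative sketch only, not the journal's actual tooling; the column names, file contents, and claimed count are all invented.

```python
# Sketch: checking that a deposited data file contains records for the
# number of subjects the manuscript claims. Illustrative only; curators
# and reviewers do this kind of check, not necessarily with a script.
import csv
import io

manuscript_claim = {"subjects": 6}   # hypothetical: stated in the manuscript

deposited = """subject_id,measurement
s1,0.91
s2,0.87
s3,1.02
s4,0.95
s5,0.99
s6,1.10
"""

def check_subject_count(csv_text, claim):
    """Return (matches_claim, number of unique subjects in the file)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len({r["subject_id"] for r in rows})
    return n == claim["subjects"], n

ok, n = check_subject_count(deposited, manuscript_claim)
print(ok, n)   # → True 6; five or seven unique subjects would fail
```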
Currently our curators, our data stewards, work on the two journals I mentioned previously, BMC Research Notes and Scientific Data, making sure that those manuscripts are consistent with what we see at the repository, making sure the data sets are shared accurately, and making sure that the links to the data are accurate and consistent, so that once the paper is published, people will actually be able to go through to the right data set. We've expanded this out, as a service, as research data support, to what we call the stewardship service. Again, that is looking at the whole manuscript, helping researchers to identify data sets that could be shared, that should be shared in some cases, and helping researchers to create very rich data availability statements. So we are working with three npj journals; I should say we just had another one start last week, so I should have updated these slides. We're working with the Hormel Institute on npj Precision Oncology, with the Breast Cancer Research Foundation on npj Breast Cancer, and with RMIT University, which is based in Melbourne, on npj Urban Sustainability. We also have a partnership with Wellcome, where we can provide research data support for all Wellcome-funded researchers, no matter where they're publishing. So what do we do when we're looking at these papers? What do stewardship and curation actually look like at the publisher level? Among the potential outcomes of data stewardship, we might find that there's information missing, either in the manuscript or at the data repository, or we might find errors in either of those places, and we would flag these to the authors. The idea is that these errors can be fixed and additional information added prior to the manuscript being published.
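The "information missing at the repository" check just described can be pictured as a small completeness pass over a repository record. A toy sketch of the idea only: the required fields and the draft record below are invented, not Springer Nature's actual checklist.

```python
# Sketch: flagging missing pieces of a repository record before
# publication. Toy illustration of the stewardship checks described;
# the required-field list and the draft record are invented.

REQUIRED = ["title", "authors", "license", "description", "files"]

def missing_fields(record):
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED if not record.get(f)]

draft = {
    "title": "Example sensor data set",
    "authors": ["A. Researcher"],
    "license": "",            # empty: needs a licence before publication
    "files": ["data.csv"],    # no description provided at all
}

print(missing_fields(draft))   # → ['license', 'description']
```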
The team also provides suggestions to increase the FAIRness of the data, and apologies if you're not sure what the acronym FAIR means; I'm happy to talk more about that in the question session afterwards, but essentially FAIR stands for Findable, Accessible, Interoperable, and Reusable, and we want to increase those four elements of the data set. So we might provide suggestions and guidance on the repository metadata, on the data access conditions, or on the data licensing conditions that someone has used. An example of that: people might be used to working in a clinical environment, where they know they can't necessarily share their data openly; when they move to doing an experiment in mice, they may just need reminding that those mouse-based data can actually be shared openly, as there are no privacy considerations. We provide guidance on any elements of the manuscript: the tables, text, and figures. And we can also provide guidance on making sure the data file names make sense and are comprehensible to people outside of that research group, or indeed on how the data files are structured at the repository. So, in terms of increasing the visibility of research data, we can think about it in two different aspects. Thinking simply, first of all, let's increase the visibility of research data for human readers. We have a series of four research data policies at Springer Nature; we were the first publisher to implement this level of data policy, and the idea is that we can encourage our journals to move from data policy type one, which simply says you are allowed to include a data availability statement and to mention research data, all the way to something like data policy type four, of which Scientific Data is an example, where data must be made available for peer review and data are expected to be peer reviewed. So what do we see as a result of this kind of policy, as a result of the research
data support work? We see rich data availability statements, and this is just an example: we can see that there's very clear information about how the data were generated, where they can be found, and indeed that some of these datasets are available on request while others are openly available. We work with the authors to make sure that data described as available on request really are available on request; there might be very good reasons for that, especially in clinical medicine research, but where data can be shared openly, they are shared openly. We also make sure that datasets are cited and properly, formally referenced in the reference list. In the example I have here, we have two datasets referenced, one at the NCBI Sequence Read Archive and another at the figshare repository, and you can see these are simply numbered in line with the third reference, which is a standard literature reference to an article. The idea of doing it this way is that we're saying there is no difference between citing data and citing articles; both are equally important outputs of research. The other part of what we're trying to do is increase the visibility of research data to machines, because that is what's going to make it possible for data discovery to really take off. For the journal Scientific Data, we create machine-accessible metadata for every data descriptor that is published, and we use community-endorsed ontologies and controlled vocabularies to facilitate the machine readability and accessibility of these metadata. We also participate in the Scholix project, a cross-publisher initiative intended to bring out where data are mentioned in a research article and to make sure that the data are tagged in a way that makes it easy for machines to find those datasets. I'm very happy to take questions, and thank you for listening.
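As a hedged illustration of the kind of machine-accessible metadata described in the talk, here is a minimal schema.org "Dataset" record serialized as JSON-LD. Every name, DOI, and URL below is an invented placeholder, not a field from any real data descriptor; it only shows the general shape such a record can take.

```python
import json

# Minimal schema.org Dataset record as JSON-LD; all values are made-up
# placeholders, shown only to illustrate the structure.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example air-quality measurements",
    "identifier": "https://doi.org/10.0000/example",  # placeholder DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data.csv",  # placeholder URL
        "encodingFormat": "text/csv",
    },
}

print(json.dumps(record, indent=2))
```

Records like this, embedded in a landing page, are what allow crawlers and data-discovery services to find datasets without parsing human-oriented prose.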
Yeah, thank you very much, Varsha, that was fantastic. We do have a question for you from Alex: when you did the survey, did you break down by clinical versus non-clinical researchers? Because in her experience, non-clinical researchers are much more interested in data sharing. Yes, so the surveys are very broad, ranging across all disciplines, and if you delve into the data there's demographic information within that, absolutely, which will include things like geographical location and career stage as well as what discipline a researcher is working in, and there are some very interesting insights in there. All right, thank you very much; for the sake of time we can save more questions for the panel. Let's move on to the next speaker, William Thompson. Dr. Thompson's research focuses on temporal network theory, cognitive neuroscience, and, ever increasingly, metascience. He is currently a postdoc at the Karolinska Institutet in Stockholm. Regarding metascience, Dr. Thompson's main interests are how scientists do research, both with regard to their communication of results and their hypothesis making, so this talk will be about his concerns regarding a sequential multiple comparison problem for dataset reuse. William, take it away. Okay, first, can you all hear me properly? Yeah. And can you see my screen? Perfect. And it's moving when I move to the next slide? Yeah, perfect. Well, thank you all for inviting me here. There were two really great and positive talks before me, and hopefully I'm not too negative; that's not my intention, but it may sound a bit negative for a while. Most of what I'm going to talk about has relatively recently been published in eLife, and reflects my work during
my postdoc at Stanford in Russ Poldrack's lab, if you want to go into a bit more detail, because some things may go a bit quick during this talk. The take-home message is that there can be some fiddly consequences of really well-intended proposals for improving scientific practices, and at times we need to reflect upon how we can best solve any new problems that arise. The reason I think this is important is that I'm still a relatively young researcher, and the amount of positive change regarding open science has been immense over recent years; it's really important to make sure that we don't accidentally create some other problem, one that causes a lot of damage by the time we come around to solving it. The argument I'm going to build within these 15 minutes is about reusing open data, which I'm sometimes going to call sequential analyses, and we just heard a lot about how important reusing open data is becoming: if we reuse open data for statistical inferences or confirmatory studies, this could lead to an increase in false positives if we're not careful. I just want to note that, even if everything I say is true, there are still a lot of good things we can reuse open data for; I'm totally in favour of the reuse of open data, and I don't want to be misunderstood on that. So let's take a step back. I said this was about statistical inference, so what is the point of statistical inference? This is one nice thing about everyone being muted, because when I give this talk in person somebody usually objects to this one sentence and I always have to tweak it, but basically we collect some kind of
sample, and with that sample we try to infer properties about a larger population. Those properties could be whether something is statistically significant, or how confident we are, how credible it is, and so forth, but we are trying to learn something about a wider group or population from some kind of empirical sample. That is a generalization; I know there are other things people could argue statistical inference is for. Within this, especially once you start talking about statistically significant types of results, or trying to make some kind of binarization of whether a result is true or valid, there's going to be a trade-off between how many false positives and how many false negatives we find, and there are lots of discussions regarding things like Bayes factors and p-values, which are supposed to be thresholds for setting this trade-off level. Okay, so in a traditional empirical analysis, what we do is get some data, analyze the data, and publish an article with the analysis in it. But we often do multiple tests per dataset: we get some data, we do multiple tests, and we publish these together in an article, or maybe multiple articles. Sometimes when you publish multiple results in an article you do something called correcting for multiple comparisons. What does this mean? Some people may think this is very obvious, but the more comparisons you do, the greater your chance of identifying false positives. So in multiple comparison correction, you try to restore this trade-off, to get the balance of false negatives and false positives that you want, because otherwise you're going to have a greater chance of identifying false positives. There are multiple procedures, like Bonferroni
correction or false discovery rates; those are the popular ones a lot of people have heard of. The consequence of no correction is illustrated by a really interesting example from Bennett et al. in 2009: they put a dead salmon in an MRI scanner and measured the fMRI activity from it. There are several thousand voxels, small cubes, in the data, and if they didn't do any multiple comparison correction, what they found was brain activity in a dead salmon. This is why it's really important to do multiple comparison correction; as they argued, if we as scientists are advocating for brain activity in a dead salmon, something seems a bit off. So when we do these multiple comparison corrections, which comparisons get grouped together? This is where it gets really fiddly in the statistics, and if I had lots of time I would talk about this for about 40 minutes. We can consider error rates, this idea of how many false positives we are willing to accept, and relate that to individual tests; in that case we're not doing any kind of multiple comparison correction, and that's the case where we'll get brain activity in a dead salmon, so a lot of people advocate that this is not what we should be doing. Other people, back in the 50s I think it was, said that the entire experiment should be corrected for, so you should do some kind of multiple comparison correction across all possible statistical tests you do. Most people feel that's too extreme, and generally it's considered to be somewhere in between; some people argue it's actually very close to one end of the spectrum.
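The dead-salmon point above can be sketched numerically. This is a hedged toy simulation, not the original analysis: under a true null hypothesis, p-values are (approximately) uniform on [0, 1], so with thousands of tests and no correction you reject a predictable fraction of them by chance alone.

```python
import random

random.seed(42)

# Simulate 5000 tests where the null is true everywhere (think: voxels
# in a dead salmon) and count rejections at alpha with and without a
# Bonferroni correction.
n_tests = 5000
alpha = 0.05
pvals = [random.random() for _ in range(n_tests)]

uncorrected = sum(p < alpha for p in pvals)           # roughly alpha * n_tests
bonferroni = sum(p < alpha / n_tests for p in pvals)  # almost always none

print(uncorrected, bonferroni)
```

Every uncorrected "hit" here is a false positive; the Bonferroni threshold of alpha divided by the number of tests restores the family-wise error rate at the price of statistical power.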
Some people sit over here on that spectrum, and some over there; I'm going to present an argument for why it tends more in this direction (that's supposed to be an arrow). I'm not going to provide a definition of what a statistical family is, but it's supposed to be a set of related tests, and it sits somewhere along here; it's a big debate in the statistical literature if you dive into it. Generally there are some consensus definitions you sometimes find, and it usually gets divided into two groups. Tests are considered the same family if you're doing data dredging, and data dredging is what exploratory analyses are sometimes called, or rather doing statistical inference with exploratory analyses; so if you're doing exploratory analyses, then it's all considered part of the same family, because you're correlating anything with anything without any hypothesis. The second group is if you're doing confirmatory analysis on a similar research question, and this is also a bit vague; you can take this definition and still put yourself anywhere on the spectrum. But what I want you to take from this is that, regardless of the definition of family, the next argument I'm going to make about open data holds; it doesn't matter what the definition of family is, but we do need to know that there is such a thing as a family. So now I'll get back to where we were; that was a long side note, and I'll stop talking about statistical families for a teeny bit and go back to these nice pretty iconic figures. This is what I showed you before: we have data, we have analyses, we have one person presenting these two. Let's now imagine these are in the same statistical family and they're correcting for these two comparisons in this article, which is what you often see. Now let's consider a new scenario where the data are open: we've got data, we analyze the data, and this article
is published. Secondly, the data also now go on a server, as we've just heard about, with a data descriptor and all that; somebody else analyzes the data and publishes this second comparison. Before, these were considered part of the same statistical family, so is there anything now that has made them different statistical families? If we look at this, we need to see what is different between these two scenarios, and there are only two real differences. One is the timing of the tests: these two are done simultaneously in one paper, whereas in the other case they're done separately, with a time lag between them. The second is who does the test: here I've written one author, called Ashley, and here I've got Ashley and Blake, because somebody else has downloaded and reused the data that's on the server. Those are the two big differences, and if these are considered the same statistical family in the first case, then if you don't need to correct in the second, one of these two things must be sufficient to create a new statistical family; and if they are the same statistical family and there is no correction, then we're going to be increasing false positives. That's the point of this argument. I'm now going to do something called reductio ad absurdum, where I assume each of these is true in turn and show that it's absurd. First, if time creates a new statistical family, then a genuine way we could correct for multiple comparisons is just to wait the x number of days needed to make a new statistical family. So if I scan a dead salmon, all I need to do is analyze one voxel, wait x number of days, analyze the next voxel, and eventually I can present
statistically significant results showing brain activity in a dead salmon that I've "corrected for by waiting". That sounds absurd, which means time can't create a new family. Second, if the person matters, we can think of a similar scenario: instead of waiting, all I need to do is get more people to analyze each voxel. I can go to a crowdsourcing site like Mechanical Turk, get one person to analyze each voxel, then collect all this information, present the results, and say I've corrected for these. If I do that for the dead salmon, I can say I've got statistically significant brain activity in a dead salmon, corrected for multiple comparisons by crowdsourcing. Neither of these seems a good way of correcting for multiple comparisons, so the argument's conclusion is that there's nothing about the sequentialness of sequential analyses that generates new statistical families. If you're arguing for very small families, this may not be a problem, but if you're arguing for bigger families, then this can become a problem if we're reusing data a lot. So let me talk briefly about how much of a problem this is. This is on some empirical data from the Human Connectome Project. Let's imagine that one group did 182 simultaneous analyses, all in one paper, and alternatively that 182 different groups each did one analysis and left it uncorrected; there were 68 variables corrected for within each analysis. Ignore the bottom part of this figure, which belongs to another part of the paper; it's only the top part, the green bars, we need to care about. These show the number of findings we get if we correct for all the analyses, all 182 times 68 variables, doing them all at
once: we get two statistically significant findings. If 182 different groups reuse this data, we get considerably more, over 30 or over 40 depending on how you correct within each study. So you can get a lot more positives, and these have a much higher chance of being false positives, because that trade-off I was talking about between false positives and false negatives is now biased more and more towards false positives. So is this a problem? I'm going to spend the last couple of minutes on how this could be a problem before I get to the solutions. Surprisingly, and this is a really interesting part, is what happens when we integrate this with other solutions people have proposed for science. A lot of people want to distinguish confirmatory analyses from exploratory analyses; the most famous proposal is probably Wagenmakers et al. in 2012, who argued that pre-registration before the experiment is how we can know something is confirmatory. The rationale is that this is the only way for the scientific community to be sure a study is confirmatory, because otherwise people could be doing exploratory things and masking them as confirmatory; that's why pre-registration is necessary to demarcate confirmatory from exploratory work. But if we accept this, and this is where it gets interesting, then reusing unrestricted open data has to be considered exploratory, which means it's a very big family. Why do I say this? Well, if it's unrestricted open data, then we don't know when the researchers actually downloaded it, so even if they pre-registered their analysis, this becomes problematic because we don't know whether the pre-registration was made after they obtained the data.
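The 182-analyses comparison above can be sketched with synthetic nulls. This is a hedged toy re-creation, not the actual Human Connectome Project analysis: with every null true, compare one group that Bonferroni-corrects across everything against 182 reusing groups that each correct only within their own 68 tests.

```python
import random

random.seed(7)

# 182 analyses x 68 variables, null true everywhere.
n_analyses, n_vars, alpha = 182, 68, 0.05
pvals = [[random.random() for _ in range(n_vars)] for _ in range(n_analyses)]

# One group, one paper: correct across all 182 * 68 tests.
joint_hits = sum(p < alpha / (n_analyses * n_vars) for row in pvals for p in row)

# 182 reusing groups: each corrects only within its own analysis.
separate_hits = sum(p < alpha / n_vars for row in pvals for p in row)

print(joint_hits, separate_hits)  # every "hit" here is a false positive
```

The jointly corrected count stays near zero, while the per-group counts add up across reuses, which is exactly the inflation the talk describes.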
So we're back to the same problem: we don't know whether the analysis is being masked as confirmatory, and the entire point of pre-registration, to differentiate the two, gets lost if we're doing confirmatory analyses on unrestricted open data. And restricting the open data I do not think is a good solution, though I could discuss that a lot more. Wrapping up now, before I get to some maybe positives: what this illustrates is how new, perhaps unexpected, challenges can arise from well-intended solutions, and both open data reuse and the confirmatory-exploratory demarcation are extremely good ideas. Open data has many uses, and this is not an argument against any type of open data, but we should be aware that it's not always a free lunch; just because things are open does not mean endless benefits. This also means we're not going to get some kind of ultimate dataset; data collection never stops, and we always have to collect more data. The main point here is that the reuse of open data entails an eventual decay in how much statistical inference and confirmatory analysis we can actually do on a dataset, because we may have to correct for how many times we do it. So let me mention a couple of solutions to this problem; how can we go on from here? One interesting thing we can do is think about how to get around this problem with pre-registrations and confirmatory analyses. We heard in the last talk, a really good talk, about data descriptors in Scientific Data and so forth, so one thing we could do when we publish these data descriptors is have a grace period where people can pre-register their analyses before the data are released. By doing this we make sure that the demarcation between confirmatory and exploratory research is at least maintained for that grace period; we can be sure about everything that is pre-registered in that grace
period before the data are made fully open, because we have a description of the data sufficient to write the pre-registration. That could be a partial solution to mitigate the problem, but as soon as the data are released, the problem is back. We could also try to do better sequential correction, which is discussed more in the paper. Or we can justify our families a bit more: we can motivate why our statistical families should be very small, if necessary, and that's a way around it as well. Something you see a lot in machine learning is held-out data on a data repository, data that the researchers don't have access to; but in many fields, for example in neuroimaging where I spend a lot of my time, we have very small datasets, so held-out data is quite hard to arrange. If we do have held-out data, there are some really interesting techniques in machine learning for getting reusable test data, and if you're interested in this I really recommend checking out this Science paper, which is really good. Another interesting thought is that maybe the reuse of open data can be exploratory, and we don't have to do statistical inference on exploratory work; we can be happy that we're trying to understand the data, not generalize to an entire population. Maybe that's what open data should be reused for: to make better hypotheses that we can then test with confirmatory analyses when we collect more data, and release that as well. So there is a lot of use for open data; maybe we should just embrace exploratory analysis a bit more, and I quite like that one. So this was my take-home message: I've hopefully convinced you, with this one specific example, that we can sometimes get fiddly consequences from really well-intended proposals, and we should reflect, not
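One of the mitigations mentioned above, better sequential correction, can be sketched very simply. This is an illustrative Bonferroni-style scheme, assumed here for illustration rather than taken verbatim from the talk: the k-th reuse of a shared dataset tests at a stricter threshold, so the error rate stays bounded as reuses accumulate. The function name and the alpha/k schedule are my own choices.

```python
def sequential_threshold(alpha: float, k: int) -> float:
    """Illustrative threshold for the k-th analysis of a shared dataset:
    a Bonferroni-style 'debt' that tightens with every reuse."""
    if k < 1:
        raise ValueError("k counts analyses, starting at 1")
    return alpha / k

# The first analysis tests at 0.05, the second at 0.025, and so on.
thresholds = [sequential_threshold(0.05, k) for k in range(1, 6)]
print(thresholds)
```

The practical cost is the same as any correction: later reuses of the dataset need larger effects or more data to reach significance, which is one concrete sense in which an open dataset "decays" for confirmatory use.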
always but sometimes, take a step back and think about how we can best solve these new problems before they become substantial. So thank you for listening, and thank you to the co-author of this paper and to the Knut and Alice Wallenberg Foundation for financing my research. Thank you. Thank you very much, William; this is fascinating work, really mind-blowing to think about while we're talking about data reuse: like many great things, it's a double-edged sword. For the sake of time, and since I don't see any immediate questions coming in, I would now like to invite all speakers back to the spotlight and we'll have questions for all three of you. Feel free to raise your hand if you have questions. I think I've seen, from a previous session, a question for Varsha, though maybe it's relevant for all of you; it's from Lauren. Lauren, do you want to go ahead and unmute yourself? Sure, thank you. My question was about peer review for datasets. I've heard a lot of talk about how difficult it is to find peer reviewers for datasets, and Varsha, you were talking about this specifically; I just wondered how your experience has been with that, what has been your approach to finding peer reviewers for datasets, and, for everybody else, do you have any guidance, best practices, or wish lists for peer review of datasets that you can share? Thank you. So in our experience, it's in relation to a manuscript that we're asking people to look at a dataset; that's the first thing to bear in mind. We're not asking people to look at the dataset independently of the manuscript, so I would say it fits into the known workflow of reviewing a manuscript. What we do is include those questions I shared in the reviewer form
or the reviewer guidance for Scientific Data, so we're guiding our reviewers to look at the dataset. We can't make anyone look at the dataset, I should say, and sometimes we get reviews back where it's very clear that the reviewer has not looked at it. I think that's where my team comes in, the team of curators: the work we do on a manuscript actually takes place after the paper has been accepted, after the peer review process has completed, and that's when we're doing these checks, establishing that the data actually match what's written in the manuscript and making sure that the links go to the correct dataset. There have been instances where we have found errors; people are human, they make errors, it's natural, but we're trying to root out as many of those errors as possible before the paper is published, so that it is as accurate as possible at the point of publication. The other thing I would say is that this year has been really interesting and difficult for journals, in the sense that we've had such an influx of submissions, mostly due to the COVID situation, that it's been increasingly difficult to find peer reviewers. I'm sure you'll all have noticed, if you are submitting papers, that peer review times have just gotten longer and longer. William, do you want to start? So, on that last comment: I've never been asked to review a dataset, but I have never said no to so many review requests as I have in the last month and a half. Yeah, difficult times. Our next question is from Amit. Hey, this is Amit from the materials science department; I'm doing my postdoc here at Carnegie Mellon. My question is for William. First of all, excellent talk; I really enjoyed the way you presented the dead salmon and how
it can lead to so many problems, especially the sort of problems we see in smaller datasets. What I'm trying to get at is: are you suggesting that more metadata is needed as the amount of data increases, to handle false positives? Because in small datasets, if we keep redoing the analysis, we're getting into that trap. But yesterday we had this whole discussion, and one of the comments, from Martial Hebert, was that a large amount of data basically balances out the errors. So in a domain like materials science, or the domain you are working in, where we don't have large datasets, we have this problem, but in a domain where we have millions of images, maybe we don't? Any thoughts? Thank you. Yeah, it's an interesting question. I would definitely say a lot of my talk is biased towards smaller-dataset research, for example neuroimaging, where we are most likely underpowered in a lot of studies, and I don't think that's unique to neuroimaging; there are several fields like this, where it's very costly to acquire lots of data. There are also a lot of researcher degrees of freedom that can go into analyzing these datasets; in other fields, as far as I understand, there are more streamlined analysis pipelines, so there are fewer degrees of freedom, and that means less chance of this. You all know the idea that a thousand monkeys on a thousand typewriters will eventually write Macbeth, or something like that; if we have a thousand doctoral students working on the same dataset with a thousand degrees of freedom, they're going to get statistically significant results eventually, and I think that's going to be less the case with larger datasets. So I agree with you:
it's less of a problem in larger datasets. Sorry, I think I've waffled a bit; have I answered your question? Yeah, thank you. Okay, great, I'll mute myself. Next, Alex has a question; would you like to unmute yourself and ask? Sure. So I actually spent some time last year reviewing the data sharing policies of various journals, and at Springer Nature, of those four levels, only level four actually requires data sharing of all of the authors. Policy levels two and three just require sharing for specific types of data, like sequencing or molecular structure data, and policy level one doesn't require anything, not even a statement of data availability. Springer Nature only has seven journals, I counted, that have adopted a level four policy, and casually checking one of them, it actually uses the level three policy requirement language. So there are virtually no Springer Nature journals requiring data sharing from most researchers; even heavyweights like Nature, which could presumably ask its authors to do just about anything to publish, are not requiring it. Why is that? Why are the journals so reluctant to make this a requirement? So, one thing to bear in mind: I've been at Springer Nature just over six years, and when I started we were the black sheep of the family, sitting in a corner doing something weird about research data; the rest of the business didn't really know what we were doing, but they tolerated us, is how I think of it. In the last six years we've moved from that point to one where research data is front and center, and it's a culture change: just as it's a culture change for researchers to get to the point where they feel comfortable sharing data, it's also a culture change for journals and editors to get to the point where they're requiring researchers to share data. The idea of the policy levels is to take
people through the research data sharing journey in a step-wise way, so that each step becomes a little bit easier: okay, you've got a level two policy, could you move to level three, what's the next step you'd need to take? I totally agree with you that there is a big difference between level three and level four as currently written and implemented. I would say it's a work in progress; we're in this new world together, and we are working towards a situation where we can keep increasing data sharing. I don't know if people on the call are familiar with an organization called the Research Data Alliance; it basically brings together people interested in research data from all the different stakeholder groups, and as part of that work there's a group working to create a cross-publisher journal data policy framework. Once that's been agreed, the idea is that each of the different publishers will use that same policy levelling, so it doesn't make sense for us to re-jig our policy levels while we're involved in that work; once it's been finalized, the intention is that we will implement that more step-wise levelling, to make it easier for journals to go from one level to the next. We are working hard with our editors and our journals to move them to the point where we can really embed data sharing into everything that we do, but it takes time, unfortunately. I think that RDA working group just published their finalized recommendations last month? Yes, absolutely, and my colleague Rebecca Grant is one of the co-chairs of that group; one of her roles in the company is to implement that policy. As I say, all of these things take time; it's not as if the idea gets published and then next week we'll have
that implemented; it does take time. Interesting, thank you. William and El, do you have something to add? Okay. All right, so I see a hand from Allie. Hi, yeah, so it's okay that William and El didn't comment on the previous question, because this one is a bit more for them. It was mentioned that there was a question yesterday about error propagation; that was me, guilty as charged. I was trying to express a concern, or at least a question, about the larger idea of modeling on models. I take the issue that William has raised very seriously; it's similar to a conversation I had with a statistician at Science a couple of years ago. Regarding the interaction between open science and these AI and ML tools, which we talked about a lot yesterday, my concern is about every step we take away from the real data that was originally collected on the ground. Instead of having all of our models built on, say, to use the air quality research as the example, the RAMP data, some kind of composite is formed from that, and then the next AI/ML tool is actually built on the composite rather than going back to the original RAMP data, and then we build composites on composites. I look at it really as a philosophy of science question, where we have to understand what these models are doing, and too often, unfortunately, we don't seem to. Today is more open science and open data focused, so if anybody has a comment on what we could do, either at the philosophy of science level or at the data usage level, since I am a librarian concerned with repositories and with making data reusable and useful: how can we guard against this? So I agree with what you're saying there. There's a whole community of people doing these low-cost air pollutant sensors, and one of the big topics of
discussion is openness and transparency in how we're doing the calibrations. For instance, some of our calibrations are non-linear, and people might naively assume that they're linear and can just be applied anywhere. If they're not, then you're taking this number and putting it out of context, both in time and in space. So I don't necessarily have any answers, but it is an active discussion in the community.

I can add to that. I think it's an excellent point you brought up, and I agree completely with it as well. All I can suggest, when you talk about philosophy of science and so forth, is this: I work a lot with network models, and there we create an abstraction from the concrete to the network. When people then start analyzing the network, there are two ways you can go. You can go more and more abstract, away from the concrete phenomena, or you can analyze the network in a way which is interpretable in terms of the phenomena you're analyzing. I'm not entirely sure how much this has to do with open practices; it's more to do with best practices for how to analyze empirical phenomena. But it's an excellent point, and I'm happy to talk about it as well.

Awesome, sounds good, thank you. So we have a next question, from Keith. Please go ahead.

Okay, thanks. I won't be too long, because you'll hear from me later in the day. I'm Keith Webster at Carnegie Mellon. I wanted to ask, predictably for a Scot, about the costs and infrastructure aspects of this, and maybe I could pose a question to Varsha, and then a second question on the same theme to Elle and William. The first question is about the costs of sharing or publishing data in Scientific Data. I know that you charge a transactional fee, and I'm wondering what that buys in the sense of long-term
accessibility. Is this a case of "I pay my $2,000 and you will look after my data and make it available forever and ever", who knows what that might mean? And then, secondly, to our research colleagues: what are your expectations of long-term access to your data, and whose responsibility do you think it is to cover the costs of that? Is it your institution? Is it a publisher? Is it your research funder? Should this be a common good that some intergovernmental organization should pay for?

Thank you for the question. So, to be clear, Scientific Data is a journal and not a data repository, and that's something that has been very clear with Scientific Data from the start. Scientific Data has some integrations, with figshare for example, for data sets that can't go anywhere else, but Scientific Data has always maintained that community-based repositories are the best location for data. We don't want to see sequence data in figshare; we don't want to see data that could be hosted in the excellent repository PANGAEA sitting in figshare. So what we're trying to do, where we can, is support the use of those community repositories. But not all communities have repositories, and not all disciplines have good repositories, and so we have collaborations with Dryad and with figshare, the generalist repositories that authors are very welcome to use. The APC doesn't cover anything to do with the storage of the data, unless the authors are using the integrated figshare, in which case it's subject to the standard figshare preservation, which I think is something like 20-odd years after the data set has been published, something like that. And publishers are well versed in long-term preservation; that's what we do, in terms of the scientific record, with articles. Even if a journal folds, the articles will remain accessible, and so we're trying to work with repositories to get them to a similar point. If I can answer your next question: I personally don't
think that we will, in a few years' time, be able to preserve all of the data that's ever been produced. We are going to have to start making decisions about what we...

Okay, so we have, yeah, this is probably the last question, from Brian.

Sure. Thank you all for talking about these topics. This is related to a question that I asked at an open science symposium two or three years ago. To address some of the issues with data accumulation, and also with data reuse and the statistical legitimacy of reuse: should data sets just be replenished with fresh data? Even for things that we think we know about, the data sets that come to mind are things like genomic databases, where the formatting of the data is much more formalized and data reuse is really common. Should everything just basically be expected to be re-asserted at some point, sort of like an expiration date?

I guess I can start answering that. Great question, and unfortunately, if you were hoping for one general answer, I think in a lot of different contexts there's going to be a different answer to that question. It also depends on how you intend to use the data. For example, in a lot of instances where you're fitting a model to some training data and want to use some data to validate the model, always adding data to that data set is going to be very useful, because the new data can then be the new test data, and that's a great way to prevent overfitting in machine learning or something like that. But in other cases, for example in the fields I generally work in, psychology and neuroscience and the like, we run experiments where, if we have 100 people do one experiment, it's going to be hard to keep adding to that experiment continuously, ad infinitum. We're going to have to stop at some point, because we're going to want to do
different experiments, and so forth. In those instances, the important thing is to make sure we replicate studies, so that we know whether the results we have from previous experiments are valid or not. Then at least we get two data sets instead of one, and I think that's also very useful, where we get multiple data sets doing a similar thing. So hopefully I answered your question; I think I waffled a bit, but hopefully that was a good answer.

And to add a perspective from a different field, and I think this answers Keith's question about what "forever" means: since we're doing real-world measurements, and the real world is always changing, for policy reasons and other reasons, 10 years is pretty darn old, and anything older than 15 years is pretty much obsolete. So to me, "forever" is approximately 15 years, and if we overwrite things every 10 or 15 years, or if we generate a new data set every 10 or 15 years, that's totally reasonable.

All right, thank you, everybody. Thanks to all the panelists, your talks were so great, and thanks to the audience for the great discussion. There's obviously a lot to talk about, so I encourage you to take the discussion further in the Slack and in Gather.Town. We're going to take a 15-minute break, and we'll be back, on time, at 11:45. Thank you all.