Welcome back, everybody. I guess we're going to get started. I know we're all full of tacos on this lovely day, but it's time. We must press on for open science. Our first speaker in this session is Timothy Verstynen. He is an associate professor in the Department of Psychology and the Center for the Neural Basis of Cognition at Carnegie Mellon. Relevant to his talk today, he's the director of the Cognitive Axon Lab and the co-director of the CMU-Pitt BRIDGE Center.

Thank you all for staying awake through the food coma today to let me talk. This is going to be kind of a soapbox talk about the state of the field, not necessarily one focused on individual things that I've done, although we'll get into what we're doing to address some of the issues in the field. So this is me standing up and ranting for a little bit. I work in the field of neuroimaging, particularly MRI neuroimaging. And MRI has become an incredibly important tool for understanding human cognition and the biological bases thereof. You're probably all familiar with the concept of functional MRI, or fMRI, where we can look at task-related activity patterns in the brain while you're thinking about, say, nouns versus verbs. But we can also use MRI to look at activity flow or information flow in the brain by looking at correlations in hemodynamic signals, so we can see which areas are talking to each other on a moment-by-moment basis. We even have access to a lot of the structural connectivity of the brain using MRI. So this is a map of the white matter pathways measured using a technique called diffusion weighted imaging. So we can actually start to look at the underlying circuit-level architecture of the human brain. And we tend to put these all together. We try to look at the brain in these different ways using MRI and try to understand how these large, complex circuits give rise to human cognition.

And we're at kind of an inflection point in the field right now. But to really understand what the inflection point is, you have to know where we've come from first. So let me give you a brief history of the use of MRI as a research tool. Prior to the 1990s, MRI was largely a medical imaging technique for structure. This is a pretty typical structural image that you would have gotten in the 1980s. It was usually used as a proxy or a supplement for CAT scans, really just to look at stroke or tumors. But in 1991 and '92, there were two major discoveries that really revolutionized this technique as a research method. The first was the discovery in 1991 that MRI is actually sensitive to the direction and intensity of water diffusion. This allowed us to start using it to map out the white matter pathways in the brain and to build diffusion weighted imaging. In 1992, a series of papers, and it was a big competition between multiple groups to get their papers out, started showing that the MRI signal is actually sensitive to the blood oxygenation changes that happen as neurons fire. So this is the very first paper, the one that won the race and came out in 1992, showing that if you flash a checkerboard on a screen, you get activation in the early visual cortex in the areas you would expect. So that really opened the door to using MRI as a research tool. Then in the late 90s and early 2000s, there was a refinement and advancement of this technique.
This included the rise of what's known as resting state fMRI, which means that you just look at correlations in the MRI signal to see which regions form networks. This is the first resting state fMRI paper, showing that if you just look at which regions correlate with a voxel in the motor cortex, you can recover the cortical motor networks of the brain. And then over the last 10 years, we've seen this really big explosion in the ways we can look at this data. For example, we've got refinement of tractography methods to actually make those maps of white matter pathways, and the rise of using machine learning to do decoding analyses, so we can actually decode what people are thinking, whether they're thinking of nouns or verbs, or animate or inanimate objects; we can reliably predict that from activation patterns. What you're seeing here is actually a map of semantic relations in the brain as somebody listens to a podcast. So we've really seen this explosion of very powerful tools. We've also seen a rise in very big population-level research studies like the Human Connectome Project, where we're no longer collecting data from a few subjects or a few dozen subjects, but from hundreds or thousands of subjects at a time.

So that gets us to the modern day. And what we're faced with in our field is a problem of size and complexity. Our data sets are getting bigger and the questions we're asking are getting more complex. Let me show you the scope of how big things are getting. First, our data sets are simply getting bigger themselves. This is a review from Van Horn and Toga that's now five years old, but it turned out to be shockingly accurate in its predictions. This is just the raw size of individual data files coming off of the MRI scanner. As we increase the spatial and temporal resolution of the technique, and add in things like multiband imaging, which allows you to collect multiple slices at the same time, our data files are just getting physically bigger. Add to that the fact that our sample sizes are also getting bigger. As a way to illustrate this, Christy Benuelos in the lab just did a quick search through PubMed looking at the top 10 fMRI studies published in the journal NeuroImage across the last 20 years. What you can see here, aggregating them in five-year bins, is that before 2000, you were getting about eight to 12 subjects per study. In the early 2000s, when I started graduate school, the average was about 12 to 16 subjects per study. And we're quickly increasing the sample sizes: in the late 2000s, we were averaging about 25 to 30 subjects per study. And then we get this huge explosion. There are two reasons for this. One is the rise of these big data research programs like the Human Connectome Project and other similar large research projects, and that's actually what you're seeing in the 2011-2015 jump, the first of these papers starting to get published in NeuroImage. We also started to become keenly aware of power issues in our studies. In the early days of MRI, we were kind of in love with these big blobs we were seeing in our data, and then we suddenly started to realize that there's a huge problem with statistical power. So now, just to meet the power requirements for the analyses we're using, we have to run larger samples. So we're getting larger physical data files.
We're running more subjects. And on top of all this, our analyses are getting more complex. For example, when I started in fMRI in 2001, our average voxel size, which is the spatial resolution of our data, was about three and a half millimeters. That meant we had about 20,000 or so analyzable units in the brain. So I would run about 20,000 regressions and look at the blobs; that was when I started. As we improve the spatial resolution of fMRI with things like multiband imaging, that number goes up to about 125,000 analyzable units per brain, per subject. Now, as we move into measures like connectivity, for example resting state functional connectivity matrices, we're now looking at covariance. That increases the number of analyzable units because we're looking at edges (for n regions, you get n(n-1)/2 unique edges). For example, in one of my recent studies, we have about 195,000 analyzable units per subject, per brain. And the biggest one we have so far is within diffusion imaging. This is an example from something known as connectometry, where for every individual subject we have 433,000 analyzable units. So the complexity of our analyses is increasing along with the size of both our samples and our physical data files.

And then finally, along with the complexity of the analysis, the complexity of our standard preprocessing, our data cleaning, our data management, is getting more complex. When I started doing fMRI, the way we preprocessed our data coming off the scanner was relatively straightforward. We would just do some corrections for head motion and a few other artifacts, call it a day, and get our pretty papers published. But if you tried to publish a paper using those standard practices now, you wouldn't get past a basic journal. This is what our industry-standard preprocessing looks like now. You have to do surface rendering, distortion artifact correction, physiological noise artifact correction. This is actually a schematic of our industry-standard preprocessing pipeline, known as fMRIPrep, and it takes about 24 hours to run per subject. So it's a massive investment in computational time just to get the data into a state where we can start working with it.

So we have this dilemma: how do we as a field adapt to the exponentially increasing size and complexity of our data? Well, the answer is to build a stronger and more effective research community. And I'm going to say that the only way we can do this is by adopting open science practices. And that's actually where we're going as a field. Compared to fields like genomics and astronomy and physics, we're a little bit behind the times in terms of things like data standardization, but we're catching up. About three years ago, an open science community out of Stanford proposed, with the broader imaging community, our first international data standard, which is known as BIDS, the Brain Imaging Data Structure. It's a simple and intuitive way to organize your neuroimaging and behavioral files. The idea behind BIDS is that you should organize and name your data, and include the metadata, in such a way that you can hand your data set to somebody without saying a word to them, and they know everything they need to know about your study. So it's meant to be 100% shareable in the absence of any other communication between the researcher who collected the data and the researcher who grabs the data. It follows all international neuroinformatics standards.
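To give a flavor of what that organization looks like on disk, here is a minimal sketch in Python (hypothetical subject, task, and study names; not a complete or authoritative rendering of the specification) of the kind of predictable folder and file naming, plus JSON sidecar metadata, that BIDS prescribes:

```python
from pathlib import Path
import json

# Hypothetical two-file study laid out BIDS-style: predictable folders,
# predictable file names, and JSON sidecars carrying the metadata.
root = Path("my_bids_dataset")
sidecars = {
    "dataset_description.json": {"Name": "Example study", "BIDSVersion": "1.2.0"},
    "sub-01/anat/sub-01_T1w.json": {"Modality": "MR"},
    "sub-01/func/sub-01_task-rest_bold.json": {"TaskName": "rest", "RepetitionTime": 2.0},
}

for rel_path, metadata in sidecars.items():
    target = root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(metadata, indent=2))
    # The imaging data itself would sit next to each sidecar, e.g.
    # sub-01/func/sub-01_task-rest_bold.nii.gz
```

The point is that the paths and the sidecars themselves carry the information a stranger would need, which is what makes the data set shareable without any extra explanation.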
It's flexible and easy to adopt, and it works with multiple modalities. And BIDS has allowed for an expansion of data sharing in our field unlike anything I've seen in my academic lifetime. For example, you can track OpenNeuro, which is our field's standard data repository: when you run a neuroimaging experiment, you can put your data up on OpenNeuro. What this shows is the collective number of samples on OpenNeuro across time, and you can see that even in the last year and a half, we've had more than a doubling in submissions to OpenNeuro. OpenNeuro only works if you submit your data in BIDS format, so you have to have your data in BIDS format already. So this is just a way of showing how fast the field is adopting this industry-standard data architecture.

Along with data sharing, BIDS allows you to make easily shareable data tools. There is a concept that's been adopted in the field known as BIDS apps. The idea is that if I write a data pipeline that assumes my data is in BIDS format, and I want to share my pipeline with you, I have all the same data architecture assumptions as you do, so I should be able to easily hand my code to you and have it work. What a BIDS app does is take these pipelines and wrap them in a Docker container, so you get easy extensibility for sharing your code, and it basically boils the execution logic down to a very basic run script. You run the script, point it to your data in BIDS format, point to where you want your output to go, maybe add a flag or two, and you're done. It should work automatically. It's platform independent because we're using Docker, and if you add in Singularity, this easily extends to high performance computing centers as well. So with the rise of BIDS and BIDS apps, we now have an infrastructure, being adopted by the research community, that allows for data and tool sharing.

So how do we actually make this a completely community-adopted standard? I'm going to make the case that we need to foster these open science practices at the point of data access, because right now this sits on the shoulders of the individual researchers. I co-direct the CMU-Pitt BRIDGE Center, which stands for Brain Imaging Data Generation and Education Center. It's down on the first floor of this building. When we put together this collaborative imaging center, we based it on four core principles: an imaging center is both a research tool and an education tool; an imaging center strives to remove barriers to access, both to getting data and to analyzing your data; an imaging center serves as a research community hub; and an imaging center fosters innovation. Underneath all these principles is the basic idea that we should be adopting open science practices, that we should be sharing as a community our tools and our understanding of our data. In order to remove the barriers to adopting these standards, you have to know what existing barriers there are in our community. Right now, everybody treats an imaging center like a microscope. MRIs are too expensive for individual labs to have, so we usually pool together and get one MRI that people share and use. So this is Sarah.
Sarah is a researcher at CMU, and her lab runs subjects down in the BRIDGE Center. Typically what would happen is she'd get her raw data, and her lab would have these internally developed analysis packages based on some custom file naming convention and data architecture that her lab has adopted. This is John. John's a researcher over at Pitt. His lab does the same thing: they send their subjects down, they get their raw data, and they might have a different set of software packages that they analyze with. In order for John to share any kind of code or data with Sarah, he has to make a transformation function. He has to find some way of transforming the logic of his analysis and the logic of his data into a form that Sarah's lab can use. And Sarah needs to invert that process. The problem really sits with how these researchers get this data. The typical data access point for a neuroimaging center looks like this. On the left you have basically the MRI system itself, which reconstructs data and produces it in a format known as DICOMs, which is a very raw and unpolished data type. Usually we would SFTP or SCP that data over to a separate file server that might have a sort script. Researchers get that really raw, sorted data and then they have to build their own internal pipelines. And if they want to make their data BIDS standard, they have to write conversion code for each one of their experiments to make it BIDS compliant.

What we did at the BRIDGE Center is we teamed up with Flywheel to make an automatic, cloud-based data server that immediately gets your data in a BIDS compliant manner. We have what's known as a reaper: it reads data straight from the console, live, so as you're collecting data, it's being pulled into the cloud. And because we adopted a basic file naming logic on the console itself, we have an automatic tool that can use that logic and parse the files into BIDS when they hit our cloud server. So with a single push of a button, researchers get their data automatically in a BIDS compliant format. We're also giving them things like automatic quality control and quality analysis metrics on their data, and standardized preprocessing, all in one. So a researcher can push their data up and they'll get it in a shareable, standard format, preprocessed in a standardized way that allows them to share. Essentially, what we've done as a center is make it so that both Sarah and John get their data in the same format, processed and cleaned in exactly the same way, so that all the tools they build from it follow those shared assumptions. The barriers to sharing and communicating are gone. A tool that Sarah's lab develops can easily be ported to John's lab. And they don't even have to be in the same imaging center; they can work within this broader community. So just to go back: if we want to deal with this large, complex data problem, we need a stronger, more effective community, and I'm going to change the "how" of what I said to: adopt and foster open science data practices at the point of data access itself. We have to put it at the point where you get your data, because that way it lives behind the scenes and makes it easier for researchers to adopt these practices without really having to think about it.
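As a rough illustration of that last idea, here is a minimal sketch in Python, assuming a purely hypothetical console naming convention (the actual BRIDGE/Flywheel logic isn't spelled out in the talk), of how a fixed naming scheme lets a simple routine file incoming series into BIDS paths with no input from the researcher:

```python
import re
from pathlib import Path

# Hypothetical console convention: functional series are named
# "sub-XX_ses-Y_task-NAME_bold", so a fixed rule can route each one
# into its BIDS location automatically.
SERIES = re.compile(r"(sub-\d+)_(ses-\d+)_(task-[a-z]+)_bold")

def series_to_bids(series_name: str, root: Path) -> Path:
    match = SERIES.fullmatch(series_name)
    if match is None:
        raise ValueError(f"{series_name!r} does not follow the naming convention")
    sub, ses, task = match.groups()
    return root / sub / ses / "func" / f"{sub}_{ses}_{task}_bold.nii.gz"

print(series_to_bids("sub-01_ses-1_task-rest_bold", Path("bids_dataset")))
# bids_dataset/sub-01/ses-1/func/sub-01_ses-1_task-rest_bold.nii.gz
```

The real system does much more (metadata, quality control, preprocessing), but the core trick is the same: agree on the names once, at the scanner, and everything downstream can be automated.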
Okay, I'm way over time, so I apologize. Thank you everyone. Do we have any questions?

The OpenNeuro repository, or on our data server? For OpenNeuro, it's both the raw and the preprocessed. So both raw and processed data. Yeah, you actually have the original raw BIDS and then the preprocessed data, all of that. That's great. Yeah.

All right, up next we have Sara Weston, who is an assistant professor in the Department of Psychology at the University of Oregon, Go Ducks, where she studies how personality affects health outcomes, often using publicly available data.

So someone earlier this morning mentioned data parasites. I am a data parasite. I almost exclusively use data that were collected by other people, and in doing so, and in thinking about how the work that I've done has been shaped both by open science and by the generosity of people making their data open to me, I've been thinking about how I then have a responsibility as a researcher to make sure I use those data in a way that's robust. So I'm going to talk about some of the challenges that I think are unique to those of us data parasites who use other people's data, things that we're probably not considering right now, but we should be. Before I start, though: this summer, Tal Yarkoni wrote a blog post called "I hate open science." It was a very good blog post with a very provocative title. It turns out Tal did not hate open science; he hated the term open science. What he said is that this label gets applied to many, many different things, including reproducibility, replicability, equity and inclusivity, data sharing, transparency, diversity, all kinds of stuff. And he thinks the term is often used as a cover for all of those things when maybe the person giving the talk or describing a point of view wants to focus on only some of them. So with that, the goals of open science that I'm particularly interested in today are these three: reproducibility, replicability, and equity and inclusion. And just to clarify reproducibility and replicability, because these terms are often used differently by different fields, or even by different researchers in the same field: when I speak about reproducibility, I mean you should be able to use my data and my methods and get the exact same results. There's no statistical test involved to check whether you got the same results; it's, did you get the number that I got in my paper? Whereas replicability is the idea that we can use different data and the same methods to get the same or similar results. Those are slightly different, and both are distinguished a little bit from sensitivity analyses and conceptual replications as well. So secondary data analysis, specifically the analysis of preexisting data sets, is the focus of my talk today. This is the idea that we're taking data sets that were either designed to answer a different question than the one we're answering, or, often, data sets that were collected with the goal of answering many different questions, by people who were not the ones involved in data collection. When I think of secondary data analysis, and most of the work that I do is with large existing panel studies: panel studies often have thousands if not tens of thousands of participants, many of them are longitudinal in nature, they're run by teams of researchers, and they often have a ton of funding, either from NIH or the National Science Foundation or consortia of universities.
They're often cross-disciplinary, meaning they have psychologists, economists, medical researchers, epidemiologists; lots of people are involved in collecting these data sets. And they represent years of time and investment, often millions of dollars of investment. And they're made publicly available for researchers like me to go and use to answer questions in a number of different domains. When I give this talk today, those are the data sets that I have in mind and that I'm building these ideas off of. But I also want to recognize that this applies to single lab studies. Anytime your lab does a study and adds it to one of the repositories we've talked about today, anytime you share data on something like the Open Science Framework and someone else goes and uses those data, these same ideas apply as well. Those studies might not represent the same amount of investment in terms of dollars and time, but for the individual labs that collected the data, it probably represents a significant amount of dollars and time, a significant amount of labor. So we need to treat all of these data sets with respect when we use them. At least in psychology, there are a number of large panel studies that are used quite widely around the field. These are some of the ones that I'm most familiar with; in your own field, you probably have your own panel studies that you think of as, this is the thing that everybody uses. In personality psychology, those of us who study health rely quite heavily on the Health and Retirement Study and the Midlife in the United States study; those come up quite frequently. Some of my friends who do work in cognition and genetics use the UK Biobank sample again and again and again, and we see these data sets pop up quite frequently in our published literature.

Secondary data analysis is already meeting some of the goals of open science, specifically equity and inclusion and reproducibility. In terms of equity and inclusion: the fact that we have publicly available data sets, sometimes you have to pay for them, but quite often you don't, you can just register for access, means you don't have to have a large grant to do good quality research. You also don't have to sit around and wait for decades to do good quality research. I'm a health psychologist. I'm interested in things like longevity, or at what point in the lifespan someone develops a particular condition. I could start collecting data on people now and wait around 20 years to be able to answer those questions. That's not a bad thing to do, but I can start answering those questions right now with some of these data sets. It also means we're expanding equity and inclusion to include researchers who aren't at traditional grant-driven or R1 institutions. Researchers who work at teaching colleges and want to give their students a chance to do research, or students in other fields who might not have the same resources that everybody else has, can still do rigorous, robust, interesting, meaningful research because we've put these data sets online. This is a part of equity and inclusion that I don't think is talked about enough. And in terms of reproducibility, it should be very intuitive that if I use publicly available data and then I give you my code, you can go and test whether I actually got the results I reported from that publicly available data.
It also saves researchers the step of actually making their data available, because it's already available. So by doing secondary data analysis, and by encouraging it in our fields and making it easier for people to do, we're meeting both of these goals. However, as I said in the title of my talk, there are some unique challenges to using these data sets. A paper earlier this year by Thompson, Wright, Bissett, and Poldrack pointed out what I think should be an obvious statement, but unfortunately it took until 2019 for someone to point it out: if multiple researchers use the same data set, we're increasing the family-wise error rate for that data set. Hopefully at this point it's not a surprise to you that if you, a single researcher, analyze one data set multiple times and only publish the significant findings, we call that p-hacking; you're inflating the type one error rate because you're doing multiple tests on the same data set. Data don't care where they come from, and they also don't care who analyzes them. So if multiple researchers each do a single analysis on the same data set, that's computationally the same as one researcher doing multiple analyses. So if we have a large panel study and 20 different labs each go and do one analysis on that single data set, we've increased our type one error rate. We're probably going to find some false positives in there, just because that's how probability works (if none of the effects are real, the chance of at least one false positive across 20 independent tests at an alpha of .05 is 1 - .95^20, or about 64 percent). Again, it took until 2019 for someone to point this, I think, fairly intuitive thing out, but it's something worth considering. And the reality is that we don't have researchers each using these data sets once. I'm going to suggest on the very next slide that researchers tend to return over and over and over again to these panel studies, further inflating the type one error rate. But even if you did only use that study once, even if you only used the Health and Retirement Study one time, you still have a problem, which I'm going to refer to as the curse of knowledge, because these data sets tend to be used by the same small communities of people. The data sets I threw up there at the beginning are things that either I have used or colleagues of mine have used. When I was a first-year grad student learning the literature of psychology and personality and health, I was reading articles that used those data sets, meaning that before I had even downloaded the data, I knew what some of the relationships inside that data set were. And so any analytic choices I made with that data set were no longer data blind. They were no longer based just on theory; they were based on knowledge of the data set. Whether I as a first-year graduate student understood that or not, I was making choices based on what I knew was in these data sets. So we have this curse of knowledge, where we understand the data before we've even analyzed them, and we have this problem where multiple researchers are using the same data sets over and over and over again. I don't like calling out other people, so I'm going to call out myself here: this is a bunch of articles looking at personality, conscientiousness, and health. All of these studies use the Health and Retirement Study data set. Two of those are mine. So, calling myself out here. These are things that I either read when I was a graduate student, or worked on when I was a graduate student, or that came out very recently.
Meaning any researcher in personality now, even if they're currently a first-year graduate student in my subfield, before they even touch the data, is probably going to read or be aware of these studies, and they're going to know a lot about this particular relationship inside that data set. Just out of curiosity, and I do this with psychology groups: how many people have used one of these large panel studies, these publicly available pre-existing data sets? How many people think they've read a paper that uses one of them? Yeah, there we go. Right. So even if you're not using these data sets yourself, you're aware of these relationships, and that biases you before you even go in and analyze the data. So we need solutions to this challenge that are unique to secondary data analysis and that help us, not solve, but at least deal with this curse of knowledge problem that we have.

There are sort of two routes we can go. We can increase transparency. One simple thing that's not done in our field right now is providing links to codebooks for publicly available data. Often our research articles don't tell people: how do I understand what is in these data? What variables were available for you to use? How were those variables coded? It's a very simple, very easy thing to just include a link to a codebook in your manuscript. Putting a little bit more burden on the researchers themselves, we should be disclosing the times that we've used that data set before, regardless of whether it was published or not. We should disclose the fact that we do return to these data sets over and over again. If I ever use the Health and Retirement Study, I'd better be sure that somewhere in my documentation I've said, by the way, here are the four other studies in which I looked at these data. More difficult might be disclosing the times that we've read about a data set, but clearly it's a problem that we're all aware of these relationships before we've even downloaded the data. So we should find some way to track and catalog and report: how many times have you read about this study before? What are the instances in which you already came across this study? How might that be biasing the choices you made in your analysis? And then finally, pre-registering our analyses prior to data analysis: writing down, ideally before you get access to the data, but at any point in the process, this is what I plan to do with the data, so we can do a better job of differentiating the choices we made based on what we knew before looking at the data from the choices we made after we looked at the data. Some of these changes are already making their way into the system, at least in psychology. I want to point out gerontological science here. Next week there's a conference of the Gerontological Society of America. This includes psychologists, medical researchers, epidemiologists, people involved in public policy, and they have a number of talks this year just on open science, and a couple of people specifically talking about open science and secondary data analysis. So I'm encouraged, because I see this field taking it very seriously and trying to engage more with this issue. So we're starting to see some of these practices encouraged and incentivized and rewarded in at least some of the subfields that we're looking at. We also see, at least in psychology, that many of our journals are starting to accept registered reports with secondary data analysis.
That's maybe controversial in some places, the idea that you can even pre-register analyses of data that already exist, but we see journals that are willing to work with authors and try to develop systems that allow them to engage in some of these open science practices even while they're using pre-existing data. The other route we can take when it comes to dealing with the challenges of secondary data analysis is to improve our inferences. There are a number of ways we can do that, some of which are really exciting. Computer scientists, especially those involved in machine learning, have pointed out a number of techniques that I see adopted more and more frequently in personality psychology, including things like data-blind analysis, cross validation, and holdout samples. We can borrow some of these practices that are being developed elsewhere and use them specifically with secondary data analysis as a way to improve our inferences, to make sure we're being more conservative with our estimates, to use a more data-driven approach, and to differentiate when we're making choices based on what we already know. There are a couple of other ideas that have come up recently. I guess it was two or three years ago that psychology had the alpha wars, where we argued about whether or not we should keep using p less than .05 as our cutoff. I don't think that was ever resolved, but we still hear people talking about it as a potential way to deal with some of these issues with null hypothesis significance testing. I'm also seeing more and more coordinated analyses. This is where people take multiple pre-existing data sets, do the exact same analysis on each one, and pool the results. So now my knowledge of the HRS, and my knowledge of the Midlife in the United States study, and my knowledge of three or four other studies, all of those idiosyncrasies get washed out if I have to pick one analysis that works in all of them. It's harder to pick analyses that are going to be in your favor when you have to repeat the exact same analysis in four or five different data sets and pool them together. And then finally, and this one I don't see happen as much, but I'd like to see more of it, is just being more accepting of exploratory analyses. Not everything has to be confirmatory. Not everything has to have a significance test attached to it. It should be okay that we just explore data, and that can be the beauty of some of these data sets: they let us explore new relationships. I am also starting to see some changes in the research we're doing and in the practices psychologists are engaging in. For example, an article that came out at the beginning of this calendar year by Amy Orben and Andy Przybylski looked at screen time and adolescent well-being, and they used multiverse analysis to look at a whole bunch of different models they could use to predict these outcomes and compare the results across all of those models, thus not favoring one specific model that might be driven by what people know of the data set, but allowing the whole possible universe of models to speak for itself, and looking at the results in aggregate. And one of the things I want to point out about this study is that it was very well received by the field, meaning I think we're more attuned to the problems that come with secondary data analysis and more excited about the innovations we have to deal with them.
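To make the multiverse idea concrete, here is a minimal sketch in Python on simulated data (hypothetical variable names and analytic choices; this is not the Orben and Przybylski specification): the same simple regression is fit under every combination of a few defensible choices, and the whole distribution of estimates is reported rather than any single preferred model.

```python
import itertools
import numpy as np

# Simulated stand-ins for the kind of variables found in panel studies.
rng = np.random.default_rng(0)
n = 500
screen_time = rng.normal(size=n)
covariate = rng.normal(size=n)
wellbeing = 0.05 * screen_time + 0.30 * covariate + rng.normal(size=n)

# Each analytic "choice" is one axis of the multiverse.
outcome_versions = {"raw": lambda y: y, "z_scored": lambda y: (y - y.mean()) / y.std()}
covariate_sets = {"none": [], "adjusted": [covariate]}
exclusions = {"everyone": np.ones(n, dtype=bool), "trim_outliers": np.abs(screen_time) < 2}

estimates = []
for (o_name, transform), (c_name, covs), (e_name, keep) in itertools.product(
        outcome_versions.items(), covariate_sets.items(), exclusions.items()):
    y = transform(wellbeing)[keep]
    X = np.column_stack([np.ones(keep.sum()), screen_time[keep]] + [c[keep] for c in covs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates.append((o_name, c_name, e_name, round(float(beta[1]), 3)))

# The "multiverse": one screen-time coefficient per specification,
# reported as a set rather than a single cherry-picked result.
for spec in estimates:
    print(spec)
```

In a real multiverse or specification-curve analysis the choices would come from the literature and the codebook, and the estimates would typically be plotted and summarized rather than printed one by one.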
The last thing that I want to add, in my last 20 seconds here, is that I think we need to do a better job of exercising judgment. It's very easy to want to come up with some sort of method or model or way of doing our analyses, or system of pre-registering our analyses, where we can say, yes, this researcher did this thing, so it must be a good study, and no, this researcher did not do this thing, so it must be a bad study. There's nothing we can come up with that's going to solve the problem; there's nothing we can do where we can say, yes, this is good, and no, this is bad. All of the things we're developing are tools to help us do a better job of evaluating research as readers of that research, and so I just want to put in a little plug for using these tools as a way to evaluate the credibility and robustness of research, not as a way to solve any of these problems. So I just want to thank my collaborators who have been working on this with me, and you for listening.

I think this was a really interesting talk. To go back to the research parasites: thinking about data reuse in that way sort of suggests that using data is a consumptive act, that there's a certain amount of knowledge in the data and when someone extracts it, that removes some of that knowledge, and I disagree with that. But the point that you've raised here about this curse of knowledge, and having so many different studies on a particular data set, I think is maybe a different way of thinking about that. So I'm wondering: do you think there is a point at which a data set is used up?

Yeah, that's a good question, and one that no one has yet asked me, but one I lie awake thinking about at night. Maybe. Some of these longitudinal studies, these panel studies, and I keep picking on the Health and Retirement Study because it's the one I've used the most, started data collection in the early 90s and are still collecting data today, meaning there might be ways in which you could say, I'm going to use only new waves of the data that come out and not rely on old ones. Although you still have some of the curse of knowledge, because the participants are the same and there's no reason to think that the relationships between variables within people are going to change that dramatically over time. This is being recorded, so I don't want to say yes, there's a point at which a data set is used up, but there are limits, right? And I think it's a point of diminishing returns. The first couple of times a data set is used, it's going to yield more new and credible information, and as we move further and further down, we're getting less and less quality information out of it. The Poldrack paper, which I think I linked to in the slides, and the slides are going to be on OSF, so go check out that paper, did a really nice job; they used a lot of simulations to show what, again, I think is a very intuitive point, that the more we use these data sets, the more type one error we have. But I think they also demonstrate that early on we can have more faith in these results than we can later and later, and maybe that's the point at which, in terms of confirmatory testing, these data sets are used up, but that doesn't mean we still can't use them for other things.

Hi, I have a question about using pre-existing data sets, given the talks this morning, and I'm open to anyone commenting on it. If a paper is retracted, you get a retraction notice, and that isn't always followed through, and that
data can still be used, or the information in it can be used again. What safeguards are in place for data that could be retracted? Is there anything?

Right. So, to preface this: again, I'm not someone who has done a lot of data collection. My co-PI is a research symbiont, and we work really well together because he collects lots of data and I use it, so most of what I know, I know from him. But there are great repositories where you can put data, some built specifically for things like neuroimaging or genetics, but also data repositories that anybody can upload any kind of data to; the Harvard Dataverse is a great one. Some of those data repositories are peer reviewed, so in order to put your data on them you have to write up a very detailed description of how the data were collected and what's inside the data, you have to basically write a very detailed codebook, and those get peer reviewed for things like readability before they're published. So I imagine that if you were to post data in those repositories and we then found something problematic with it, there might be a mechanism there for data to be retracted, or for some sort of notice to be put on those data. Whereas other mechanisms, like posting your data set on a personal website or even on the Open Science Framework, don't have peer review involved and don't seem to have those checks. One of the things I've been especially concerned with, as someone who uses publicly available data, is that I think we need to do a better job of incentivizing and rewarding posting data, and part of that comes with creating standards for what a good data set is and what it means to properly document a data set, so that we can reward the people who go through the effort of doing that and who make research possible for people who might not be able to collect their own. And the ability to retract a data set, or to flag data sets that are problematic in some way, should go along with that.

Up next we have Daniela Saderi, who is the co-founder and project director at PREreview, a platform for crowdsourcing preprint reviews.

All right, thank you everyone for holding on this afternoon. I'm pretty tired, but I'm going to try to get through my talk at a pretty speedy pace. Before I start: these slides are going to be available online, for now just at this URL, and they're licensed under Creative Commons BY 4.0. So yeah, I'm Daniela. I'm a neuroscientist by training; I finished my PhD nine months ago now at Oregon Health and Science University. I am Italian, and during my PhD I worked with cute ferrets, studying how the auditory system integrates sounds coming from multiple directions, but somewhere along the way I also came across an amazing community of open science and open scholarship advocates that really opened the doors to what I'm actually doing right now. That is what I call my dream team, Monica Granados and Sam Hindle, who work with me at PREreview. I'm also a Mozillian; I just finished a fellowship last year that allowed me to do this work after my PhD. And I also love organizing things like this, community events and hackathons; that's just one example, Science Hack Day in 2016 and '17, we've done it both years. So yeah, OpenCon 2016 was the first time I heard the term open science.
Robin Champieux, who some of you might know from OHSU, is really my open science mentor, and she gave me a scholarship to attend this conference. I had no idea what I was going to attend, and there I found this amazing group of international activists, researchers, and librarians who were really trying to make science better. I was shocked, excited, overwhelmed, I remember all of those feelings, and one of the most important was that for the first time I could actually put a finger on something I had felt as a researcher but hadn't quite realized: that science is really not the open, exciting, collaborative thing that I imagined before I joined academia. A lot of scientific knowledge, and not just science, I really want to open this up to the whole corpus of scholarship, is generally locked behind closed doors; it's defaulted to this state of opaqueness and closedness. Everyone in this room is here to try to change that, but for me it was a new concept. Take peer review, for example, which is an essential part of the research cycle. I wish I could take credit for this awesome drawing, but it's supposed to be a process that allows us to check the research and the science that comes out and gets disseminated, so that other scientists can build upon it and grow our understanding toward knowledge. But in reality it's a process that is mostly, things are changing, but by default, dominated by, or kind of controlled by, journals, in a way that is even more opaque than we might realize. Just think about it: we have journal editors, and as far as I understand there is no real protocol to actually hire them, it's not even hiring, because most of them, not in all journals, I'm going to make some generalizations here, are researchers who do it for free and for the good of knowledge, and they get into this position of being editors. Then their role is to find reviewers to review our science, and these reviewers are found through personal connections and lists that are not shared across journals. It's a very non-transparent process in that sense. On top of that, it's very slow: as you all know, it takes on average six months to a year, depending on the discipline, to really get your work out. And the major issue I have with peer review is that it very effectively reinforces the inequalities that have dominated academia for centuries, concentrating the power to really unlock science and knowledge in the hands of very few people. This is an old picture, but if you look at the makeup of editors and reviewers, and I should have put citations here too, it has been shown to be predominantly male, white, and from countries in North America and Europe. Last but not least, even though peer review is such an important process in our research cycle, there's a lack of formal training in how to provide constructive feedback to other researchers. You're basically just supposed to know how to review because you publish yourself, which is absurd to me. So we have a system that is so important, and yet it is opaque, slow, unrewarded, lacks diversity, and lacks training. Table that for a second. Another thing that I learned about at OpenCon is preprints, and they were probably the most exciting thing that I came back home with. Preprints, as most of you know, are manuscripts that are freely available online before journal-organized peer review.
Physicists started this, as was mentioned before, in 1991 with the arXiv, but now more and more scientists are adopting the practice. We have the internet; we can just put our research out there ourselves without waiting for a journal to go through the long peer review process. And what's awesome about this is that they're really complete manuscripts: they're permanent, they're versioned, and they're citable, they have digital object identifiers in most cases. But my favorite quality of preprints is that they can be reviewed by anyone, that you can get feedback from anyone. This is something that came out yesterday, I'm sorry it's cut off there, but it's from the bioRxiv team. They finally put out, and I have to say it was a lot of work, so I'm not complaining, the data they have been collecting over the past few years, along with reports of their impact. This is just to show why researchers post preprints, and the main reason seems to be to increase awareness of your research. But I'm interested in the feedback part, and I was very surprised to see that it was put in the fourth category here. I think the sample size was pretty high, above 4,000 researchers, and of course it's biased, these are all people who are pro-preprint and on bioRxiv, and there is a geographical and gender bias, but nonetheless, this is the data. However, when I came back from OpenCon excited about this feedback possibility and looked at bioRxiv, again this is the biased sample, the amount of commenting, the use of the public commenting box, was around 10 or 11%, and this has not changed very much, especially considering how much the adoption of preprints itself is changing. And this is the last image I'll show from this study: it seems like most of the feedback happens on Twitter, and I love Twitter, but it also gives me this fear of missing out, that, oh my god, most of the discussion around my science is happening somewhere I'm not checking 24 hours a day. So I wished there was a system to actually gather conversations around preprints. The first thing I did when I came back from OpenCon was what I could do: I was running a journal club in neuroscience, and I said, can we stop reviewing and talking about already-published work, where we find all these problems but can't do anything about them, which is frustrating? Can we just talk about preprints instead, and then send an email to the authors, or even better, actually post these comments in the commenting box on bioRxiv, and so on and so forth? And this was not my idea; it was something that I read from Dr.
Prachee Avasthi on Twitter, and I thought, this is great. So it kind of molded into a project that Sam Hindle, the woman in the middle, and I started in 2017. And because we were so open about it on Twitter, Authorea, which at the time was an independent startup, it's now owned by Wiley, offered us their platform, and in less than 24 hours we had a way for researchers to post reviews and get DOIs for them. Of course, for a small project run by two early career researchers, we didn't have a ton of engagement, but so far I think we've had about 60 preprint reviews put out and 200 members. But this was not a platform designed to do what we actually wanted to do, and I do think that software can help make things easier. And there is that time barrier that Lenny talked about, this is totally it: there are no incentives to actually review preprints, although we are very committed to finding them. So the idea is that we think that anyone, any researcher, should be able to post a comment or review or whatever you want to call it, and we can talk about how that should look, to really help make the peer review process more open, more diverse, and possibly faster, if we connect the pieces of the puzzle in an efficient way. In the past year we amazingly got very nice support from the community, and funders like the Alfred P. Sloan Foundation and Mozilla have helped us fund some of the ideas we had. Most, I have to say, have been related to software; it's much easier to get money for software than it is for community building, and I'm going to argue that maybe that needs to change, but we're really grateful. So we just put out a beta version of the new PREreview, which we designed after a series of surveys and sprints that we ran with researchers from different parts of the world, really trying to understand what incentives and what features we can create here to make this experience more valuable. I'm going to show a couple of screenshots from the platform; the UI is going to change very soon, so don't judge it too harshly. Actually, before I do that, I just want to mention that our goal at PREreview now is not just to focus on journal clubs, but really to empower researchers, especially early career researchers and researchers from communities that are not currently engaged in peer review, to raise their voices and bring constructive feedback to their colleagues, and to connect them through this platform. We're developing a training program that is cohort based, similar to what was mentioned that Julia did for Openscapes, which is going to pair mentors and mentees to really learn about implicit bias, how to give feedback, and how to receive feedback, so that we don't get those "you should go back to college" reviews from some reviewers, and I have read that review, it wasn't directed at me, but I've seen it. And we're hoping to connect researchers not just with each other but also with journal editors who are willing to accept these reviews as potentially helping their own process. And this is not something that only we are doing, by the way, there are a lot of efforts coming up every day now, but I think we're the only ones saying that anyone, any researcher, should review, and I'm happy to take questions from the audience challenging that concept. So any researcher with an ORCID iD can come here. You need an ORCID iD, which
doesn't really ensure that you're not a troll, but what we hope will ensure that you're not a troll is that you have to agree to a code of conduct. And even then, of course, you can go in and write unconstructive feedback, but we have the ability to kick you out, not that that might deter anyone, but we are very serious about wanting to increase inclusivity. We realized after the first three months that, although we wanted to have everything open and transparent, including identities, we couldn't do that, so we needed a way to give our users the option of pseudonymity. That means the default state is that your name is not shown; you can choose, I don't know, something like panda64, and that is your persona on PREreview, and at any point in time you can choose to go public. That's because, first, as an early career researcher you might not feel very comfortable, and this is what came out in the surveys, being public immediately, and the other thing is that we would like to track how this kind of confidence, how this state, actually changes as the culture changes. So we're going to see how that works, and we can always go back to panda64 and say, you violated the code of conduct, you're getting kicked out. There is a search engine, I'm not going to dwell on this too much, and there's a piece we're going to change a little bit, but we're going to have reviews happening in the context of the actual preprint, and again, the only thing we continue to remark is: be constructive. We're also now working on some text mining that can actually suggest language while the reviewer is writing. The last thing I want to talk about, and I don't have much time, is that we're also working with the outbreak science community. This is just a slide to show that there are a lot of reasons why preprints are great for outbreak science, because you have an outbreak going on in these three countries, and the literature that comes out at the same time, access to it is kind of delayed, and you wish that didn't happen. So we paired up with Outbreak Science, a non-profit organization working to promote preprints in that space, and we're coming up with a new platform to rapidly review preprints. It has an incredible UI; I'm going to demo it this afternoon, so I'm not going to show it right now. And I want to end with the fact that really it's about people. I realize more and more as I do this work that it's not a software problem, software helps, but it's really a community building challenge, and it is exhausting. We also made a conscious decision at PREreview to really shift the conversation toward how we can make this space more equitable, because I think equity, diversity, and inclusion are part of open science but are not always the first thing that people think about, and it's not that we are doing it well, we actually sometimes do a poor job, and that's why we want to challenge our own assumptions. So I had the first meeting with an incredible international group last month, and we were really challenging even the assumption, is the preprint good in every case? I discovered, and I'm going to just leave you with this, that in Kenya, I think two years ago, every university in Kenya tried to pool together money to pay for one subscription to Elsevier for one year, and they couldn't. So the implication of not publishing open access is not, oh, I want to be against
Elsevier; it's that the entire country is not going to be able to read your research. Those implications are things that we don't even think about sometimes, I didn't even know about them, so I really think there is space there to improve. One of the collaborators I was referring to before is in Kenya, and he just wrote a piece, and I love this quote: without addressing the lack of diversity, we cannot hope to achieve equity, no matter how much we open our science. So I really want to keep these things in mind as we continue to develop these tools and try to get money for meaningful collaborations with organizations across the world. So yeah, my hope is kind of restored: science could be an open, collaborative, and diverse enterprise, but we really need to work together, and I'm glad that there are meetings like this that can really put us in the same room to talk about it. So thank you all, and also thanks to the funders; Code for Science & Society is our fiscal sponsor, we operate as a non-profit through them, which is amazing, Sam Hindle and Monica Granados are my collaborators and work on my team, and Michael Johansson is the director of Outbreak Science. Thank you so much for your attention.

There was a question, I think during Sara's talk, about whether it's possible that we're overusing some data sets, and that got me thinking, I think I mentioned this before, about whether we should have a standardized benchmark data set and the idea of a holdout. I think you mentioned that holdouts from the machine learning community were important, where we have a holdout set, and we have this standardized data set that a lot of people might be using, but we check how models are performing on the holdout set, and then we can assess whether we're overfitting or not. Is that something that you think is useful in a lot of scientific communities, or is it also not a good idea to have, say, an organization being in charge of this holdout set, or something like that?

Okay, so, like anything else, there's not going to be any one thing that we do that's going to solve the problem, but I also don't think holdout samples are necessarily bad. I have heard of some of these larger organizations considering that idea as well: we'll release part of the data that people can hack away at to their hearts' content, and then when they're ready, they register, this is the analysis I want to do, we run it on our servers and give them the results, not the data, and they go ahead and publish on that. I think that can get us farther than just using the data collectively. We still get to the point where enough people have analyzed the holdout sample that suddenly you know what's in it. You can conceive of ways to take that further and further, like, we're going to only analyze part of the holdout sample, and it's a random part every time, and each of those steps can preserve the longevity of these data sets. But nothing is ever going to stop the fact that we will know too much at some point. I don't think there's going to be a point at which a data set can live forever, nor should there be, because people change over time, brains change over time, people change. But we should be thinking about these things, right? We should be thinking about those kinds of options, how far they'll get us, and which ones are worth
I think those kinds of questions, of 'would this work,' are the right ones we need to be asking right now.

I would say that in neuroimaging we've been cursed with the overfitting problem for decades, and it's something that, as a community, we've really only started addressing in the last five years or so. So I think the concept of benchmark data sets with a protected holdout set that can be independently validated is critical, especially as we move to more decoding or representational analyses, the types of analyses that have become very popular. Using proper cross-validation is a good tool internally, but if you're going to say 'I'm building a decoder for detecting nouns versus verbs,' there should be a holdout set that you don't have access to, that your decoder can be tested against. I think the field wants to move in that direction, but it requires a group to stand up and curate it, and that's a big investment to ask for. So it would be easier if there were a funding agency of some kind, NSF or NIH or DARPA, that would adopt these kinds of standards for particular fields and say: okay, these are the data sets we're going to benchmark against.

I have a question, prompted by Timothy's talk, but you all might have comments on it, and it's something I've seen in a couple of places, most recently out of the climate community. As we think about those two researchers trying to share data, and I love that slide visualizing that, we may need to get away from files altogether, with the metadata and the file names and all the formats, and think about file access as just one kind of interface: give people some sort of web-based or programmatic access to data where they never even get to see how it's stored, and how it's stored may change over time, invisibly to them. Is that the future? Do you see it in the near future in your field? Do you see cultural barriers to that?

So, academia moves really slowly, because we move by retirements and funerals, but I would say that I do see it in the future. We've had scientific clouds for a decade now, and it's been very slow for centers and groups to move to the cloud, especially in neuroimaging, where there are lots of health information concerns. For example, we have our data instance in the Google cloud, and technically everything just sits up there, so it would be very easy to have web-based interfaces where you don't actually download the data, you analyze it from there. But one of our partner imaging centers here in Pittsburgh is one of the hospitals, and when you try to tell the administrators at the hospital that you want to push their imaging data to the cloud, I think three of them died of a stroke. It's just one of those things where it's very hard to get people to think about it. Still, especially for these very large data sets, the Human Connectome Project sits on Amazon Web Services, so you don't download everything to your hard drive; you mount the S3 bucket and grab the data from there. More and more it would be better to have a data-curating group and tools that you could upload and run in a cloud instance rather than scraping the data down; that would make things move very quickly in our field. But again, that's me looking at it and saying it would be great for everybody to do; I think it's going to take a long time for the field to move in that direction.
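As a rough illustration of that 'mount the bucket instead of downloading the data set' idea, here is a minimal sketch of listing and streaming objects from a public S3 bucket with anonymous access. The bucket name, prefix, and file name are placeholders rather than the Human Connectome Project's actual layout, and boto3's unsigned-client pattern is just one common way to do this.

```python
# Hypothetical sketch: read one object straight out of a public S3 bucket
# without downloading the whole data set to local disk.
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Anonymous (unsigned) client, sufficient for publicly readable buckets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

BUCKET = "example-imaging-data"   # placeholder bucket name
PREFIX = "sub-01/func/"           # placeholder "folder" for one subject

# List what is available under the prefix instead of mirroring it locally.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Stream just the bytes you need (here, the first 16 bytes of one file)
# rather than pulling the entire file; cloud-side tools work the same way.
resp = s3.get_object(
    Bucket=BUCKET,
    Key=PREFIX + "sub-01_task-rest_bold.nii.gz",  # placeholder file name
    Range="bytes=0-15",
)
header_bytes = resp["Body"].read()
print(len(header_bytes), "bytes read without a local copy")
```

In practice you would more likely go through a higher-level layer such as s3fs, DataLad, or a hosted analysis platform, but the principle is the same: the interface is the listing and the reads, not a pile of local copies.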
How do we distinguish between data sets that are subject to this overuse and something like the human genome or the yeast genome, which becomes more of a resource? I don't think you can overuse the NCBI yeast genome, right, judging by the number of papers and so on.

Most of the work I've done has been from the standpoint of probability: null hypothesis significance testing, Bayesian analysis, and so on. The assumptions underlying both of those ways of thinking about probability are that the analyses are chosen independently of the data and that the statistical tests are independent of one another, and those are the assumptions we violate when we return to data sets again and again. Now, if you're using a data set as a standard, or for exploratory or just descriptive work, if you're not doing a hypothesis test, then you don't have to worry, because you're not invoking probability in your statistics at all. I think about it in terms of how I explain to my students that what their advisors are telling them to do is a bad idea. So I think there are some senses in which we can't overuse a data set: when we're trying to describe things, or build on them, or just show what is there and not infer to a broader population. That's the key step; it's when we start making inferences that we can overuse things.

All right, so I said I was in neuroscience; I was a student until last week, basically. What I was interested in, in my lab, was computational neuroscience, and we did a lot of what you would call data exploration: we collect these huge data sets and then look at what's interesting. But it's really rare that we can actually write that into a proposal or a grant, though maybe that is now changing, because the funders and the people reading want 'this was my hypothesis,' and I feel like a fraud going back and framing it that way, because it would completely change the way we look at data sets. It's interesting how, even when you are doing exploration, sometimes you're forced to convert it into hypothesis testing.

Any other questions? Okay, one of my questions, something that's come up a lot during the day, is about data curation, and the other side of that is software. Both of those things take quite a bit of effort, either to do the curation or to maintain the software, and a few different people have mentioned that it's easier to get grants to write software, but my understanding is that it's not easy to get grants to maintain the software after you write it. Similarly for curation: creating the standards in the first place is something you could maybe get a grant for, but then how do you maintain them over time? I guess it comes back to the community question, but I'd like to hear from probably all of you your thoughts on those issues and on sustainability.

I can give it a shot, but it's always the hardest thing. I'm actually putting in a grant just to dedicate time and money to hire a consultant to help us with that question, because it's really hard. I think there are a few small funders in Europe that don't fund things at the beginning but instead fund this kind of longer-term maintenance. But setting a project up as an open source community is really hard, and that's what we're trying to do from the software point of view, and there is actually a specific
set of consultants that can help communities and groups build a sustainability plan around how you can actually engage other people to continue not just using the software but investing in it. And that's really hard to get; it's not an answer, but there it is.

Something that came up earlier today was thinking specifically about training for graduate students, and I think there's another answer to that question which also answers yours: we need to not just train people on how to create standards or create software or uphold those standards, we need to incentivize it. That language needs to make its way into job ads, so that when people are hiring at universities or in industry, one of the things they evaluate is not just how many publications you have but whether you post your data online and whether the data you posted are data other people can use. We evaluate the quality of people's published articles; we can evaluate the quality of their data. My department is currently adding language to its tenure and promotion documents to say, here are new things we consider evidence of productivity, things like giving workshops to teach people about open science, creating software, creating tools that people can use to make their data more rigorous. We're also adding language to the way we evaluate research to say it's not just that you need a lot of publications in good journals; we also evaluate whether you're using preregistration, whether you're posting preprints, whether you're making things transparent. It won't be as much of a burden on our students, and it won't be as much of a leap for people to push forward on these activities, if we actively reward them in the places that matter most, which is whether you get a job.

Do you want to follow up? Yeah, about that: you said you're going to evaluate the quality of not just the papers but also the data. What metric do you use to evaluate data?

So, I don't know if Casey's still here, but the Research Symbiont Award has already figured out how to evaluate the quality of data: things like, is the data set available, are people using it, how many people are actually using it, and what is the quality of the publications coming out of this data set that you're not necessarily a part of. Is it something someone else can open and read? Is it the kind of thing I can send to a collaborator without them needing to know what processing software I used or what standard I used to get that data, so they can just start doing stuff with it? There are much smarter people than I who are already thinking about how we can do this; we should ask them, but we should start pulling those ideas together and bringing them in. Don't take my word for it.

This is for Daniela: you mentioned that on PREreview you have the option of either being pseudonymous or making it open, but once you make it open you can't go back? Right now, yeah. So, I always sign my reviews, but for one of the articles the editor actually wrote to me saying, 'I find you signing reviews noble, but in this particular instance the senior author is a vindictive jerk and I strongly urge you to reconsider.' So maybe that's a tricky line: maybe sometimes you're mostly public, but maybe there are some instances
where you don't want to be.

The reason it is that way right now is that we had a limited amount of funds for that MVP, and it was about making sure we could protect users' privacy and security: once you go public, there is a moment in time at which, on the internet, your name has been associated with that piece of text. I know you're talking more about journal-based peer review, but for us it was more that the software engineer building it wasn't sure the way he was doing it could absolutely ensure that, if you go back, you retain that privacy. With the new group at Outbreak Science, though, we actually found a way, and we're going to let people switch, not in the journal peer-review setting but in the sense that you can go back and forth, and we're doing user testing to see how people react to it. I also thought it would be more interesting for me to study how people's behavior changes, so I was selfishly thinking about that too. But when we interviewed people, it was actually incredible; most of it was biased, because we interviewed people who were already bought into open science, and they were like, oh, I'm going to be public from day one, and I was like, really? You're a first-year student. When we did the survey, though, which was anonymous, we got a lot of people who were super concerned, like, I would never put my name on it. So I don't know, we're going to see how it works, but yeah, the reason it's like that is more a technical point.

Okay, I think we're going to have to wrap it up now, but please feel free to continue the conversation over the break. We will come back at about 2:35, and there's fresh coffee and popcorn in the lobby, and if you want to talk to Hajin about a demo you can do that now too. Thank you. Anybody presenting a poster or demo, please come see me at the back table; I do need you to set up, and I need your keywords for the demo.