Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at rce-cast.com, where you can find our entire back catalog and subscribe in your favorite podcatcher or RSS reader. I also have here Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks for your time.

Hey Brock, here we are, another episode. The weather's turning nice, and that means SC deadlines are probably approaching, right?

Yep. A number of SC deadlines are starting to come up very quickly here, so if you're planning on submitting something to SC, be sure to get your proposals done and submitted. All right, what do we have for today?

We have with us today Dr. Gorgolewski of the Stanford Center for Reproducible Neuroscience. So Chris, why don't you take a moment to introduce yourself?

Well, thank you for inviting me. My name is Chris Gorgolewski, and I'm a co-director of the Stanford Center for Reproducible Neuroscience. My goal in research and science is basically to get more neuroimaging data publicly available to more researchers. Why is reproducibility important? Well, in the business of trying to discover the nature of the universe, we perform lots of studies, and we want to make sure that if we say something today, we can take the same data tomorrow and arrive at the same answer. It's the foundation of doing science.

In the recent past there have been a number of high-profile retractions in the neuroscience area. You've blogged and written about this at reproducibility.stanford.edu. Was this part of the motivation for creating the center, or did the motivation come before that?

The motivation came before that. The issues highlighted in that paper were basically that certain analysis methods are flawed and insufficient, and the paper itself was a little sensationalist in nature, so the situation is not as bad as it was described. But it does show that it is beneficial to be able to take existing data and reanalyze it with newer or corrected methods, and that's where we come in. This is where the ability to replicate results in a formalized fashion is very useful, especially with historical, existing data.

The easiest one to talk about, and I think the specific article you're referring to, is the fifteen-year-old bug in the fMRI software for analyzing data. How does something like that happen, and how is it not found immediately?

The situation was not as dire as it might seem. There was a confluence of different things, and the paper did find some problems with a certain piece of software, but that wasn't the explanation for the paper's major finding. But to answer your question of how things like that can happen: it's basically researcher bias. If you get results that confirm your theory, or you get something that is, quote unquote, statistically significant, you're less likely to go back and search for a problem with your analysis. However, if you don't find what you expected, or your results are not so significant, then you go searching for a bug. So if you have a bug that biases your results and gives you a higher rate of false-positive findings, fewer people question those results, and it's harder to find something like that.
The other thing is that human neuroimaging analysis is very complex and has many different stages, and this was just one of those many stages. It's hard for a single individual to go through all of that code and debug it.

I think you've already answered my next question, which is: why exactly is reproducibility hard? If everybody is using the same tools, then everybody gets the same results, and that makes it difficult to realize there's actually a problem in there, particularly when the results are what you expect. Is that a correct characterization?

More or less. To unpack what reproducibility really is: there are two things you need in order to reproduce someone's results. You need the data, which is the crucial and very important part of it. And you need their code, or codes, whatever you're going to call it. That second part can get quite complex because, especially in scientific software, code evolves very rapidly and we have a plethora of different versions. There are a few papers in neuroimaging showing that things we didn't expect to influence the results can do so: for example, the order in which you link a binary against mathematical libraries can influence the final results. The same goes for the system you run your analysis on. So we have a lot of different variables, and a lot of different pieces of software produced and delivered by different groups, that in the end give you the final results. We somehow have to capture all of that to be able to replicate the results.
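A small illustration of why build and link details can matter: floating-point addition is not associative, so anything that changes the order in which a sum is evaluated, such as a different math library, link order, or compiler optimization, can change the trailing digits of a result. This sketch is ours, not taken from the papers Chris mentions:

```python
import math

# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False

# Accumulation order matters for longer sums, too.
xs = [0.1] * 10
print(sum(xs))          # 0.9999999999999999 (naive left-to-right sum)
print(math.fsum(xs))    # 1.0 (error-compensated summation)
```

Differences this small are usually harmless on their own, but in a pipeline with many stages and thresholded statistics they can propagate, which is one reason capturing the exact software environment matters.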
So why did you focus on neuroscience? I assume that's your area, but why put so much effort into reproducibility for neuroscience specifically?

That's a great question, and there are two takes on it. The first is why people care about neuroscience at all, and that's an obvious one to answer, because the human brain is the greatest mystery in the world. We all have one, and we all wonder how it works. So it's definitely a great problem to work on in general. But why try to solve reproducibility in this particular domain instead of in a general way? Well, with everything we do, we're trying to be very practical. We're trying to stay as close as possible to the real scientists asking real questions, so we can relate to their problems and their limitations. For example, we never want to be perceived as coming down from a high horse and telling everyone that they should use this particular procedure, or that everyone should have 100% unit-test coverage for every single script they write, because that is unrealistic. In the same way, we want to provide practical solutions in a particular domain, because that makes it easier for our target audience, the scientists, to understand them and apply them to their research.

So, in conjunction with my prior question: this is really focused on reproducibility, not a strict check of correctness, right? Because correctness is also incredibly important in scientific research and discovering the nature of the universe and the human brain and all these good things. But what you're trying to do is make sure that people are doing it in a reproducible way, and the correctness check is a secondary, different effort. Is that correct?

Yes, that is correct, because correctness is an incredibly hard question. Of course, we're doing some work trying to evaluate methods and come up with the best way to analyze data, and part of the infrastructure we're building is aimed at providing neuroscientists with easy access to the best methods, even when those are computationally expensive or difficult to deploy. So in that way we're trying to improve quality, but you're absolutely right: I can have a completely reproducible technique that is completely wrong.

On the center's website, you state the goal of harnessing high performance computing to make neuroscience research more reliable. How does high performance computing make research more reliable?

Well, basically, if we can give people access to high performance computing in one single place, we treat it as a resource, basically like water from your tap. This way, we minimize the variability in what people are using to execute their code. It's also treated as a carrot. Remember when I said that reproducibility comes down to access to the data and access to the methods that were used in the original study? Part of our goal is to increase access to the data, and we're using high performance computing as a sort of carrot, a reward for sharing your data. The scheme we operate, or soon will operate, on has a very simple principle: you upload your data into our system, you can run all of those cutting-edge methods using high performance computing, and in return, after a certain grace period, that data becomes publicly available. Because without access to the data, there's no way someone else can replicate the results.

Let's talk about the data repository in a minute, but I still want to touch on the secondary problem you've identified, this very hard part of: I want to run the same program and get the same results, assuming I have the same data. You said something fascinating in your last answer: you want to provide access to tools, and potentially even a single service where people run all their analyses, so that, I think one of the implications is, you can have apples-to-apples comparisons of results. But how do you account for the drift of technology over time? Say you get all the neuroscientists in the world to run on one particular HPC cluster. Well, that cluster is going to last two to three years, and then it's going to be upgraded to the next generation of technology, whether that's the next processor chips or the next generation of numerical libraries, which may give slightly different results even with the same programs and the same data. How do we wrap our brains around that? No pun intended.

Woo! Well, it's a great question, and it points at the fact that sometimes we don't want reproducibility, because we have improvements. It's back to the old "it's reproducible, but it's wrong"; it's reproducible, but it's imperfect; it's reproducible, but it's suboptimal. So obviously, if you want apples to apples and to keep on the bleeding edge, you have to rerun the same analysis with new versions of software again and again. And, as a side note, we are simplifying our problem to just the software layer, or actually just the software layer above the kernel.
So we do assume, and I know this assumption is not 100% correct, that across different architectures and different hardware we're going to get the same results, because if we were to actually guarantee the same hardware, that would be too much to tackle. Software is plenty; we've got our hands full.

So what tools have you created for the neuroscience community to encourage this method of operation?

I can describe the ecosystem from the top down. Our vision, as I said, is to provide science as a service in exchange for making your data publicly available. That service, which we hope to launch publicly this summer, allows you to upload your data and edit it online. And the first step for reproducibility is being able to take snapshots, immutable snapshots, of your data, because to be able to go from results, through methods, back to the data, we have to know what state that data was in, and datasets change: people add subjects, et cetera.

So we have this science-as-a-service platform, but we have to populate it with data first. To do that efficiently, and this is why we're doing this within neuroimaging, we had to come up with a widely acknowledged standard for describing and annotating neuroimaging datasets, so that someone from a lab in Texas could upload a dataset and our tools would understand it, and someone from Helsinki could upload a different dataset and our tools would also understand it. So we developed this standard; it's called BIDS, or the Brain Imaging Data Structure. That is what allows us to interface between the data and the computational methods.

Then we started populating the service with tools, basically the code that people run on their data. To make this widely used and appealing, again, we had to have the best methods from across the field. And it's a heterogeneous field: it's not that you can ask every scientist and they'll all point to one particular software package and say, this is what I want to use and this is the only one. No, there are multiple competing packages and multiple ways of analyzing this data. So again, we reached out to the community, and we had a workshop last year where we brought in the leaders of methodological development in neuroimaging, and we asked them to take their existing tools, put them into software containers, and make them work with our input data format. Having those tools, we can much more easily put them on our platform and make all of this available to everyone. And on top of that, we're also building some of our own tools. Actually, we call these tools BIDS Apps, because everything needs to be an app these days, I hear. So we're building some of our own BIDS Apps in the lab and trying to make them as robust as possible.
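For readers unfamiliar with BIDS: it is a file-naming and directory convention, so a tool can discover subjects, modalities, and tasks from the paths alone. Here is a minimal sketch of what a BIDS-shaped dataset looks like, with a deliberately rough check for one; the real bids-validator checks far more than this, and the dataset name is hypothetical:

```python
from pathlib import Path

# A BIDS dataset encodes its metadata in the layout itself, e.g.:
#   my_study/
#     dataset_description.json
#     participants.tsv
#     sub-01/
#       anat/sub-01_T1w.nii.gz
#       func/sub-01_task-rest_bold.nii.gz
#       func/sub-01_task-rest_bold.json

def looks_like_bids(root: Path) -> bool:
    """Rough sanity check: BIDS requires a dataset_description.json at the
    top level and at least one sub-<label> directory."""
    if not root.is_dir():
        return False
    has_description = (root / "dataset_description.json").is_file()
    has_subjects = any(p.is_dir() and p.name.startswith("sub-")
                       for p in root.iterdir())
    return has_description and has_subjects

print(looks_like_bids(Path("my_study")))
```

Because every conforming dataset looks like this, a tool written against the standard works equally well on the Texas and Helsinki uploads Chris mentions.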
So this is different than the usual data-management-plan requirement of: keep all your data, keep all your code, zip it up somewhere, and tell people where they can get it. You're going more in-depth, trying to get people to do it in the same framework, so that you can run it and other people can run it, and, like you said, your tools would understand data from different parts of the field at the same time. Is that a fair summary of the difference between this and other attempts at increasing reproducibility and data reuse?

Yeah, we're trying to take this carrot rather than stick approach. People want to do science the best way they can. It's not that scientists are evil and all of them are just trying to p-hack everything away; it's just that the tools are missing. We're trying to build those tools and make these things easier. And here the carrot is processing and access to cutting-edge methods. So if someone doesn't have the resources to run a very computationally expensive method that is hard to install and maintain, and they have the option of just uploading the data to our servers and letting us run the analysis for them, it's a no-brainer. They're going to do it because they benefit from it, and we take care of all of the reproducibility. The data is snapshotted, the method itself is in a container, and all of the containers are strongly versioned, so you can always go back to the particular version that generated a result, and to the particular version of the data itself. We attract people with the methods they care about, but provide this reproducibility for free.
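What "strongly versioned" buys you in practice is bookkeeping like the following hedged sketch: fingerprint the exact data snapshot and record the exact container image that produced a result, so the pair can be re-run later. The record format and the image tag here are hypothetical, not the platform's actual schema; the positional arguments follow the published BIDS Apps convention of bids_dir, output_dir, analysis_level:

```python
import hashlib
import json
from pathlib import Path

def snapshot_fingerprint(root: Path) -> str:
    """Hash every file path and its contents in a fixed order, so any
    change to the dataset yields a different fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(root)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Hypothetical provenance record for one analysis run.
record = {
    "data_snapshot": snapshot_fingerprint(Path("my_study")),
    "container_image": "bids/example:0.0.7",        # pinned tag, never "latest"
    "args": ["/data/my_study", "/data/out", "participant"],
}
Path("provenance.json").write_text(json.dumps(record, indent=2))
```

With the snapshot hash and the pinned image recorded together, going back to the particular version that generated a result becomes a lookup rather than an archaeology project.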
Can you give us a little more detail on this? What kind of platforms are you going to let people run on, how much storage are you going to provide, and where are you getting the funding for all this?

That's a great question. In general, it's not easy to get funding for this sort of "we want to make science better" research from typical sources, but we were very happy to receive a generous grant from the Laura and John Arnold Foundation, and that is what is funding this right now. This is mostly the development stage, where we're building all of the infrastructure and all of the software around it. But the idea is that the architecture of the system will allow us to tap into existing, publicly available high performance computing resources. There are plenty of programs, for example on XSEDE or at TACC, where you can apply for an allocation and basically get free cycles on a cluster. That is how we plan to sustain the project in the future, paying for the compute, while most of the money right now goes into development of the infrastructure and the software. Storage will be a bit of a concern in the future. We haven't quite addressed intermediate storage, but we are having some interesting conversations with companies such as Amazon about providing free unlimited storage for publicly available datasets.

So all of this leads to: I'm a researcher, I have a whole truckload of data, and maybe I don't have the resources to do the computation, or maybe I do. What would inspire me to put all this data into your system?

Well, if you don't have the resources, then the resources would inspire you. But if you do have the resources, it's a little bit like having an external backup, and a little bit like having a nice environment in which to do and track your computations. So I do hope that people who have access to computational resources might still opt to use this service, because we're going to provide this data management layer for them, or otherwise maybe they just like us and that's why they'll do it.

Is there any fear that there will be more takers than givers? If the data is freely accessible, is there any fear that it will inspire some greedy actors to just take all that free data but not contribute any of their own?

I see that as a benefit. We love takers. We love those, how do you call them, scientific data parasites. Yes, we absolutely adore data parasites, people who take freely available data and reuse it. This is why we're doing this: if they reuse the publicly available data, that means more science is happening for less money, and that is a great saving. We recently estimated that the data shared through the previous generation of this platform, called OpenfMRI, saved something in the ballpark of three million dollars. That's how much it would have cost to acquire that data de novo, but people just took existing data and did new science on it. So we are not concerned about having more people download data than contribute data.

You mentioned OpenfMRI. What is OpenfMRI?

OpenfMRI is the first truly public database for task-based fMRI; it later evolved to accept any MRI, and that is the form it is in right now. It's a very simple repository for storing human neuroimaging data. One of the ways it stands out from other repositories is that it's completely public: everything deposited in this repository is public domain, so there are no barriers, no hoops that you have to jump through to access the data. That's what makes us stand out. We've been doing this for a good few years and building collaborations with journals and associations, so it's also a respectable place to deposit your data. And we're seeing more push, both from journals and from funding bodies, to deposit data into domain-specific repositories.

So have you seen a shift in culture toward using, not necessarily your environment specifically, but environments like what you've proposed?

I see a shift in sharing data. I don't yet see much of a shift in terms of using science-as-a-service platforms, although I hope it will come. We see it on a smaller scale: there are smaller services that require less data and less computational power but are provided as web applications. So instead of having to download software, you go to a website, upload a few megabytes of data, and you get some fun result, or maybe you do a simulation on the website. A colleague working in the center, Joke Durnez, developed a wonderful app for optimizing the design of an experiment, and that all runs on the web. It's not as computationally expensive as the analyses we're planning, but it shows that lowering the barrier of access by providing web interfaces is attractive for scientists.

So do you have plans to move into other areas of science, either more parts of neuroscience or new domains?

We're probably going to stay focused on human neuroscience, but that doesn't mean we are anywhere close to exhausting its breadth. It's a huge field, and right now we are branching out from MRI to other ways of acquiring brain-related data, such as MEG or PET. So we're trying to combine these different modalities, and we're probably going to stick to the existing solutions. But everything we are learning here, we are very happy to convey those lessons, and hopefully other fields can learn from them the same way we learned from other fields, especially genomics. And we have planned some integrations with broader open science platforms, such as the Open Science Framework, which are not neuroscience specific.
Now, jumping back a little bit: in terms of having giant repositories of scientific data available, particularly as this data accrues over time, does that actually improve the culture of reproducibility itself? With this wealth of data, and then bolting on the second part too, that the code used to analyze it is also freely available, is the hope that the barrier to entry for reproducibility becomes significantly lower and therefore inspires more people to do it?

Yes, that is definitely the hope and the goal, and we already see it. I can give you an example. There is a lot of methodological development: someone proposes a new way of denoising data, or normalizing data, or analyzing data. Eventually it will serve to answer a particular cognitive or neuroscientific question, but first we have to validate and test the method, and that is in itself a scientific result. Without access to open data, you could publish a paper saying, "I validated this method by running these analyses on this data," but no one would be able to actually replicate it on the same data, because they would not have access to that data. And if they try to replicate it on different data and get a different result, they don't know whether it's the data or the method. I'm not saying that replication on separate, newly acquired data is bad; it's very valuable as well. But having access to benchmark datasets is crucial for methodological development.

Going hand in hand with that: if you have access to open data, are you also developing these tools, the numerical analysis and computational aspects of this, in an open source way?

Oh yes, we do everything in an open source way, and it's wonderful. It's basically like joining this very big family of developers and hackers. Especially now, in the age of GitHub, it's just wonderful to see people collaborating on new features, and new faces coming up almost every week on the bigger projects. So all of this is open source, and not just open source in the sense of "here's the source, you can have a look at it," but open source in the sense that we have an open community of contributors, we're open to new contributions, and we do the coding in an open way. Even when I have a comment on a line of code written by someone in the same office, I'll still convey it in the open, on GitHub, because our goal is to build a community and work in this transparent way.

So what license do you make the code available under? And, perhaps a slightly naive question, is the data available under the same or a similar license?

The license we use for code is usually Apache 2.0, because we are open to both open source and commercial applications; whatever pushes the science forward. But software licenses don't have the right legalese to be applicable to data, so for data we're following the lead of other institutions and journals, such as CERN or Nature, and applying a public domain dedication, either PDDL or CC0. That basically gives you the right to do whatever you want with the data, so you can use it for commercial purposes as well.

So Chris, thank you very much for your time. We didn't get into a lot of the detail of how your software-as-a-service for reproducible neuroscience actually works as a technical contraption.
But where can people find out more about the Stanford Center for Reproducible Neuroscience?

You can go to our website at reproducibility.stanford.edu, which has a high-level description of what's going on, with very pretty images. Or you can dive into the code and find us on GitHub; we're under the organization Poldrack Lab, and that's where most of the things are. All of the neuroimaging apps are available under the organization BIDS Apps. And you can always shoot me an email; I'm happy to answer any questions. And if you're doing something similar, or you have comments on how we could be more efficient or better at what we're doing, I would love to hear them.

Okay, Chris, thank you very much for your time.

Yeah, thanks for your time. This was great. Thank you for having me.