Thank you very much for the invitation. In this talk we're going to focus on derived data, which I'll define a bit later, and on the challenges we face if we want to reuse derived data that was shared previously.

But before we get started, I'd like to begin with a little introduction by looking at a chart published by Poldrack and colleagues in 2017, which shows how sample sizes in neuroimaging evolved from the mid-1990s up to today. The first thing that's interesting when we look at this graph is that the number is constantly increasing, which is pretty comforting. But if you look more carefully at the last point on the curve, the point in time closest to us, which is 2015, we see that we have about 30 subjects per study. It's pretty easy to imagine, and I think many of you in this room are aware, that with such numbers it's really hard to model the variability we see in the general population. It's also an issue because we have low statistical power. This produces studies that we sometimes have difficulty reproducing, and results that may not be as generalizable as we would like them to be.

Our community is very much aware of this, and about 10 years ago it started to focus on how we can build large-scale studies rather than these small-scale studies. Part of this is making data available, by some groups, so that it can be reused by other groups. Nowadays, these shared data take many different forms. On one end, we have the data shared at the end of a given study, the 30 subjects we were just talking about. By definition we're going to have many of these data sets, because there are many of these small studies, and they're going to be very different from one another: they target different populations, they use different imaging protocols, and so on. Then, about 10 years ago or maybe earlier, people started to work together and create consortia. A very good example is the 1000 Functional Connectomes Project, an international consortium where people started to share resting-state fMRI to get bigger sample sizes. In those cases we have bigger data sets, we have fewer of them, and they're not completely harmonized, so we're not talking about the exact same protocol, but they have similar goals, here resting-state fMRI. Then, one step further and even more recently, we've seen cohort imaging studies. Here we increase the number of imaged subjects by another order of magnitude. The goal is really to create data sets that will be resources for the community. They usually use state-of-the-art imaging protocols and they're extremely harmonized: for UK Biobank, where the target is 100,000 subjects, before scanning those 100,000 subjects they think really carefully about which protocols and which imaging sequences should be run. So the protocols are extremely harmonized.

All of this data together is the biggest resource we have to inform us about brain function. The question now is: how can we make these data sets work together, and what are the challenges if we want to do that? And that brings us here. To think about the challenges, I'd like to go back to my small-scale study, the historical one. Typically, I'm going to acquire some data and analyze it.
Here, feature extraction is defined in a very loose way; it could be many different things depending on the data you're working with. It's the transformation going from raw data to derived data, the measures you think are of interest for your research question. And then finally you have a statistical test, or another statistical approach, that gives you your results. So we have something that is extremely homogeneous: the data can easily be acquired, analyzed, and published within the same lab.

Now, what happens if I want to work with open data? The simplest thing I can do is go and grab some raw data that is similar to mine and inject it at the level of feature extraction. Here I have to take into account new levels of variability: the open data was acquired in a different environment, using a different machine, maybe the imaging protocols were different, and so on. And this is for raw data. But maybe, because this raw data was acquired in a different environment, I won't be able to use the same feature extraction pipeline. I'll have to use another one, and then combine the data at the level of the statistical analysis. When I do that, I'm working with derived data that was produced by a pipeline over which I have no control. This creates a new level of variability that I'll call analytic variability in what follows: the variability induced by the pipeline that was used to generate the derived data. It comes into play every time I want to reuse derived data.

Another use case: maybe what was shared was directly the derived data. This could happen, for instance, because of privacy issues; maybe the raw data could not be shared, and all that could be shared was the derived data. A more practical use case is that, if we're moving towards bigger and bigger studies, it doesn't make sense for every lab in the world to redo the pre-processing again and again. Here again, if I work with pre-processed data that was shared, I have no control over the neuroimaging pipeline that was used, and I have to deal with this analytic variability. The last use case is meta-analysis. We can see it as yet another case of reusing derived data, except that our derived data is one step further down the pipeline: it is the statistical results that we combine. Here again, we have no control over the pipeline, and we have to deal with analytic variability.

I've talked a lot about this analytic variability, but what do we mean here? Let's be a bit more precise. I like this sentence from the Carp 2012 paper that talks about different acceptable methods. So what could different acceptable methods be? Here we have the pipeline as I drew it in the earlier slides, which is a really high-level view. In practice it looks more like this: many little steps, one after the other, and depending on the type of data the steps can be very different. Spatial registration, segmentation, cross-modality registration; it could be anything. Now let's go back to our generic pipeline and look at its first step. I used a given algorithm, but I could have used another one. Or I could have used the same algorithm but implemented in a different software package. Or I could have used the same algorithm and the same software, but not the same version, because they've just released a new version and it's much better.
Or I could have used the same algorithm, the same software, the same software version, but a slightly different set of parameters. Or finally, the same software, the same version, the same parameters, but in a different environment. All of this is going to create differences in my results, so you can imagine how big the parameter space is.

Here I just have a couple of examples. The first one is our canonical pipeline. The second one is the same pipeline, where I changed some of the parameters of the pre-processing. In the third one, I used another algorithm for two of the pre-processing steps. In the fourth one, I ran the very same pipeline using the previous version of the software, and it turns out that for the third step I had to use a different set of parameters, because at that time this step was different. The fifth one is the same as the initial one, except that we used a different machine for the pre-processing; maybe we ran the pre-processing on an HPC cluster and then did the statistical analysis locally on our laptop. And I'm finishing with the final one: the very same pipeline, but using a different neuroimaging software package, where for some of the steps we don't have the same algorithm, so we have to use different ones. Of course, those are just a handful of examples and there are many, many more.

So the real question in this huge parameter space is: how does this analytic variability actually impact our results? Does it matter at all, or does it not? A number of authors have looked at this question in specific contexts, looking at software versions, at how the operating system affects the results, or at how different fMRI pipelines affect the results. But here I'd like to talk about recent work we've done looking at running the same pipeline across different fMRI software packages. This is joint work with Alex Bowring and Tom Nichols from the University of Oxford and is currently available as a preprint on bioRxiv.

We looked at three studies for which we knew the raw data was available; they came from OpenfMRI. For those three studies, the goal was to reproduce the pipeline that was used in the original paper. Let's say the original pipeline was done in FSL. First, we would try to reproduce the pipeline in FSL using the methods described in the paper. Then we tried to build pipelines in AFNI and SPM that were as close as possible to the original one. And then we compared results using parametric and non-parametric statistics.

To compare the results, we did different things. The first was to look at the figure that was in the paper, because that is basically our ground truth: what was in the paper is what we wanted to reproduce. Typically, in a paper, we find 2D representations of the 3D activation maps. Here the bottom row shows the figures from the paper, and above you see the reproductions in AFNI, SPM, and FSL. In the first column, which is ds001, the figure in the paper was obtained using FSL, and if you look at the FSL results on the second row, they are the closest to the original. This is good news, because it suggests we managed to reproduce the original pipeline, which wasn't certain just from reading the paper. And then we see the other results, using AFNI and SPM, and while they are broadly consistent, there are differences.

Then we wanted to go one step further, because comparing 2D images is not easy and depends very much on which slices you decide to look at. So we computed Dice coefficients to measure how well the activations overlap, and we got results that are not as consistent as you might think from looking at a single slice. Finally, we compared the unthresholded statistic maps to each other using Bland-Altman plots. Again, this is something we could do because we had access to our own results; more and more people are sharing their statistic maps on NeuroVault, but we did not have access to the original maps.
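As a concrete illustration (this is a minimal sketch, not the code from the preprint), here is roughly what those two kinds of comparison look like: a Dice coefficient between two thresholded activation maps, and a Bland-Altman plot of two unthresholded statistic maps. The file names are hypothetical, and the sketch assumes both maps have already been resampled to the same voxel grid.

```python
# Minimal sketch of comparing results from two pipelines (hypothetical file names).
import numpy as np
import nibabel as nib
import matplotlib.pyplot as plt

def dice(mask_a, mask_b):
    """Dice overlap coefficient between two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else np.nan

# Thresholded (binary) activation maps produced by two different software packages.
fsl_act = nib.load("fsl_thresh_zstat1.nii.gz").get_fdata() > 0
spm_act = nib.load("spm_thresh_zstat1.nii.gz").get_fdata() > 0
print("Dice coefficient:", dice(fsl_act, spm_act))

# Bland-Altman plot of the unthresholded statistic maps
# (in-brain voxels approximated here by non-zero voxels in both maps).
fsl_z = nib.load("fsl_zstat1.nii.gz").get_fdata()
spm_z = nib.load("spm_zstat1.nii.gz").get_fdata()
mask = (fsl_z != 0) & (spm_z != 0)
mean = (fsl_z[mask] + spm_z[mask]) / 2.0
diff = fsl_z[mask] - spm_z[mask]
plt.scatter(mean, diff, s=1, alpha=0.2)
plt.axhline(diff.mean(), color="red")
plt.xlabel("Mean of the two z-statistics")
plt.ylabel("Difference (FSL - SPM)")
plt.savefig("bland_altman.png")
```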
This work, along with the others I mentioned earlier, helps us understand a little better how analytic variability impacts the results, but there is still much, much to do in that space. Here we compared pipelines across neuroimaging software packages and tried to keep them as close to one another as possible, but in the wild that's not what people are doing: the SPM community might have different habits from what FSL users do, so we can expect the variability in the results to be even stronger. Ultimately, we would like to know whether this variability affects our conclusions. We would also like to know which specific steps in the pipeline affect the compatibility of results later on, when we combine derived data, and so on. So there is a lot left to do.

We know that the pipeline affects our results. So what can we do today? This brings me to the next point: transparency. We know that the details of the pipeline affect the results, and what we can do right now is disclose as many of those details as possible, so that in the future, once we know how pipelines impact the results, we can correct for variations, or combine data depending on which pipeline was used to produce it. Right now, the details relevant to analytic variability are in the code that you wrote, in your scripts, and also in your environment: the libraries that were used, the tools that were used, and so on.
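As a small illustration of capturing that environment information (this is not from the talk, and the dependency list is purely hypothetical), one might record the Python and library versions alongside the results, so they can later be reported or compared:

```python
# Minimal sketch: record the analysis environment next to the derived data.
import json
import platform
import importlib.metadata as md  # Python 3.8+

# Hypothetical list of libraries used by the analysis scripts.
packages = ["nibabel", "nipype", "numpy", "scipy"]

env = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    "packages": {},
}
for name in packages:
    try:
        env["packages"][name] = md.version(name)
    except md.PackageNotFoundError:
        env["packages"][name] = "not installed"

# Store the snapshot with the results so the versions can be cited later.
with open("environment.json", "w") as f:
    json.dump(env, f, indent=2)
```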
To have this information in the paper, what we need to do is cite the software. So how do we cite software? Here I'm referring to work by Smith and colleagues, who established the software citation principles. I know there is now a working group focusing on how to implement these principles, but let's see how we can apply them right now with the tools we have. To go through the principles: the first one is that citing software is important. I think most of you here are aware that software is how we build our research, so citing it, and giving attribution and credit to the people who wrote it, is particularly important. They then emphasize that we need unique identifiers, that this metadata should give us access to the software itself, that identifiers should remain persistent, and that we should be specific, which basically means mentioning the version that was used. And of course, as I said at the beginning, we should make sure we give credit and attribution to the people who actually built the software.

So how does this work in practice? Let's first focus on how we can cite the custom code we wrote for the analysis. Here I'm showing small snippets from the Bowring preprint I was talking about earlier. In the methods section, what we included is a link to Zenodo. Zenodo is a repository that aims for persistence, and it has a really nice feature that lets you integrate with GitHub: if your code is on GitHub, you can link it to Zenodo, and every time you make a release on GitHub, Zenodo takes a snapshot and stores it. Why is it not sufficient to just point to the GitHub URL? Because GitHub is not built for persistence; we have no guarantee that in five or ten years from now those URLs will still resolve. Zenodo is a way to make sure they remain persistent. So if you do that, you get persistence and accessibility. The nice thing is that Zenodo creates a page that lets others cite your work. Say someone else wants to build on your scripts: they can go to this page, they get a unique identifier with a DOI, they get a list of authors, which can differ from the author list of your paper if that's what you want, and they also get the specificity, with the different versions that come from GitHub.

Now, what we've done for our custom code, we also have to do for the tools we use, because they are part of the environment and they also play a role in analytic variability. Here is an example with neuroimaging software. The first thing we do is name the software, because we are all human and that's how we like to talk about it. The second point is to be specific: which version of the software was used. Then, provide a unique identifier; the RRID initiative exists exactly for that, to provide unique identifiers for resources such as databases and software, so this is one thing we can add. And finally, give attribution. Here you can see that we've cited papers, but the people who build software libraries or tools may have different ways they want you to give them credit, so the best thing to do is simply to look it up and find their page. Here is the example of the FSL page, which lists the papers they want to be cited; but it could also be another link, and it could differ from one version to another, with different authors attached to different versions. Of course, this is not only true for neuroimaging software; it's true for all the software that you use and for all the libraries used by that software, and that's where it starts to get more and more complicated. So here we tried, as much as we could, to list libraries and their dependencies with versions and to give attribution with the corresponding papers.

And this brings us to something we discussed earlier, which is provenance. I'm going to read the official W3C definition: provenance is information about entities, activities, and people involved in producing a piece of data or thing. This is very formal, but in practice provenance is a description of all the steps that led us to a piece of data. And as we've seen, this is important because it impacts our results. This brings us to the idea of machine-readable provenance. Why is it useful to have machine-readable provenance? We've said that we can cite our custom code, and I think this is the state of the art right now and really something we should do, because it's the best way to be transparent. But if we want to go one step further, if someone wants to see what pipeline you actually used, they have to understand your code. They have to understand the programming language you used; maybe you're a Python user and they are a MATLAB user, or a shell user, and they have to understand how you wrote it. The same is true across neuroimaging software: maybe you know SPM really well but not FSL, and so on. Having machine-readable metadata allows us to build new tools that will help us crawl and query this metadata, and so on.
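To make this concrete, here is a minimal sketch (not from the talk, and with entirely hypothetical names) of what machine-readable provenance can look like, using the W3C PROV model through the `prov` Python package: a derived entity generated by an activity that used the raw data, with the software recorded as the associated agent.

```python
# Minimal sketch of machine-readable provenance with the W3C PROV model
# (hypothetical entity, activity, and agent names).
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/study01#")

raw = doc.entity("ex:raw_bold", {"ex:format": "NIfTI"})          # input data
derived = doc.entity("ex:contrast_map")                           # derived data
software = doc.agent("ex:analysis_software", {"ex:version": "6.0.1"})
step = doc.activity("ex:feature_extraction")                      # processing step

doc.used(step, raw)                      # the step used the raw data
doc.wasGeneratedBy(derived, step)        # the derived data came from the step
doc.wasAssociatedWith(step, software)    # the step was run by this software

print(doc.get_provn())   # human-readable PROV-N
print(doc.serialize())   # machine-readable PROV-JSON
```

A graph like this can be crawled and queried by tools without anyone having to read the original scripts, which is exactly the point of making provenance machine-readable.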
Here I want to finish up by talking about the NIDM effort, the NeuroImaging Data Model, which is a long-standing effort to harmonize the way we represent metadata in neuroimaging. This effort is supported by the INCF, and I would like in particular to talk about NIDM-Results, which is the component that is currently released. NIDM-Results is a harmonized model that represents the results of mass-univariate studies; these are typically the results of fMRI studies, and it is also used for VBM. This is joint work with the whole INCF NIDASH task force, so a lot of people are involved. What is NIDM-Results, in brief? You have a big graph with a lot of pieces of metadata, but to summarize, the information in there describes the statistical analysis and the thresholding (voxel-wise or cluster-wise, what the p-value threshold was, and so on), along with all the images that summarize the result. In practice, this is available as a ZIP archive.

What's nice about NIDM-Results is that it enables new uses of the derived data; it facilitates reuse. The driving use case was meta-analysis, because when you have fMRI results from several studies, you can combine them and get a quantitative summary through meta-analysis. Here is an example with scripts: because NIDM-Results includes the statistic images, with values everywhere in the brain, but also the peaks and clusters, you can do either an image-based meta-analysis or a coordinate-based meta-analysis. Another tool that is now in our ecosystem is a viewer for NIDM-Results packs (the ZIP archive is called a pack). You can view a pack either with an SPM-like view if you really like SPM, or with an FSL-like view if you like FSL better, regardless of where the analysis was initially done. So we can see NIDM-Results as acting as an interoperability layer between the neuroimaging tools that historically work in silos, where you do everything from pre-processing up to the results within one software package.

So this is great; how can we use NIDM-Results? There are tools right now to export natively from SPM and FSL: once you are at the stage where you are getting your results, there are options that let you export and obtain this ZIP archive. NeuroVault is compatible with NIDM-Results, so you can then just upload the pack there. And finally, you can cite it in your manuscript as another piece of derived data that is available somewhere on the web.
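Just to illustrate what the machine-readable graph inside a pack enables (this is not an official NIDM tool, and the file name `nidm.ttl` inside an unzipped pack is an assumption), one could load the RDF graph with rdflib and query it, for instance to list the kinds of resources it describes:

```python
# Minimal sketch: query the metadata graph of an (unzipped) NIDM-Results pack.
from rdflib import Graph

g = Graph()
g.parse("nidm.ttl", format="turtle")  # assumed file name inside the pack

# List every resource type together with its label, to get an overview of the
# statistical analysis, thresholds, and result images described in the graph.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?kind ?label WHERE {
    ?resource a ?kind .
    OPTIONAL { ?resource rdfs:label ?label . }
}
"""
for row in g.query(query):
    print(row.kind, row.label)
```

More targeted queries could then pull out, say, the thresholds that were applied or the statistic image file names, without having to know which software produced the results.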
And we've seen throughout this presentation that the output of research is now not only the paper in PDF format: you have bits and pieces in different places all over the web. You might share the raw data, or you might not, but you could. You could share different pieces of derived data, maybe the fully pre-processed data, maybe statistical maps on NeuroVault. You can also share your code using Zenodo, and so on. So there are bits and pieces everywhere, and what ties them together currently is mainly the manuscript. But there is an effort called OpenAIRE Connect that focuses on how we can link these artifacts together. In particular, there is a neuroinformatics portal; I should say that this effort spans many disciplines, so this is not a neuroimaging-specific problem, as other research domains also have artifacts all over the web that they want to gather. But there is a portal specific to neuroinformatics where people can go and say: this piece of data that is available there on the web belongs to me, and it is linked to this other software, and so on. This is still in beta, but I would encourage you to go and check it out.

To finish up, there are many ways you can contribute to these open science efforts if you're interested. You can make your research artifacts citable and accessible to others. You can participate in portals such as OpenAIRE Connect to make those artifacts more easily discoverable. You can use the metadata standards, be it BIDS, NIDM, and so on, give feedback, and use the tools that go with them. You can also join those communities and help build the next specifications so that they will be tailored to your needs. I've included a couple of links here, and I'm happy to talk with you at any point during the meeting. I also have a list of references that you can check out later, and photo and template credits. I would like to acknowledge all my collaborators, including of course the members of the NIDM working group as well as my collaborators in Oxford. Thank you very much for your attention; I'm happy to take any questions.

Thank you so much for the great talk. I had a couple of questions. One: it seems that provenance, or describing what the process was, can be done in two ways. One is that you describe it, so you have a sort of workflow, or capture the workflow in some way. Another is to freeze the machine, like a Docker container. Do you think there's an opportunity, at some point, to essentially have Docker serve as this frozen wax museum of processing, so that people could then introspect those containers to find out what really happened in the processing? And is that potentially better than a description, a graph or a description of the process?

I think both are useful and complementary. Docker containers are great for reproducibility, so we know we'll get the same results, and that's important because it's where we usually start. But then, if I get different results and I'm interested in why, I really need all those details. They could work together: we could have descriptions of what's inside a Docker container, alongside the provenance. So I don't think they exclude each other.

Okay, and the second question: we're building tools and pipelines with the intent not just to get data out, but to produce endpoints where we can compare, where we can check reproducibility when two pipelines produce outputs. As you said, traditionally it's the paper, and the illustration in the paper is not very useful for comparison. Do you have a sense of what we can do to make our data more ready for that kind of comparison, to make it easier for people to follow on and validate?
I think it goes back to sharing as much as we can, including as many intermediate steps as we can, because that's how we usually check whether two things are identical or not: by checking at which point they diverge, basically. So, using the metadata standards where they are available, and where they are not, documenting things as best as we can.

You mentioned different ways of uniquely identifying your software, and you mentioned both minting DOIs and assigning RRIDs. I was trying to figure out what your view is on how these are complementary, or whether they actually serve the same purpose, and whether there is a need to use both.

That's a good question. It's good to have unique identifiers. RRIDs are maybe more suited for software such as libraries and tools that many people are going to use; I'm not sure we're going to create an RRID for the custom code we wrote for one paper. So they are probably complementary: for the code of a paper we'd go to Zenodo, and if it's a tool, we can check whether it's available on the RRID portal or not.

Nice talk. On analytic variability, the assumption I've always had is that it's kind of a bad thing that you want to control, but looking at your results there, I was thinking: is it actually not a good thing, maybe, to run your analysis in different tools to guard against over-interpreting your results?

I think it's not a bad thing, but we should be aware that it impacts our results and it might impact the conclusions we draw. Ultimately, we might want to build tools that are robust to different pipelines: if all these pipelines are equally correct, then maybe we should report results that are robust across all of them, rather than just picking one and saying this is the truth.

Hi, thanks for your talk. I'm interested in your opinion about the shareability of certain data, such as ASL, where there are not only the issues of analysis, but also things such as the shape of the arteries that are going to impact the way you quantify your data. Is that data shareable at all?

I think it's shareable, but we need to have all the necessary details that will help us reuse it. It's a bit like reusing derived data, but even stronger, because your raw data itself is different. But maybe, if you share enough bits of information, you will be able to combine data across different labs and so on. So I think it's still possible, but we need to understand better which parameters affect the images.