So, I'm Gustav Nilsonne, and it's a pleasure to talk about open data in neuroimaging today. I thought since we're not that many, we could do this rather conversationally. I'm going to give a more formal version of this talk on Wednesday at 10 to 12, in case you're interested. So, we'll talk first about why: what are the reasons for publishing and using open data in neuroimaging. And then we'll spend most of the time on how: what are the principles to think about when publishing and using neuroimaging data, what are the ethics, and so on.

The reasons for open data in neuroimaging are, as far as I can see, the same as for open data in science in general. What do you think? Let me ask you, to get a feel of the room: how many of you have ever published open data at all, in any kind of research? About one fourth. And how many have published open neuroimaging data? Zero. Alright, so there's some experience with publishing open data in the room. What proportion have ever attempted to use openly published neuroimaging data? We have one project on that today. Alright, about one fourth. In your minds, what are the most salient reasons for publishing open data? You can use it again, and there's less waste of research money and researchers' time. It has reuse potential, presumably. Someone else can reproduce the results. That about covers it, I think.

Data have reuse potential, and that is something I like to split into two categories. One is that when we try to estimate an effect, we ideally want to base our estimate on all the available data. So if we try to understand what the effect of X is on Y, then ideally we would want to take all the experiments where this has been tested, and we would want to factor them into an assessment, or a judgment, of how the evidence looks for this effect. The only way we can do that is if the results are reported and, preferably, the data are available. A lot of the time, if we try to do this formally by means of a meta-analysis, it can happen that people have done different things, and in a meta-analysis we need to try to manage the heterogeneity that comes from experiments having been run in different ways, on different participants, at different times, and so on. If all the data are available, then it's often possible to model different covariates, and then we can manage differences between groups that might be due to characteristics of the participants, such as their age or sex or other things. So open data allows us to deal with heterogeneity when we try to reuse data to understand an effect.

It also reduces the risk of bias. Again, if we try formally to put data together and understand an effect, we are always limited by the risk of bias due to data not being accessible, and furthermore by the risk that the data are inaccessible for some systematic reason: for example, people not publishing data when they don't show the results that they wanted. So if I have an experiment with a certain number of data points, I have to interpret those results in light of all the data that might exist that I don't know about. I may not know whether it exists, or I may know that it exists but not be able to access it. That causes bias. So you might say that the effects that I can observe in the data that I have at hand are influenced by the data that exist in a drawer somewhere, in different places. The data that are inaccessible project uncertainty onto the data that we can access and use.
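To make that file-drawer effect concrete, here is a toy simulation: many small studies of the same true effect, of which only the statistically significant ones get published. This is only a sketch, and all the numbers are invented for illustration.

```python
# Toy simulation of the "file drawer" problem: many small studies of the
# same true effect, but only those reaching p < 0.05 are published.
# All numbers here are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
true_effect = 0.2          # true standardized group difference
n_per_group = 20           # deliberately small studies
n_studies = 200

published = []
for _ in range(n_studies):
    treated = rng.normal(true_effect, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:                       # the drawer keeps the rest
        published.append(treated.mean() - control.mean())

print(f"true effect:           {true_effect:.2f}")
print(f"mean published effect: {np.mean(published):.2f}")  # markedly inflated
```

Under these assumptions, the mean published effect comes out several times larger than the true effect, which is exactly the kind of distortion that drawer data create.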
And the only way to manage that bias is to try to bring as much data as possible out from the drawers and into the light. If we can estimate how much data there is, that helps, and if we can access it, then we can reduce the bias. I have an example that I'm working on right now: the effect of sleep deprivation on resting-state functional connectivity. We've done an experiment with about 80 participants, and there have been other reports in the last 5 to 10 years. There are 15 to 20 reports out there, with a total of about 400 participants reported, and an unknown number of experiments not reported. This literature is not very consistent. People have used different measures of resting-state connectivity and they have found different results. So taken as a whole, this published literature is rather hard to interpret. I think that is largely because many of the experiments are quite small, and there are so many different outcomes that people have looked at that there may be a bias in the reporting. Now, the present state of affairs is clearly somewhat unsatisfactory. In order to understand what effects are really there, we would have to do a very large experiment, I think, to get solid evidence. But even if we did a large experiment, with let's say 400 participants again, roughly the same number that has been reported, that would go a long way, but it would still be affected by bias from all the experiments that have been performed but from which we cannot access the data. So from the point of view of inference, ideally speaking, the only really good solution would be to bring out all the data that already exist, so that we could put them all together and see what we can find. And this would of course also represent an economical use of resources: we wouldn't have to do another big experiment, and we wouldn't have to expose more participants to risks, even though the risks of MRI scanning are quite small. So that's one reason to publish open data.

Incidentally, I've reviewed these papers quite closely, and not one of them has open data published. One of them is reported in a journal that has an open data policy, and it has an accession number to a data repository, but there is nothing there, unfortunately, which illustrates that policies can have varying levels of follow-up to check that they are actually observed.

So usefulness for inference, as it were, and reduction of bias is one reason for publishing open data. Another one, which is still in the broad category of data being useful, is that you might use the data to look at something else that's interesting. Does it happen often to you that you have a scientific question and it's clear that someone else has already gathered the data that could be used to answer it? Have you experienced that? I can see a lot of nodding going on. Yes, I have experienced it too. I have another example there. Last year we published a meta-analysis of the diurnal variation of interleukin-6 in the blood. I'll not go into any details about that, except to say that there have been many studies where people have measured interleukin-6 in the same people several times in the course of the day, and a lot of them were not primarily interested in diurnal variation. But when we are interested in diurnal variation, those data are very useful, and in some cases we were able to access them and use them again. And that helps a lot. And I personally think it's very hard to predict what someone else might want to use your data for.
I think there's always value left in data; it's very rare that you extract all the potential value from a data set. So that's another reason to publish it. And a third reason, as Anders mentioned, is that it helps someone else verify that your results are correct. Of course, there are always errors in anything if you look hard enough. But the publication of open data allows us to find those errors more quickly, I think, speaking about research in the aggregate. If I've made errors, and I certainly have, then I'm happier if they're spotted by someone than if I continue to believe something that's not quite right. And maybe more importantly, I think publishing open data helps prevent errors in the first place. A lot of things can be caught if you carefully prepare the data for publication. I've spotted a number of inconsistencies in my own work when I've been preparing data for publication, and that has helped me to avoid embarrassing mistakes. For instance, only the day before yesterday, I had a data set that I was re-analyzing. I'd already submitted it to the journal and I'd put the data up, and then I made some revisions. And then I noticed that there was one column that was supposed to be data from the putamen, but I had accidentally put in the hippocampus data twice. And I only noticed this after the data were actually published. But it was in the round of revisions, so we were able to catch it before it was too late. And that happens to me all the time. I think I'm perhaps more error-prone than many other colleagues, I don't know, but it helps. Many of you will have noticed that there's a lot of debate in social media about research misconduct, and to my mind, a lot of those cases could have been prevented as well by publishing data openly. So that can really help to avoid embarrassing situations.

Even in collective efforts, sometimes there are still errors. It's kind of good that they are out there and you can reuse them.

Indeed. There will be errors. I'm reminded of my old supervisor, who was in charge of the cervical cancer screening program in the Stockholm County Council. Once a year there would be a quality review meeting where they looked at all the misdiagnoses from the past year. And his stroke of genius was to always bring a cake to that meeting, to celebrate the high quality rather than to feel bad about the mistakes. Or rather, perhaps, to celebrate the continuous work to improve quality.

All right. I think those are the main reasons to publish open data. You might also think that it's important for one other reason, which is that a lot of people are starting to require it. Journal policies are starting to require it. The Swedish Research Council is taking the view that open data is a good thing. They will not start to require it just yet, but they will start very soon to require data management plans when you submit applications. And then it helps to know what strategies you can use to describe how you're going to publish your data.

So what are the principles to think about when you want to publish data openly? There are some different guidelines that it's possible to look at. One is the FAIR guidelines, which say that data should be Findable, Accessible, Interoperable and Reusable. These guidelines put a lot of focus on appropriate metadata, so people can find the data and understand what they are. Field-specific guidelines may be very helpful, and perhaps in the case of neuroimaging data even more helpful.
I would actually like to take as a starting point another set of guidelines, which is that of the open data badges. They are maintained by the Centre for Open Science and have been adopted by, I think, around 20 scientific journals. The badges are little marks that are put on a scientific paper if it fulfills certain criteria for openness. So you can have an open data badge, or an open materials badge, or a badge for pre-registration. And I like these criteria, partly because I was involved a bit in defining them and there were a lot of useful discussions. So, to earn the open data badge there are a few things you need to do. First of all, you need to publish the data in an open repository: a repository that has open access to the data, and the repository must allow for the data to be archived in a way that is time-stamped, immutable and permanent. Now, we chose that as a best practice. The purpose of these badges is to describe practices that are worth striving for, rather than the practices that exist in the field today, so as to have an incentive. And we considered different ways to publish data openly. You might put the data up on your own server, or on the department server; you might put it in a supplement to a scientific paper. In some cases you can publish the data in the paper itself, which is not a bad method if it's a small data set that fits in a table, but that's the exception in our field. So the use of a repository ensures that there is someone who takes responsibility for making the data available in the long term. If you put it on your own website, then you have to maintain it, and that's quite hard in the long run. The data should be time-stamped, immutable and permanent, and I think it's probably obvious to you why these are desirable qualities: then you can go back and check the data as they were at the point in time when they were used to make a certain inference that someone is putting forward in the literature. And of course, if the data are permanent, then they will not be lost.

Where do you find a good repository for your data? Have you tried? Does anyone have a preferred method?

I have good experience with, for example, ABIDE, which is a very good multi-center effort on autism. They have all the fMRI data and they have quality control. This is huge. It's really cool, because they're reviewing all the data, even though it comes from different centers and everything.

Can you remind me what the ABIDE acronym stands for?

It's basically about autism. I don't actually remember. Is it the Autism Brain Imaging Data Exchange?

Possibly. That's not the most important thing to sort out at this moment. I think that's a very good example of a highly specific repository that is doing well. How did you find that repository?

We're involved in a project and we wanted to have a good amount of data, and the quality control is really good.

How does that work? Do they have a data curator that goes through all the data?

Yes, they review the data through different steps of the process, and you can actually access data from different parts of the process. And you have the anatomical as well as the functional data.

That's very useful. And my impression, from looking at things like poster sessions at conferences, is that this is a resource that is being well used and that is helping a lot of our colleagues.
More generally, if you want to find a repository, I can recommend, for example, the list of repositories that is recommended by the journal Scientific Data. There are a number of general repositories as well as field-specific repositories listed there. I suggest, if you have field-specific data, that you use a field-specific repository, because then you will not have to make so many decisions on your own about formatting. I'm starting a collaboration with the people behind the OpenfMRI database, which I personally like and which I'm happy to recommend. This is a database that, contrary to its name, does not only take fMRI data; it takes any kind of structural and functional neuroimaging data from humans. I don't know whether they also intend to take other kinds of data. I believe they're planning to change the name to reflect this broader scope. Since I'm starting a collaboration with the keepers of this database, I thought I'd try to help anyone who has a data set that they would like to publish there. I would be very happy to be a link between you and your data and the repository. The OpenfMRI database also has a high degree of professional support, with curators that help to take your data and put it into the desired format in which it's going to be published, which is the Brain Imaging Data Structure, the BIDS format. I'll leave it to Anders to say more about that in the next talk. This is a structured format that helps facilitate reuse. You can publish fMRI data in field-general repositories as well, and in a less structured format, but I would not recommend that.

I can mention one recent experience I had as a reviewer for the journal PLOS ONE, which has a rather strong policy these days saying that the data should be openly published. I was a reviewer for an fMRI experiment, and the authors said that they had followed this policy, and they pointed to a publication in the figshare repository. What they had published there were the final t-maps behind the figures in their paper. So I looked at that, and the maps were fine. They were accessible; I could download them; I could use them. But importantly, the final t-maps are not the data that were used to arrive at the results. So what does it mean to demand that data be published openly? This is perhaps still not a completely settled question in every place that has an open data policy. What I suggested in this case was that they should put the raw data up in a suitable repository, because that's what we need in order to reproduce the findings that are in the final t-maps.

Thank you for that question: raw data or pre-processed data? That's another point I wanted to make, actually, not least with neuroimaging data. We have many processing stages, and which stage of processing is most valuable will probably differ depending on what you want to use the data for. But if you put the rawest possible format out there, then it will be possible to reconstruct the processed stages. So that is, in most cases, the way to preserve most of the value in the data. It has happened repeatedly in the last 20 to 30 years in our field that people have come up with better methods for pre-processing of fMRI data, as well as PET data and other data, and it can be really interesting in some cases to go back to our data and do the pre-processing again. My recommendation, and the recommendation of the open data badge, is to put up the data in the rawest possible format.
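To give a flavor of the format before Anders covers the details, a minimal BIDS data set might be laid out roughly like this; the subject and task names here are just placeholders:

```
dataset_description.json
participants.tsv
sub-01/
    anat/
        sub-01_T1w.nii.gz
    func/
        sub-01_task-rest_bold.nii.gz
        sub-01_task-rest_bold.json
sub-02/
    ...
```

The shared naming convention is what lets curators and analysis tools find the anatomical and functional files without any project-specific instructions.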
So, just a question: in fMRI you generally put up NIfTI files, not DICOM, and there's at least one transform between them. Is there a reason why fMRI as a field has decided on NIfTI and not DICOM?

Is there a reason in fMRI why we decided to put up NIfTI files and not DICOM files? I couldn't say, really. It may be appropriate to archive DICOM files; I'm not sure whether any information is really lost in the transformation to NIfTI files. That's a technical question that is beyond me. Probably most or nearly all of the information, maybe even all of it, is retained. I'm going to repeat some of that for the benefit of anyone who's trying to listen: it was pointed out that the DICOM headers might contain information that is identifying, and that would not be retained after conversion to NIfTI. Right.

This brings us to the question of the identifiability of participants' data, and to ethics more generally. What I'd like to point to here is the general principle that we have to weigh the benefits against the risks and harms. What we like to think is that the participants agree to participate in our research because of the benefits in terms of knowledge gained. That's why they undertake risks and harms. These may be invasive procedures in some types of brain imaging; risks of incidental findings; risks of pain and discomfort during scanning; and also the risk that the data are somehow used to the detriment of the participants in the end. This is a risk that exists and that we have to consider, and it has to be outweighed by the benefits, consisting primarily of knowledge gained. So our responsibility to the participants is to bring out as much as possible of the benefits, and we can do that by publishing data openly. One very important point I'd like to make here is that when you consider whether to publish data from humans, you don't just have to think about the risks involved; you also have to weigh them against the benefits. I think we sometimes see an overly legalistic perspective from some of our governmental agencies, which tend to weigh only the risks and not the benefits. But what we want is to weigh the balance.

Now, we don't want human participants to be identifiable from data we have published openly. We want to prepare the data in such a manner that you can't find out who is who. That's something you can think about in a structured way if you want. If you're trying to identify someone, there is a target population: the participants that you've had in your experiment. And there's a reference population, usually bigger, which is the population you're trying to match against. The reference population might be, for instance, the whole population of a country. Now, in order to have a successful matching, a participant needs to be unique in both the target population and the reference population. If they are unique on some set of variables, then you can have a perfect match. If they are not unique, then you can still have a match with some degree of probability. This kind of reasoning is perhaps most applicable to registries and population-based studies, where you have information such as people's age and sex and area of residence and maybe occupation. With a number of variables like that, you will eventually be able to make a matching. To reduce that risk, we can take out variables that are unique.
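As a rough sketch of what that can look like in practice for tabular variables (the table and column names here are hypothetical), one might screen a participant table for unique values before publishing, and then coarsen or censor the offending columns:

```python
# A rough sketch of screening a participant table before publication:
# flag columns where someone has a unique value, then coarsen or censor.
# The table and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age":       [23, 31, 31, 58, 44],
    "sex":       ["F", "M", "M", "F", "M"],
    "height_cm": [162, 181, 175, 169, 190],
})

# Flag columns in which at least one participant has a unique value.
for col in df.columns:
    counts = df[col].value_counts()
    n_unique = int((df[col].map(counts) == 1).sum())
    print(f"{col}: {n_unique} participant(s) with a unique value")

# Mitigations: lump age into 10-year bins, censor a column that stays unique.
df["age_decade"] = (df["age"] // 10) * 10   # e.g. 23 -> 20, 58 -> 50
df = df.drop(columns=["age", "height_cm"])
print(df)
```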
If we have variables in our data set about, for instance, participants' age, or biometric measures such as their height or weight, and these values are unique in the data set, then we can censor those columns. Or we can categorize them: we can lump the participants into age categories of 10-year bins, for instance. This helps against having unique combinations of data points across columns in our data set. But sometimes we can't do that completely. In brain imaging, of course, we need to have a picture of the brain; if we don't have a picture of the brain, then we really don't have much data left to work with. And the brain itself is unique to each person, rather like a fingerprint. So how can we manage that? Well, we again have to weigh benefits against risks. We can reduce the risks by taking out information that is not pertinent to the brain. What the OpenfMRI database does is to cut off the face region of the head. This means that if you try to match a data point to a reference population where you have information about someone's face, that will not be possible. But if you attempt matching against a reference population where you have a high-resolution anatomical scan of the same person, one that is not too far off in time, then there is a risk that you will be able to make a matching. So then you have to assess the risk that someone will be able to do that: they have to know which person they're looking for, and they have to know to combine the two data sets, one of which has the identifying data. And we also need to consider the harms that would follow if identification were successful.

The risk of re-identification from an anatomical MR image is probably quite low, the reason being that in order to know whether a participant is in the reference population, you probably need some kind of insider within healthcare, or within another research project, who suspects that the participant exists in both data sets and is able to do a matching. It's not generally possible to obtain an MR image of a human without their knowledge, which also limits the future risks. Now, the level of risk that we're left with might be unacceptable if we have some very sensitive data about the person in the data set, but it might be acceptable if we don't. So that's the next variable to consider: what is sensitive about the information that exists in the data set? Typically, sensitive variables are things like information about the participant's health status, which is quite common at our medical university, but it could also be something like information about their sexual orientation, or about their private affairs one way or another. So another way to protect the data is to remove the variables that are sensitive and not publish those. Now you can think about different threat models: who is it that is going to do this matching, and how likely is that? Most of the time when I try to do that, I arrive at the conclusion that the most dangerous threat models are either an insider, or the participants themselves trying to re-identify themselves in the data set. And in this latter case, the problem is not that bad, because they have a right to know anyway if they ask.

The data that we've gathered on our participants are owned by the university. The university is the forskningshuvudman, the responsible party that runs the research project. This means that the information about the participants is an allmän handling, a public document.
It's something that anyone can ask to see, and we have to show it to them, except for sensitive information. So we don't own the data ourselves, and as the law now stands, we can't make sweeping promises to our participants that we won't give anyone the data. And that's something that not everyone knows. There are examples of consent forms that say things like: no one outside the project is going to see these data. And that's a promise we can't make.

Do you have to specify everything in detail to the ethics review board?

Yes... so the ethics review board application is what gives us... okay, I'll put it this way. Whether or not you write in the application that you're going to publish data openly, the data are still public documents, except for sensitive information. This means that it's typically not a hindrance, if you want to publish data that have already been collected, that you haven't mentioned it to the ethics review board. You can, if you want, make an amendment to your application to really cover yourself. But the ethics review board is not the main thing, in my opinion. It's the consent by the participants: that's really the important thing here. If it's a private research institute, then typically they own the data, and then it's not necessarily open to the public.

So, we want to respect the participants' autonomy, and therefore we want them to know exactly what it is they consent to, and therefore we want to tell them that we are going to publish their data openly. We also want to protect their integrity, and therefore it's good if they know what information is going to be in a data set, so that they themselves can assess the risks. And it might, at least in theory, inform their future behavior: for instance, the decision whether to undergo another MRI scan, if they know that one MRI scan of their head has been published. There are some good examples of language to use in consent forms in English. There's no example in Swedish, but I'm working on putting that together with the help of some people at the Department of Medical Ethics here and a skilled jurist who works with open data. Eventually, when we have something to show for it, we will put it up online.

All right. This is all to say that the best thing we can do for our participants is to bring out the most value, and we usually do that by publishing as much as we can from the data set, while keeping the risk of accidental re-identification as low as we can. My experience is that you can de-identify a data set very aggressively and still have data left that are valuable. For instance, I mentioned that last year we did a meta-analysis of interleukin-6 in the blood, and one of the questions that faced us was: is this variable exponentially distributed, more or less? Because if it is, then we want to log-transform it. It looked like it was, so we did. But it would have been better if we could have based that on some independent data. And all we would have needed, in that case, would have been the data themselves: one column of data points, with no relation to anything else. A data set of that kind could have been de-identified to the point where nothing else is left except the variable of interest, and we would still have found that very valuable.

Okay. I think I've covered the things that I wanted to say. So please consider it an open invitation: if you're thinking about publishing data, come and talk to me. I'll see if I can be of any help, and that would be fun. Are there any thoughts or questions?
So firstly, I agree with all the benefits of open data and so forth. But I always like to think of a possible negative, and there's one that I've been concerned about for a while. I get the open data part, but imagine a professor with a data set who has ten possible hypotheses about that data set. If he tested them all himself, he would obviously have to correct for all those hypotheses. But instead he gives the data out to ten different PhD students, and each one does one analysis, so that he doesn't correct for it, because those ten publications come out over the span of five or ten years, and people don't think of them as connected in the same way. Isn't this a possible problem with open data, that we get too many false positives because there are so many hypotheses per data set?

Yeah. So the question is really: does open data lead to more uncontrolled data mining, with poor inferences and false positives? And it might. But I think the answer is not to take away the data. The answer is to control the inferential process. I mean, you could use the same argument against any data, really. Why should scientists have data at all, if they do exploratory analyses? Couldn't we take away their data so they can't?

That's not what I mean. Don't we also need procedures, it seems stupid to say, to control for the possible multiple comparisons?

Yes, it is still problematic; I agree with you. That problem will not be solved by open data. But I do agree with Anders that pre-registration goes a long way, and open data also increases the likelihood that you can do a confirmatory analysis on some other data set that's out there. I think that helps: pre-registration, and basically more people working on the same issues. And as long as we do it within a rigorous framework, where it's clear what we have done, I don't think that becomes a problem for interpretation.

Well, it is still a multiple comparisons problem. I mean, if you're not correcting, and there are two thousand people doing analyses on the same data...

No, sure, but then it's up to the person interpreting all those results together to do that correction internally. Right: as long as everything is registered, as long as we know what has been done, then it's up to each researcher trying to integrate all the work that has been done.

The problem with that, though, is that if we have to do a multiple comparison correction for every single pre-registered analysis on the data, that is going to decay the possible inferences on the data. Because let's say I just do a giant number of really, really bad hypotheses on some data set, and I do one million tests and pre-register them all. Now somebody else has to also correct for the million hypotheses I've had on that data. So actually getting a correct inference becomes nearly impossible, if you really try to take this into account. If you do a purely exploratory analysis, then at least you know what you have, and when you find something you know whether you have a good reason to believe it. Because, and I agree there is value in those data, I can be an exploratory person; but if you want to do a non-exploratory analysis, how do you deal with multiple comparisons, either the future comparisons that people could do on the same data, or all the comparisons somebody else has already done on that data?

This is an interesting discussion.
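The arithmetic behind that worry is easy to sketch (a minimal Python illustration; the test counts are arbitrary): with m tests, a Bonferroni correction shrinks the per-test significance threshold to alpha divided by m, so a million registered tests would leave almost no power for any single finding.

```python
# The arithmetic behind the worry above: with m pre-registered tests, a
# Bonferroni correction shrinks the per-test threshold to alpha / m.
# (False discovery rate methods are less punishing, but still depend on m.)
alpha = 0.05
for m in (10, 1_000, 1_000_000):
    print(f"m = {m:>9,} tests -> per-test threshold {alpha / m:.1e}")
```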
I would prefer not to cut it short, but it is time for Anders to have the floor. So let's move on now to his presentation.