compared to, say, what's contained in DICOM. But more importantly, there's no standard way of storing higher-order metadata: for example, details about the task, what happened and when, or phenotypic information about the individuals who've been scanned. It also doesn't have a way to specify relationships between files, so you can't look at a bunch of NIfTI files in a big dataset and necessarily know which ones go with which series or which subject. You need additional information: is this a T1-weighted or a T2-weighted image? We started addressing this in the OpenFMRI project a while ago and developed an ad hoc data organization scheme that we used to share data, which basically said the data should be organized with a folder per subject and subfolders for each run, with particular types of metadata about tasks and about subject information, all stored in either text files or NIfTI files. This was specific to a particular FSL-based processing stream that we had developed to do automated analysis within our group. But that scheme had a number of shortcomings. Most importantly, it was focused on task fMRI, so there was a lot it just didn't cover very well. It was also developed in an ad hoc manner: we built it within our group to address the specific needs of the particular datasets we were working with. And, importantly, there was no way to validate whether a dataset was compliant. Basically, we had to take the dataset and try to run it through the automated stream, and if it broke, we knew it wasn't compliant and had to go back and figure out what was wrong. So, in thinking about the OpenNeuro project, which I'll talk about in a bit, and wanting to do a better job of end-to-end automated sharing and analysis, we realized we needed a much better standard. We started working with the community, and I'm going to talk a bit about how that worked, to develop what is now called the Brain Imaging Data Structure, or BIDS. BIDS is really two things. First, it's a standard for how to name and organize files that contain imaging data. Second, it's a standard for how to organize the metadata that go with those imaging datasets. First I want to talk about some of the principles we adopted, because I think they might differ from the principles other people might adopt in a similar project. One is that adoption was crucial: we didn't want to build something that no one was going to use. That drove a couple of things. It drove us to focus on engaging the community and trying to make this not a Poldrack lab project but a project of the entire imaging community. It also meant keeping it as similar as possible to what researchers were already doing in their labs. We know that if it's good, informaticians are going to buy into it; the question is whether researchers on the ground are going to use it. That's why you'll see, for example, a flat file organization with human-readable names: that's what most researchers already do in their labs, and we wanted to make the step to using BIDS relatively small for them. Second, don't reinvent the wheel. This is probably obvious to most people.
Use existing formats wherever we can. And third, abide by the 80-20 rule. There are a couple of ways to frame this. One is that we want to build for the most common use cases, and not let edge cases drive us down paths that take a lot of time and do relatively little for the majority of people. Another way to say it is: don't let the perfect be the enemy of the good. We know the thing is not going to work for everyone all the time, but if it works for most people most of the time, that's good enough for us. So in January 2015 we held an initial meeting at Stanford, with a number of people who are in this room, to start laying out BIDS. This was funded by INCF and by our Center for Reproducible Neuroscience. That led to a draft standard that we passed around amongst ourselves, and then later in 2015 we opened it up to the community, tried to get the word out via social media and various organizations, and solicited open feedback. We got a good bit of it. Then in 2016 we published the first official version of the standard, along with a paper in Scientific Data. So this is the fundamental idea of BIDS: going from DICOM coming off the scanner, which is organized in a relatively non-transparent way and named in a way that is certainly not very human-interpretable, to something where you don't need to know anything beyond what imaging data look like to tell what's what. You can see: these are subjects; this is an anatomical image, this is a functional image, and that's a diffusion image; and participants.tsv probably has some information about the people who were in the study. That's the basic idea of what a BIDS dataset looks like. If we dig in a little more deeply, first you see that we standardized on NIfTI. There are plenty of arguments about the shortcomings of NIfTI, but it's the format that almost every software package uses, so we standardized on it, and the imaging data are currently all stored as NIfTI. To the degree that we can store things easily in tab-delimited text files, we do that. For example, the participants.tsv file stores information about each of the individual subjects. There is some minimal metadata that we want, like age and sex, but you can store anything you want in there, preferably linked to an ontology that tells us what you're talking about. For denser metadata, we use JSON. For example, the metadata about the NIfTI files that we get via conversion from DICOM are stored in a JSON sidecar associated with each imaging file. I should mention that there's an inheritance principle here, such that if a piece of metadata is the same for all of the images, you can put a version of the file higher up in the hierarchy and it will be inherited by everything below it. We also have templates for how to name directories and files. Again, these derive from trying to maximize human readability while still enabling the expressive power that we need from the format. The directory template, for example, is sub-<participant_label>/ses-<session_label>/<data_type>.
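To make that concrete, here's a minimal sketch in Python of how one might read the tab-delimited participant file and apply the JSON inheritance principle by merging a dataset-level sidecar with a scan-level one. The paths and file names are hypothetical, and this is just an illustration of the layout, not how any particular BIDS tool implements it.

```python
import csv
import json
from pathlib import Path

dataset = Path("/data/my_bids_dataset")  # hypothetical dataset location

# participants.tsv: one row per subject, tab-delimited, keyed by participant_id
with open(dataset / "participants.tsv", newline="") as f:
    participants = list(csv.DictReader(f, delimiter="\t"))
print([row["participant_id"] for row in participants])

# Inheritance principle (simplified): a JSON sidecar higher up in the tree applies
# to all files below it, and more specific sidecars override its values.
def sidecar_metadata(nifti_path: Path) -> dict:
    metadata = {}
    candidates = [
        dataset / "task-rest_bold.json",                  # dataset-level sidecar
        nifti_path.with_suffix("").with_suffix(".json"),  # scan-level sidecar (.nii.gz -> .json)
    ]
    for candidate in candidates:
        if candidate.exists():
            metadata.update(json.loads(candidate.read_text()))
    return metadata

bold = dataset / "sub-01" / "func" / "sub-01_task-rest_bold.nii.gz"
print(sidecar_metadata(bold).get("RepetitionTime"))
```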
And then for individual files, for each type of file, we also have a naming scheme. Some parts of the scheme are required, like the participant label and the modality label, but there's a bunch of other stuff that you can add or leave out in different parts of the name. For example, if you collected ten different inversion times for your T1-weighted image, you can express that in the file name. It will also be in the metadata, but to the degree that you want to separate those out into different files, you can do that. I mentioned before that a standard is really only useful if you can tell whether a dataset meets it, and having to run the data through an automated processing stream and look for errors is not a great way to do that. So we realized early on that we needed to build a validator, and the validator has really come to drive the whole process, to the degree that I refer to it as validator-driven standards development, along the lines of test-driven development in software engineering. Particularly in the context of automated data ingestion and processing, the ability to do robust validation up front is really important. We worked with our partners at Squishy Media in Portland, who have done a lot of the development of OpenNeuro, and they developed a JavaScript validator that can run in the browser. The idea is that you point your browser at a dataset; it doesn't actually have to load the imaging data, it just looks at the header information and the metadata, and it can very quickly, even for a large dataset, tell you whether the dataset meets the standard and, if not, what's wrong with it, to give you a clue about what you need to fix to make it a valid BIDS dataset. As I said, it can run as JavaScript in the browser, and it can also run as a command-line tool via Node, installed through the node package manager. Now, we started with a focus on sharing raw MRI data, encompassing structural, functional, and diffusion MR and a few other extensions from that. But since then, a number of other communities have wanted to build BIDS extensions for their particular data types. There are a number of these BIDS extensions in progress right now, and there's a whole process for how one proposes and develops an extension; Chris Gorgolewski, I'm sure, would be happy to talk to anybody who has other extensions in mind. Right now there's a published extension for MEG, and there are ongoing extensions for PET, EEG, intracranial EEG, and eye tracking. Then, for other data types, with funding from the BRAIN Initiative, we've been developing a set of extensions around processed data, that is, the results of taking the raw data and actually doing something with them, which is obviously a much broader thing than just taking data off a scanner, and also around the statistical and computational models that one needs in order to process the data and to represent processed data.
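Going back to the file-naming scheme for a moment: as a toy illustration of how those names are structured, and of the kind of filename check the validator performs, here's a short sketch that parses the key-value entities out of a BIDS-style name. This is purely illustrative; it is not the actual bids-validator and it skips most of the real rules.

```python
import re

# A BIDS filename is a chain of key-value entities ("sub-01", "ses-02", "inv-1", ...)
# followed by a suffix that identifies the image type ("T1w", "bold", "dwi", ...).
ENTITY = re.compile(r"(?P<key>[a-zA-Z]+)-(?P<value>[a-zA-Z0-9]+)")

def parse_bids_filename(filename: str) -> dict:
    """Split a BIDS-style filename into its entities and suffix (toy version)."""
    stem = filename.split(".", 1)[0]          # drop the .nii.gz / .json extension
    *entity_parts, suffix = stem.split("_")   # the last chunk is the suffix
    entities = {}
    for part in entity_parts:
        match = ENTITY.fullmatch(part)
        if match is None:
            raise ValueError(f"Not a valid key-value entity: {part!r}")
        entities[match["key"]] = match["value"]
    return {"entities": entities, "suffix": suffix}

print(parse_bids_filename("sub-01_ses-02_inv-1_T1w.nii.gz"))
# {'entities': {'sub': '01', 'ses': '02', 'inv': '1'}, 'suffix': 'T1w'}
```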
So we're having a set of meetings at Stanford, the first one in a couple of weeks and the next two over the next year, to bring together experts in each of these subfields to discuss what needs to go into these extensions, again trying to be as community-minded as we can. It's funny: my lab is a cognitive neuroscience lab, we really focus on trying to understand how the brain works, though we do informatics as well, and I think the thing we've done in the last few years that people are most excited about is what one might think of as a rather boring data structure about how to name your files. But it's really amazing how excited people have gotten about this, and I think it shows how important it is to enable the kind of sharing that people really want to do. The 60 on this slide is relatively old; I think it's well over 100 labs around the world at this point who are using BIDS as their primary data format. It's been adopted by a number of data-sharing projects, and there are now more than 20,000 subjects' worth of anonymized data online in BIDS format that one can just go grab and process. There's also a large mailing list with a lot of people who are actively developing further extensions and refining the existing format. If you haven't used BIDS, another group in the community has developed the BIDS Starter Kit, which is meant to be an on-ramp to using BIDS, with videos and tutorials developed by a number of groups, along with chat and forums. So that's something to point anyone who's interested in moving to BIDS toward, to help them get started. I like to think of BIDS as a standard, but what we're really trying to develop is an ecosystem around that standard, because a standard on its own isn't very useful until people can work with it and do things with it. On the one hand, we need ways to get data in and out of the standard, so we need converters and databases; on the other, we need tools that use the standard to actually do useful things. I want to talk about both. There are now a number of converters that allow one to get data into BIDS; a notable one is HeuDiConv, but there are plenty of other tools. There are also tools to get data out of BIDS, for example if you need to get it into ISA-Tab or into the NIH data archive formats. In addition, a number of databases now support export to BIDS; most recently, Flywheel added that as a gear in their toolbox. So it's becoming increasingly easy for researchers to get their data straight from the database into BIDS format, and the reformatting of the data is certainly the hardest part of using BIDS. Once your data are in BIDS, everything else is easy. So what's a BIDS app? When we were thinking about how to incentivize usage of BIDS, we wanted to think about why somebody might use BIDS other than just liking to take their medicine and do standards-compliant things. The idea was that if putting data into BIDS makes it easy to do other stuff, that's going to drive people to use BIDS. So we started developing this idea of BIDS Apps. A BIDS app is basically a containerized piece of software that's aware of the BIDS standard, such that it knows what files are where.
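Concretely, BIDS Apps share a uniform command-line convention: you point them at a BIDS directory and an output directory and tell them whether to run at the participant level or the group level, optionally restricted to specific participants. Here's a rough Python skeleton of that entry point; the processing calls are placeholders, not any particular app's implementation.

```python
import argparse
from pathlib import Path

# Skeleton of the uniform BIDS Apps command-line interface:
#   <app> bids_dir output_dir {participant,group} [--participant_label ...]
parser = argparse.ArgumentParser(description="Toy BIDS App entry point")
parser.add_argument("bids_dir", type=Path, help="root of the input BIDS dataset")
parser.add_argument("output_dir", type=Path, help="where derivatives are written")
parser.add_argument("analysis_level", choices=["participant", "group"])
parser.add_argument("--participant_label", nargs="+", default=None,
                    help="run only these subjects (without the 'sub-' prefix)")
args = parser.parse_args()

# Subjects are just the sub-* directories at the top of the dataset.
subjects = sorted(p.name[4:] for p in args.bids_dir.glob("sub-*") if p.is_dir())
if args.participant_label:
    subjects = [s for s in subjects if s in args.participant_label]

if args.analysis_level == "participant":
    # "Map" step: each subject can be processed independently (and in parallel).
    for subject in subjects:
        print(f"processing sub-{subject} ...")   # placeholder for the real pipeline
else:
    # "Reduce" step: combine the per-subject outputs into a group result.
    print(f"aggregating results for {len(subjects)} subjects ...")
```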
And so basically it's a wrapper that can take, for example, FreeSurfer and tell FreeSurfer where to look within a BIDS dataset to find the images it needs to do its thing. We've had a couple of coding sprints and worked with a bunch of different software groups to help them get their software into a BIDS app. This is just an example: if you've ever installed FreeSurfer, you know that it can be somewhat of a pain, depending on the operating system. There's a FreeSurfer BIDS app, so if you want to run FreeSurfer on a subject, all you have to do is install Docker and run this one command. We also care a lot about reproducibility, and one of the side effects of this is that people know exactly which version of FreeSurfer was used to run their analysis. And if you're running Docker, that abstracts away a lot of the potential differences between platforms that we know exist, so it helps ensure computational reproducibility from that standpoint. Now, most of what we do has some potential for parallelization; to the degree that you can run a bunch of subjects in parallel in FreeSurfer, you'd like to do that rather than running them one at a time. So we've built in an initial parallelization scheme, a sort of map-reduce scheme, that breaks a dataset out into individual subjects, does whatever it can on individual subjects, then brings them back together, does the group analysis, and outputs that. Clearly there's a lot more potential for parallelization at other levels, but this was the low-hanging fruit that doesn't require digging very far into any of the software packages. This is a sampling of some of the different BIDS Apps that are available; there are now more than 30 of them. Some of the notable ones are FreeSurfer, the Human Connectome Project pipelines, and some cool new tools like NUTMEG and BROCCOLI. So there are a lot of different tools where groups have built BIDS support into their software. I'm going to talk about a couple that our lab has developed. The first is MRIQC, which is a quality control package for anatomical and functional MR data. If you're in those fields, you know that quality control has been spotty and inconsistent across groups and across studies, and our goal is to make it really easy to perform high-quality QC. The idea is that you take a BIDS dataset, perform QC on both the structural and functional data, and get fairly compelling reports that allow an individual to easily look through a dataset. You get plots of the quality metrics across subjects, and then you can quickly dig into a particular subject and ask what might be driving that individual's poor performance on a particular image quality metric. This is an example of the functional MR output, where you get motion statistics and a carpet plot that lets you pretty quickly see whether there are large-scale global signal issues going on. One of the cool things that Oscar and Chris added to MRIQC, in collaboration with the group at NIMH, is that unless you opt out, whenever you run MRIQC a summary of the quality metrics gets sent to a server, and so we now have quality control metrics for over 40,000 unique fMRI scans. These data are all publicly available.
If you go to this particular URL, it's an API; you can grab any of these data you want. This allows you to ask interesting questions. For example, how does multiband factor relate to the temporal SNR of functional data? This is based on thousands of images, and we can see (sorry, these numbers are tiny) that up to about a multiband factor of six you don't really pay a penalty; at eight you start paying a bit of a penalty. So this allows us to aggregate across lots of different groups. Another BIDS app that we built is fMRIPrep, which was built with the explicit goal of being able to take any fMRI dataset you throw at it and do the best possible preprocessing, depending on what particular data are in that dataset, be it something from 15 years ago or Human Connectome Project quality data. I'm not going to go into the details of what fMRIPrep does. It was developed by Oscar Esteban, Chris Markiewicz, and Chris Gorgolewski in our group, along with a bunch of collaborators, and there's a paper on bioRxiv that you can dig into if you want the details. But one thing to mention is something we've struggled with in developing these kinds of tools: we worry a lot about giving people really powerful black boxes, basically saying, here, put your data into this thing and it will give you back preprocessed data, and you may not actually understand everything that was done to those data. That's something we worry about a lot. I think the answer is not to withhold powerful software; the answer is to incentivize and nudge people to do the right thing, which in this case means knowing where to look to make sure things actually worked. We call this a glass-box philosophy rather than a black-box philosophy. That means writing documentation that actually educates people about what the software is doing, providing reports that help them visualize and verify that the assumptions are met, and guiding them in how they disseminate the results. This is an example of some of the documentation, which walks through how EPI to T1-weighted registration works, shows the Nipype workflow, and provides some details about what's actually being done, to help people understand what we're doing to the data. You get reports back for each individual scan, basically saying, we did this distortion correction and here's how it worked; you can look at the images and visualize how it worked, and there are examples online of what it should look like. And finally, we provide boilerplate text that says, here's how you should report what you've done. This lets people report their results in a way that's compliant with reporting standards like COBIDAS, providing a sufficient amount of detail. One of the things we've wanted to do is to be able to process data all the way from raw through preprocessing out to statistical analysis, and that requires being able to fit the linear models we need for most of what we do. The challenge is that BIDS is a data standard that has not included a specification of what a statistical model should look like.
So, as part of this BRAIN Initiative project, we're developing a BIDS model specification. Tal Yarkoni and Chris Markiewicz have developed an initial version, and Chris has built a BIDS app called FitLins, which can fit a linear model: you take a BIDS dataset, tell it which of the models in the dataset you want it to fit, and it will fit them in an appropriate way. That in turn relies on a really useful tool developed by Tal Yarkoni called PyBIDS, which is basically a Python interface to a BIDS dataset, built on top of a more general file-grabbing tool of his called grabbit. It allows you to very easily point Python at a dataset: you get the BIDS layout for a particular directory, and then you can ask things like, what are the subject IDs in this dataset? For subject one, what types of data does that subject have? Or, drilling down into a particular image, what's the repetition time on this image? So it allows you to very quickly grab all the metadata from a particular BIDS dataset. A lot of our interest in this project has been in service of developing the OpenNeuro project. OpenNeuro is meant to be the next generation of what the OpenFMRI project was trying to do, which is open data sharing, and now we're adding the ability to process data as well. The idea is that you have a dataset you want to share, you go to OpenNeuro.org and say, I want to upload a dataset. It runs the data through the validator on the client side and tells you if there are any warnings or errors; once you fix the errors, it uploads. We have a pretty robust uploader that can handle relatively big datasets; if you have a really big one, please go try to break it and let us know if you can. Once the dataset is up, you get a page where, initially, it's private, but you can click a button and it becomes public, and you get a DOI for your dataset. There's also, currently turned off but hopefully to be turned back on soon, the ability to actually run workflows on the data. And we've added discussion. In OpenFMRI, people would give data to us and then users would email us and ask, does this particular dataset have such-and-such, and we would have to go back to the authors of the dataset if it wasn't one of ours. So we've enabled discussion on the OpenNeuro pages, linked back to the owner of the dataset: whenever somebody asks a question about a dataset, the owner gets pinged and can go to the discussion page and answer it. Just briefly on the architecture of OpenNeuro: right now it's primarily built around the Amazon cloud. The data come into our server, and we're using DataLad as our back end for data storage, to get the data in and out of S3 and other Amazon storage infrastructure. We currently use AWS Batch to do processing, though we're going to be working on extending that to use the kind of academic high-performance computing resources that Yvonne talked about, like the XSEDE resources and, hopefully, CBRAIN.
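Stepping back to PyBIDS for a moment, here's a minimal usage sketch against a hypothetical dataset path; the exact import paths and argument names have shifted a bit across PyBIDS versions, so treat this as indicative rather than definitive.

```python
from bids import BIDSLayout  # pip install pybids

layout = BIDSLayout("/data/my_bids_dataset")   # hypothetical dataset location

# Which subjects are in the dataset?
print(layout.get_subjects())

# What files does subject 01 have?
print(layout.get(subject="01", return_type="filename"))

# Drill down to a particular functional run and read its sidecar metadata,
# e.g. the repetition time.
bold_files = layout.get(subject="01", suffix="bold", extension=".nii.gz",
                        return_type="filename")
if bold_files:
    print(layout.get_metadata(bold_files[0])["RepetitionTime"])
```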
So OpenNeuro is a completely open-source project, and we've been excited that other groups have taken it and started using it for in-house data sharing and organization as well; in particular, the group at NIMH is using it for internal sharing and has been contributing back. We worry a lot about the sustainability of these kinds of projects, because you're going from grant to grant, and to the degree that we can get multiple groups contributing to development and maintenance of the software, that makes the project much more sustainable. So, what's the future of BIDS? As I mentioned, we're currently working on derivatives extensions, and we are, with fingers crossed, awaiting a notice of award on a larger grant that will provide longer-term support for the OpenNeuro project, which will also feed back into BIDS. We're working, particularly with JB, on developing a governance model; so far the governance has been a little fast and loose, but we're at the point where we really need a more explicit one. We're also working with INCF, which has done a review of BIDS as a candidate endorsed standard, so we're hoping to become one of their blessed standards for neuroimaging data. Everybody in this room is the future as well: if you're not already contributing, there's a contributor guide to get you started, and you can work on one of the existing apps, help build new BIDS Apps, or work on a new version or an extension of the standard. Finally, I just want to thank the community. Chris is, de facto, the benevolent dictator for life of the BIDS project, but we've had a bunch of people at Stanford for these various coding sprints, and you can see how long the author lists are on these papers. We've gotten a lot of people from around the world involved, and it's been really exciting. In terms of support, INCF was really important in getting us started; the Laura and John Arnold Foundation supported our center for several years, and we put a lot of those resources into developing BIDS; and now NIH, through the BRAIN Initiative, is providing support going forward. So thanks to everybody in the room who's been part of it, and thanks for listening. Thank you for the presentation. I'd like to bring it a bit full circle to your interests. You've talked a lot about data structuring and infrastructure, but as a cognitive neuroscientist, what goes on in the brain when something doesn't fit or does fit? And how can we use that knowledge to promote the uptake and adoption of these technologies? Sorry, I guess I'm not exactly clear on what you're asking. The effort is to make everything standardized, uniform, aligned in some way. When things are out of alignment, what happens in the brain such that this doesn't fit, this is weird? Oh, right. So that would be prediction error, or surprise, which we understand very well. No, I think the question of how you drive adoption is a good one, and in part we've focused on trying to make the benefits clear. Obviously, some people like standards because it makes them sleep well at night. They like orderly little boxes.
Some people don't really care about the orderly boxes, but if I can show them, hey, look, you can run FreeSurfer really easily, or you can get access to hyperalignment or Mindboggle or all these other cool tools, that's the kind of thing that can still pull in the people who don't care about tidy little boxes. So ultimately we have to recognize that there are different types of people in the world, and we have to try to pull in all of them. And just a quick follow-up: in the 80-20 principle that you applied, what was unique about the cases that you decided to exclude? That's an interesting question. I don't actually remember the details of particular things that we excluded. I'll talk with Chris afterwards and we'll try to come up with examples, because I don't remember any off-hand. Hi Russ, great talk, and it's great to see how far this has come just in the last year or so. A general question that's going to be important moving forward: have you had much adoption by, or interaction with, some of the already existing larger imaging databases that are perhaps not quite as accessible, with respect to interfacing with your tools to really jump forward? I think you and I both know there are a couple of huge ones out there that you can hop into and get very lost in very quickly. Yeah, we've started with the ones where we thought we had more leverage and have not tried to tackle the whales yet. In part, the hope is that enough of the users of the datasets from the whales will start asking for it, and then they'll come to us. There are enough people who want to interact with us now that I haven't felt the need to go push, but hopefully it'll happen. And just to second that: obviously they're here, they're everywhere, and there are new conglomerations of groups putting their data together. How public is your data model, with respect to people being able to look at how your file structure is set up and make sure they can generate something that's compatible before they go too far down the rabbit hole? Everything that we do is completely public, so it should be accessible, and we have a lot of example data now. Hi, thanks for your talk. I started neuroimaging about 20 years ago, before NIfTI existed, and at the time we were using the MINC format. One of the advantages of the MINC format was that it keeps a trail of everything that is done to the dataset. I understand that BIDS is using NIfTI, which has never had this capability, which in fact was admitted by the Oxford people as being inferior. So how are you addressing provenance, in terms of what has happened to every data point? Right. I'd like to point out that I did not plant that question; I had a conversation with Chris this morning because I knew somebody would ask a MINC question, given where we're sitting. I think we all agree that NIfTI has shortcomings, and we've had to build a lot on top of it. Part of what I would say is that so far BIDS has been focused exclusively on raw data, so this is less of an issue for raw data; once you start processing the data, provenance becomes a much bigger issue, and that's something we're tackling in the context of the BIDS derivatives project, trying to figure out how to represent data that have been processed.
Our model, I think, is going to rely on the same kinds of metadata structures that we've used, primarily (probably) JSON, rather than trying to use a different imaging format. And to be honest, the primary reason we went with NIfTI rather than some other format like MINC is simply that pretty much every package handles NIfTI; we wanted something that would work for almost everybody, and that's what drove the decision. But I agree that we have to build on top of it in order to address the provenance issues. Or we just decide that the image file is not necessarily the place where the provenance has to be stored; you can have sidecars that carry it, and that also opens things up a bit, because then you're not limited by the file format and how it happens to be built. Okay, thanks, that was a great talk. I completely agree that for the majority of labs, to incentivize them to actually use the format, you need to build tools that help them do what they want to do, and that's why I think the BIDS Apps are really exciting. My question is, what have you found to be key in helping people build those apps, and specifically guiding them to build apps that are useful not only for their own research but for the larger neuroscience community? So, to address the issue of how you get it to not just be people's in-house software: in part, people who have in-house software generally aren't going to be incentivized to come build a BIDS app, so it hasn't really been an issue. But we've also gone out of our way to reach out to groups that we know have a large user base, because we saw those as the most important ones to bring into the BIDS Apps ecosystem. Can you hear me? Okay. My question is about the preprocessing steps, because I feel that one of the most confusing things in dealing with any data is the preprocessing. What are the factors that determine the best preprocessing steps? The standard itself is agnostic about what is best; it's really about how you describe what was done. In fMRIPrep, we've made decisions about what we think is best, and some of those are data-driven; some of them are stylistic, I guess you might say. But I think it largely remains a cultural and social issue how one preprocesses their data; it comes down to what software is being used in the lab you're in. I would like to see more of the kind of thing Oscar Esteban did in the fMRIPrep paper, which is now on bioRxiv, where he did statistical comparisons of different workflows, because I think ultimately we should be making these decisions based on data, not on lore or on what software the postdoc happens to be using in the lab.