So Ben Dichter is going to talk about NWB extensions. Yeah. Hi everyone, I'm Ben Dichter. I'm going to talk about an NWB extension for storing the results of large-scale neural network simulations. This is work I did out of Ivan Soltesz's lab, but it's really a collaboration between a bunch of different groups: people from his lab as well as the Allen Institute have been working closely with people from Lawrence Berkeley National Lab and with Kitware to help with this.

So quickly, to explain what Neurodata Without Borders is: it's a data standard that's trying to unify data formats across labs that are collecting electrophysiology and optical physiology data. The way this works, I think of it as a kind of hierarchy where at the bottom you have data storage rules. These say things like: if you store local field potential data, you also have to store the sampling rate and the units of that data, and you have to store which electrodes recorded it; and if you store electrodes, you have to store the positions of those electrodes (a small PyNWB sketch of these rules appears below). So you end up with this kind of data dependency tree. If you're a theorist and you're usually getting data from other labs, you know how hard it can be to understand someone else's data structure. The idea is to establish the set of rules we need when we're transporting data from one person to another, and so that we can archive data. This ends up being a pretty complicated set of rules, so we're building PyNWB and MatNWB, tools in Python and MATLAB, to help us bring data from various formats into Neurodata Without Borders. And once we have data in this format, we can use the I/O tools in Python and MATLAB to efficiently build visualization and analysis tools that will generalize across these labs. This will enable collaborative neuroscience, data archiving, data sharing, tool sharing, and reproducibility. You're all here at Neuroinformatics, so I think you're on board with that, so I'm just going to move on.

The core of NWB really focuses on ephys and ophys, but I'm going to talk to you now about how we're using the extension framework to expand the scope of the format to store simulation data. Here I'm really focusing on new types of data that allow us to store the incredibly large volumes of data that come out of simulations. I'm working with two groups. One is actually Ivan's group. They are simulating a million or more neurons with realistic cell-type connectivity, dendritic morphology, and synaptic integration, with the goal of understanding how spatial memory works. The technology: they're using NEURON in Python with the Message Passing Interface (MPI) on high-performance computers with parallel HDF5. If this doesn't make a lot of sense to you, that's fine; it's just a lot of jargon. The output data is generally what you would sensibly want from the output of a simulation. Similar to electrophysiology, you want the spike times, but you also have the ability to record intracellular and extracellular continuous variables like membrane potentials and calcium concentrations. Simulation at this scale gives you both very large-scale recording from many neurons at once and very detailed information about specific neurons, if you're interested in how specific compartments behave within individual cells. Which results in a very large amount of data.
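Here is that sketch: a minimal, hedged example of how the dependency rules above look in PyNWB. The signatures follow a recent PyNWB release and may differ in detail from the version current at the time of the talk; the names and values are placeholders.

```python
from datetime import datetime
from dateutil.tz import tzlocal
import numpy as np
from pynwb import NWBFile, NWBHDF5IO
from pynwb.ecephys import ElectricalSeries

# Every NWB file starts with basic session metadata.
nwbfile = NWBFile(session_description='example LFP session',
                  identifier='demo-001',
                  session_start_time=datetime.now(tzlocal()))

# Electrodes sit at the bottom of the dependency tree: each row records
# its position and belongs to a group, which belongs to a device.
device = nwbfile.create_device(name='array')
group = nwbfile.create_electrode_group(name='shank0',
                                       description='example shank',
                                       location='CA1',
                                       device=device)
for i in range(4):
    nwbfile.add_electrode(x=0.0, y=0.0, z=float(i) * 20.0, imp=np.nan,
                          location='CA1', filtering='none', group=group)
channels = nwbfile.create_electrode_table_region(region=[0, 1, 2, 3],
                                                 description='all four channels')

# The LFP series must point back to the electrodes that recorded it,
# and carries its own sampling rate; the unit is fixed by the type (volts).
lfp = ElectricalSeries(name='lfp',
                       data=np.random.randn(1000, 4),
                       electrodes=channels,
                       rate=1250.0)
nwbfile.add_acquisition(lfp)

with NWBHDF5IO('example_lfp.nwb', 'w') as io:
    io.write(nwbfile)
```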
So if you went to Ivan's talk yesterday, you know that this is a really impressive project, and you also know that this slide is a really bad summary of it. I hope you were able to attend his talk, but we don't have time to go into the details of the goals of the project itself; I'm going to focus on how to store the output. And I'm going to do an equally bad job of explaining the Allen Institute's simulation effort, which is trying to simulate in vivo-like conditions for understanding visual scenes. The two groups actually started these efforts independently but settled on very similar technology. In red here is what's different; the black all stayed the same, because they're both using NEURON, Python, and MPI on high-performance computers with parallel HDF5, and the output data is essentially the same for both labs. So we discovered each other's efforts and decided, well, let's team up and try to establish a common output format so that the tools we build are compatible with each other. The goals for these tools are, again, to store spike times and to store continuous data like membrane potential and calcium concentration, and the requirements are that it needs to scale to a very large number of cells, it needs to enable efficient parallel read/write, and we'd really like it to be compatible with other groups as much as possible, and with experimentally derived data as well.

Okay, so here's the structure that we settled on for spike times. We have a cell ID array, which has a one-to-one relationship with an array of range references, and each range reference points into the spike times array. These three arrays hold the information for every single cell, so no matter how many cells you record from, the number of arrays does not scale with the number of cells. That's really essential here, because it allows us to scale to a million neurons with only three arrays defining all of the spike times. To illustrate with the blue entries: cell ID zero maps to spike times zero through five, and then you just look that up in the spike times array, which gives you entries zero through five. This allows you to have a different number of spike times for each cell, and you can just keep adding to it. So it scales to a large number of cells, because you only ever have three arrays, and it allows for efficient parallel read/write. One of the ways we do that is with this cell ID array, which might seem redundant because it's just counting up, but it actually allows you to write the data out of order. If you have cores representing different cells and you don't know which core is going to finish first, this could have been one and then zero; it doesn't have to be zero and then one, because we can go back, look at the order, and say, okay, they were written out of order. I should say that this solution is heavily inspired by work that was done in the Soltesz lab before I came into the lab. My work has really been writing an extension to NWB to port this technology into Neurodata Without Borders.
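To make that lookup concrete, here is a hedged sketch of the three-array idea in plain NumPy; the array names are illustrative, not the exact dataset names used by the extension.

```python
import numpy as np

# Illustrative names -- not the exact dataset names used by the extension.
cell_ids = np.array([0, 1, 2])                        # one entry per cell, may be written out of order
spike_ranges = np.array([[0, 5], [5, 7], [7, 10]])    # one (start, stop) range reference per cell
spike_times = np.array([0.01, 0.35, 0.36, 0.90, 1.20,   # cell 0: five spikes
                        0.50, 2.10,                      # cell 1: two spikes
                        0.02, 0.03, 3.00])               # cell 2: three spikes

def spikes_for(cell_id):
    """Return one cell's spike times without reading any other cell's data."""
    row = np.flatnonzero(cell_ids == cell_id)[0]      # works even if cell_ids is out of order
    start, stop = spike_ranges[row]
    return spike_times[start:stop]

print(spikes_for(0))   # -> [0.01 0.35 0.36 0.9  1.2 ]
```

However many cells there are, it is still just these three arrays, and because each process knows its own row and its own range, different MPI ranks can write their slices independently.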
The other thing we need to do is store continuous data. Here we use a similar strategy: we've got the cell ID and an index pointer, and this index pointer maps to compartment IDs, which are physical locations within cells. Similar to spike times, you want the ability to mark a different number of compartments per cell, so we have the same kind of lookup strategy, where the index pointer maps to compartment IDs, which give you the physical locations within each cell. Once we have those compartments labeled, we can use just a standard matrix time series, where the data is time by the total number of compartments, that is, the number of cells times the average number of compartments per cell (see the sketch below). And the way this is integrated into Neurodata Without Borders, just to give you a taste of how it works: this is an object-oriented way of storing data, so this object inherits from the time series measurement type, which means all of these attributes are required in order to store the data. You need to give the name, which says basically what it is, like whether this is membrane potential, plus the source, the sampling rate, the starting time, the unit, the conversion, and the resolution. Those are the essential metadata you need in order to analyze or understand a time series. So again, this scales up to many neurons, because we have five arrays here, and that allows us to fit data from any number of cells. And it allows for parallel read/write, because you can write this out of order if you need to. This is heavily inspired by the structures that are already in NEURON, and it's specifically designed to be compatible with SONATA, a format that the Allen Institute has been establishing with the Blue Brain Project. The idea here is that we can build extensions in NWB with the explicit goal of being easily interoperable with other formats. We don't necessarily need this to be the simulation output format to rule them all; as much as I would like that, I recognize it's not realistic all the time. But one advantage this has is that it allows you to store simulation output data in a way that's very analogous to the way you store electrophysiology data. For instance, you can apply minimal changes to the analysis script and run the analysis on both.
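Here is that sketch of the compartment layout, again hedged and with illustrative names rather than the extension's exact dataset names: the index pointer labels the columns of a single time-by-compartment matrix, so reading one compartment of one cell is just a column slice.

```python
import numpy as np

# Illustrative names -- not the exact dataset names used by the extension.
cell_ids = np.array([0, 1])
index_pointer = np.array([0, 3, 5])           # cell i owns columns [ptr[i], ptr[i+1])
compartment_ids = np.array([0, 1, 2, 0, 1])   # compartment label within each cell

n_timesteps = 1000
data = np.random.randn(n_timesteps, len(compartment_ids))   # time x total compartments

def trace(cell_id, compartment_id):
    """Continuous variable (e.g. membrane potential) for one compartment of one cell."""
    row = np.flatnonzero(cell_ids == cell_id)[0]
    start, stop = index_pointer[row], index_pointer[row + 1]
    col = start + np.flatnonzero(compartment_ids[start:stop] == compartment_id)[0]
    return data[:, col]

v = trace(cell_id=1, compartment_id=0)   # 1000-sample trace for one compartment of cell 1
```

The time series metadata mentioned above (name, sampling rate, starting time, unit, conversion, resolution) would sit alongside this matrix, so the same handful of arrays describes any number of cells.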
OK, so now that we have this extension, I'm working with the Neurodata Without Borders core development team to establish an extension sharing framework. If you notice, both of these data types aren't solving problems that are specific to simulation; they're really solving problems that are specific to scale. So I think we're starting to solve problems in the data storage space, specifically for simulations, that are starting to become problems experimentalists are having, and they will only become bigger problems as we progress. So we want to be able to share these extensions so that anybody who needs them can use them. This extension sharing framework is not quite ready for prime time; it hasn't been released yet. But what it will include is the specification itself, so the definition of these data dependencies, the code to generate and use it, and documentation. The idea is that we will be able to version this and get community feedback and community involvement, and as a community we can establish an extension for storing simulation output data that satisfies everyone's needs. And this extension sharing framework is not just for simulation output; it's for opening up the Neurodata Without Borders standard in any direction that's necessary for storing the data that your lab needs.

OK, so I just want to finish by saying this is basically what I do: I help labs bring their data into a common format, and I'm really interested in talking to you if you're interested in leveraging this type of technology or just learning more about NWB. If you want to convert your own data or build tools for NWB, that would be great; definitely talk to me if you want to do that. If you want to build bridges from other standards, like BIDS, for instance, to NWB, please find me or email me. I'll be around; I'd love to talk to you. And I just want to thank Kavli and the Allen Institute for putting on the hackathon pictured here. A lot of the collaborative work was done at that hackathon, so I really want to thank them for putting it together. OK, thanks.

Have you looked at the published specification for NSDF, the Neuroscience Simulation Data Format, which came out a few years ago?

I have not looked deeply into that. Right now the goal was to make this as compatible as possible with the other experimental groups that we're working with, but I need to look more deeply into that format. Actually, if you want to discuss it, I'd be happy to learn more about it. OK, sure, happy to do that.

The other thing was a question: I don't understand how, by staggering multiple electrodes, so to speak, into one array, you are able to handle streaming data?

Yeah, so streaming data is tricky. There are kind of two ways of storing spike times: you can do it by cell, or by time. If you're doing it by time, then you can just keep appending, and it's very easy to stream data. If you're doing it by cell, it's not 100% straightforward how you would stream data, and that's the weakness of this approach, for sure. There's a design decision about which thing we wanted to be most efficient to query over: whether we wanted it to be more efficient to query which cell a spike belonged to, or to query all cells over a certain time region. We chose to make querying most efficient over cells. And the way we would handle streaming, it's a little bit hacky, is that we would not require the cell ID array to be unique. So cell one could appear in that cell ID array twice. You could basically have a buffer of, say, a second of data, write all of your cells, and then for the next second append another set of entries with pointers to more spike times. It's not ideal, but it is a way to handle that problem.
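To illustrate that buffering workaround, here is a hedged sketch using the same illustrative names as the earlier spike-times example (not part of the released extension): because the cell ID array is allowed to repeat, each flushed buffer simply appends one more block of ranges.

```python
import numpy as np

# First flushed buffer: one second of simulated data.
cell_ids = np.array([0, 1])
spike_ranges = np.array([[0, 3], [3, 4]])
spike_times = np.array([0.10, 0.20, 0.95, 0.40])

# Second buffer arrives: append new entries, repeating the cell IDs,
# instead of rewriting the blocks that are already on disk.
cell_ids = np.append(cell_ids, [0, 1])
spike_ranges = np.vstack([spike_ranges, [[4, 5], [5, 7]]])
spike_times = np.append(spike_times, [1.30, 1.10, 1.80])

def spikes_for(cell_id):
    """Concatenate every block that belongs to this cell."""
    rows = np.flatnonzero(cell_ids == cell_id)
    return np.concatenate([spike_times[start:stop] for start, stop in spike_ranges[rows]])

print(spikes_for(0))   # -> [0.1  0.2  0.95 1.3 ]
```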
Nice talk, Ben. Would you say NWB is ready for prime time? Would you advocate that all of the experimental labs throughout the world should start converting all of their data to NWB, or do you think there's still a lot of iteration that needs to be done before we can advocate widespread adoption?

I would be firmly in the middle between those two positions. I definitely wouldn't say that you should use it as the backbone of your lab today, but I think it's really valuable for labs that are getting involved now, because we're in the final stages of the malleable state: we haven't quite gotten to the point where we guarantee backwards compatibility, because we want to stay flexible enough to handle the issues that are coming up with labs. We have a timeline: the goal is mid-October to release an official version and say we're going to support backwards compatibility from here on out. I think that's an ambitious timeline, but it's soon; it's coming. We've been working through a ton of bugs, and now it works; I have been able to really get labs into it in a way that works for those labs. So I would encourage you, if you're interested in using NWB at some point, to get involved now so that we can see what problems your lab faces, while we're in this rapid development stage where we can fix them more easily because we're not yet guaranteeing backwards compatibility.

I'll go to the one right at the back.

Hi, I haven't worked with this type of data before, but I have worked with other data where the sampling rate is sort of a best estimate. Does NWB handle the case where your timestamps might be noisy?

Yeah, absolutely, that's a good question. The time series data type has the option of storing your times as a starting time and a rate, or you can store them as timestamps. You must do one or the other, so if you have some kind of irregular sampling rate, you can store explicit timestamps, and we are totally able to support that data.

Have you thought about downloading data sets that are already out there, for instance from the CRCNS database or the other one that Eli uses, converting them, and uploading them into your database? How does that work? We spend a lot of effort when working with other people's data, so it would be helpful if the data sets that are already there were reprocessed somehow.

Yeah, so I have had some requests from specific people who have posted data, saying, hey, if you wouldn't mind downloading it, converting it, and putting it back up. That's something we would definitely be open to. The inspiration for this format in the first place was that it is so difficult to understand data sets if they're not following any kind of standard. In my experience, you're almost guaranteed to be missing some crucial information, and then you need months of communication with the experimenter in order to actually understand it. So what I'm focusing on right now is working with experimentalists as they're trying to put their data up, because they're there and they can help me answer the questions. I think it totally would be possible for some particularly well-documented data sets to do this kind of thing: download them, convert them, and re-upload them as NWB. But what we'd ultimately like to do is make this such a usable tool that when you upload data, it's relatively straightforward to convert it into NWB, and we're tackling it from that perspective, moving forward, as opposed to going back through the endless archives of data.

Okay, I think we should now move on. Thank you very much.