So, I don't really feel like I should be here, because I was supposed to be a rock star and it didn't work out, so I do computer science instead. It's a long story. But there is a happy ending: I get to hang out with the INCF crew, they're so awesome, and it's a great place to be. So this is, I think, a good solid ending to my sad story.

The title here is very long, but it can be summed up as data sharing, or maybe a prelude to data publishing. That's what I'm going to talk to you about, in our context. We're trying to do open science in Montreal, we have all sorts of initiatives, and one of them is CONP. That stands for the Canadian Open Neuroscience Platform.

But before I get into that I want to say a few things. This is a slide I stole from JB, who I think stole it from Mary Ann, and I don't know how far back it goes, but I've cited the people, so there we go. Either way, it's a very telling slide in my view, because a lot of the reason why we're trying to do open science is precisely this. Yes, we're trying to enable scientific discovery, and all sorts of reasons have been cited already, but even if I was just a bureaucrat of some kind and I saw this figure, it would be very telling, right? 200 billion, maybe, is an estimate of dark data, data that never really sees the light of day. That data is often lost, so that's a huge waste on the system. It's a waste for us; it's money that could be better spent. So this is one of the overarching reasons why we are doing open science.

This is the platform that we launched about a year ago. It's an 11 million dollar grant from Brain Canada. It's a three year mission, we're about one and a half years into it, and our goal ultimately is just to share data.
I'll tell you a little bit more about the details of exactly what we're doing, but first a couple more overarching principles. I won't spend any time on this because I think everyone here has heard a lot about FAIR. It's definitely a great idea and concept, and I'm happy that it's been quantified, or qualified. That's something that governs what we do. Pursuant to that, we wrote our own best practices for neuroimaging. They're based on the FAIR principles and they work really nicely with governing bodies like INCF who do standards and best practices, so we've tried to translate that into something that the average scientist can actually use.

So that's the preamble; now the actual plan of what CONP is. When I first started in this field almost 20 years ago, this was sort of my reality, and I think we've come a long way. The current plan is this up here, which is a lot of things, but basically it just illustrates the amount of technology that we have used, borrowed, and created to come up with a design that hopefully will work to share data. In simpler terms, the central repo that will effectively be your portal to get data from the Canadian Open Neuroscience Platform is based on Git, git-annex, DataLad, and a bunch of standards. This is what it looks like; I'll get into a few more details about this in a second.

There's other low-hanging fruit that is pretty useful, I think. There's something called the DATS model, and what we use it for is to describe datasets. It's important to have metadata and provenance for each individual piece of raw data, but as a start, this is not hard to do: everyone who has a study or a dataset of some kind could probably describe it in a very concrete way that can then be automated, so you can search for it much more easily and share it among other repositories.
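To make the idea of a dataset-level description concrete, here is a minimal sketch in Python. It is not the DATS schema itself: the field names are only loosely modeled on it, the list of required fields is an assumption made for illustration, and the helper `missing_fields` and all the values are hypothetical. The point is just that a small, structured description can be checked and serialized automatically.

```python
import json

# Fields this sketch treats as mandatory. This list is an assumption made for
# illustration; consult the DATS schema for the real requirements.
REQUIRED_FIELDS = ("title", "description", "creators", "types", "licenses", "keywords")


def missing_fields(descriptor: dict) -> list[str]:
    """Return the required fields that are absent or empty in the descriptor."""
    return [name for name in REQUIRED_FIELDS if not descriptor.get(name)]


# A hypothetical dataset-level description; every value here is made up.
descriptor = {
    "title": "Example MRI study",
    "description": "Structural and functional scans from a small cohort.",
    "creators": [{"name": "Example Lab"}],
    "types": [{"value": "MRI"}],
    "licenses": [{"name": "CC BY 4.0"}],
    "keywords": [{"value": "neuroimaging"}, {"value": "open data"}],
}

problems = missing_fields(descriptor)

# Serialize to JSON so that a crawler or portal could index the description.
serialized = json.dumps(descriptor, indent=2, sort_keys=True)
print(serialized if not problems else f"missing: {problems}")
```

Once a description like this sits next to the data, searching and sharing across repositories becomes a matter of reading one small file rather than inspecting the raw data.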
So this is one of the things that we're pushing as low-hanging fruit.

There are other considerations for our platform. There's open data sharing, and that's really easy, at least conceptually: you just put data on a website somewhere. You have to get ethics for that first, and that's not that easy, but the technological implementation doesn't need to be too complicated. There's closed data, which means a bunch of data use agreements. That's been science for the most part: you have to fill out all these forms and deal with all kinds of complicated bureaucracies. We're introducing a new model called registered access, where you can be authenticated by members of your community. We're still working on exactly how to implement it, but it's something that's going to come to fruition in the next few months.

I also stole this from JB, but it's something that's very much part of our experience. We have this anecdotal gap analysis; just from all our experience, we know some of the problems that have occurred. There are articles about the cost of FAIR, for example, that you could use as a reference. But data is still very hard to find and sometimes even harder to access. It's not usually open or very shareable. It's not well documented, it's not well standardized, and it's still difficult for researchers to share. If you're an individual and you just want to share your data, you run into problems. Another key point is that the data is not persistent or, as you've heard many times, reproducible.

So we have some governing principles that we've adopted. They are along the lines of FAIR, but the way that we're going to deal with data is that it has to be distributed. You can have your own repo, and then we'll crawl that and try to make the data available. The governance of the Canadian Open Neuroscience Platform is also a distributed model.
There are many players with their own databases and different ways of sharing data, so that's important. Portals will have to have direct access to the metadata, so we'll make the metadata and data available to everybody. The metadata has to exist both for the files and for the datasets as a whole, like I was telling you about with the DATS model. Tools need to be versioned. We have checksums and other kinds of hashes that we use to ensure integrity. We are very conscious of privacy regulations. We've already released human data, which I'll tell you about in a second, and we want to make sure that it's absolutely de-identified and in line with any ethics and privacy regulations. We want to make this available through a UI, but also, for all the geeks out there, through APIs, and make it easy to share data.

Let me tell you about what we've actually shared so far. We had an annual meeting in Toronto in March or May, I think. This was our first dataset; this is what we actually made available as open, or at least registered. The Open PREVENT-AD dataset is totally open. You have to create an account, just for our own tracking purposes, so we can see who downloads what and try to adapt to that. It's open data. There's no permission that you need to get. It's an Alzheimer's study and there are about 500 subjects. There's all kinds of imaging data, a full battery of MRI modalities, you name it, and some behavioral metrics. There's more data coming, and some of it will come under registered access. This is a pretty big milestone in our field. It's kind of like ADNI, but totally open.

There's some phantom data. There's a researcher who has actually scanned himself something like 73 times here, I think. I don't know why. I don't know why anyone would want to do that. I did that by accident, and I've released my data as well. I got signed up for this maybe 15 years ago. Someone said, hey, do you want to just do a couple of scans? You can go to Seattle or something. It sounded great.
15 years and hundreds of scans and hours in the scanner later, I don't know if I would ever recommend that. But now that I'm there, I want the world record. I want to hold onto it, so, just saying.

There's other data that we had already released before, but it's under the umbrella of CONP as well. There's BigBrain. This is the highest-resolution 3D reconstruction of a brain based on histology that's ever been created. It's at 20 micron resolution. Some of you may have heard about it already, but that's available as well, fully open. There's also an EEG initiative that we've released in Montreal. We've created a cool viewer where you can overlay the electrodes on the different regions of the brain and see the signals for every one of those electrodes. That was nice.

This is what the portal looks like. We haven't officially launched it yet. There's a prototype that we've worked on, but sometime in the next few months there'll be an official launch and you'll hear about it. This is some of the other data that we're going to release soon as well, just to show you that the scope is increasing. I have all sorts of people coming to us, interested in sharing data. If you're one of them, please come talk to us.

CONP has many other parts. There's training, which I won't speak to too much. I'll talk more about the infrastructure. I just explained the data, but there's also computational infrastructure that we have. This is just an example of a pipeline, to show you the complexity of some of the things that we do and house.

I'll start again with the low-hanging fruit. Boutiques is a way to describe pipelines in an efficient way. It was created by Tristan Glatard. A lot of the time, for those of you who are into high performance computing, you will install pipeline software, and it can get kind of tedious. If you know what you're doing, that's great.
But if you want to share your pipelines with others, it's hard to recreate that on another HPC system. So he's created a standardized system with which you can describe your pipeline and then use that description in an automated way. That's easy enough to do, and then you can containerize your pipelines and go from there.

There is a back end that we've developed over many years in Montreal called CBRAIN. It's a high performance computing environment that basically abstracts the complexities of all that stuff I was talking about. You can be kind of a naive user, a neuroscientist without much computing experience, and still be able to launch your pipelines using the portal. If you have more experience, then you can get into the weeds of it. This allows you, in a very distributed way, to launch a pipeline from Tokyo with data from Montreal and then output it to Stockholm, or something like that. So this is quite a powerful tool, and it'll be one of the back ends for CONP. There can be others.

And I stole this from Dell when I was hanging out with them. One of the things that drives what we do, I think, and will drive us much more, is AI, or deep learning and machine learning techniques. This is just to illustrate the complexity of the kinds of tools that they have and use, and things that we'll probably be experiencing soon enough. I think the deep learning movement will very much affect the open science movement and the big data that we deal with. So this is worth noting. Don't tell Dell I stole it from them.

So there's more. This wasn't necessarily that easy to accomplish; there's a whole process behind it. First and foremost, if you want to share data, you have to have ethics. This can be really complicated depending on what country you're in and on your ethics board. There are so many factors that influence it, and the rules are not necessarily the same everywhere you go.
There are some overarching principles about how to get ethics. In getting registered access, for example, we've created a template that I think a lot of institutions can use. We often get asked, well, I would like to share data, but I don't know what to tell the ethics board. So we have this ethics framework that maybe can be adapted. And I find that if people see someone has already done it, they're more willing to follow your lead.

There's also curation. People were talking at the end of the seminar yesterday about the complexities of this, and about why there isn't more value placed on curation. This is something that I do, for sure, and a lot of work went into curation. There are a lot of sides to it. I could explain more during the panel session if anyone has questions about exactly what's involved in curating data. But there's a fair bit, and I think we've gone through the full process, from acquisition all the way to dissemination.

Another important thing is quality control. This is something that everyone needs to do. If you're sharing open data, some people like to just share data that's maybe not curated very well, and that's a perspective. But a lot of people appreciate it when the data is cleaned, there aren't too many artifacts in it, it's well organized, and there's all sorts of quality control. In fact, Pradeep here, wherever he is, is starting this whole INCF quality control standardization effort. That's something that I think will be very valuable. So regardless of what your perspective is on QC, I think it could use some standardization.

Also, interoperability. This is one of the themes of the Canadian Open Neuroscience Platform. As much as I would like LORIS, for example, to be the database of the whole planet, that's probably not going to happen.
So what is key is to have interoperability between all kinds of different systems. There is standardization that INCF, I think, could very much support as well in terms of APIs; there are so many different ways to transfer data in an automated way. Having standardization in neuroscience would probably benefit everyone in this room. So these are other core initiatives that are part of CONP, and we are pushing them.

There's DataLad as well, which is part of the technology that we use to house metadata. We've written crawlers for it. It's software written by Michael Hanke, who's amazing, and smart, and cool. And Yaroslav Halchenko. Sorry, I told you I was going to make fun of you, so there you go. He's smart too. Anyway, this is something that could probably be used by a lot of people, not just in our context. So I would encourage you to investigate how you can adapt DataLad to your own datasets.

One other little thing that we have created is our own checklist of how to make data go open. We've done it within our own context. Right now it's a little more centric to our experiences, our software, our data, but it's something that will generalize more. It was based on our first round for PREVENT-AD. It actually took a long time to curate that data. I think a lot of people underestimate the amount of time it takes to totally organize and clean your data and go through all the different steps to make it go open. So this is our first pass. We're probably going to release it as a preprint or something like that. So this is coming.

And there's one last point that I want to make, and that's that one of the goals is also data publishing. I think you saw this slide earlier. Basically, there are initiatives, different kinds of new journals, like the one at HBM, that are looking to share data in a way that's more along the lines of data publishing.
So it's not just about the traditional scientific journals, but about sharing everything that you have: your notebooks, your datasets. And the underlying principle, I think I was drunk when this came up, but it was over beers with Pierre Bellec and Nikola Stikov, who are the communications wing of the publishing side for CONP. It's an idea called NeuroLibre, where we basically create the underpinnings that any open journal could use to organize the scripts that you use, organize the data, and run these things in a more automated way, from different places. I can give you more details about this if you want, but this is something that has already been done for CONP. We've released it, it's all open source, and it's going to become more and more elaborate. So for those of you in the publishing industry, you may want to take advantage of these kinds of concepts. So that's another thing.

And I guess that's basically it. For those of you who want to discuss more, I will be at PubLolec later. So feel free to join, and thank you for your attention.