is entitled Workflows for Large Cohort Neuroimaging Datasets, just one slight change. In particular, I'm going to be talking about distributed workflows. It's a concept I'm going to introduce, and I'll provide some motivation for why we've done things this way, as opposed to how people normally do it. In the previous talks, Dr. Poline has provided some great background and motivation for why we need to keep provenance information and metadata about all the processing and analysis we're doing, and I'm just going to add a little bit to that.

In a typical neuroimaging workflow (not a computational workflow, but the workflow of getting your study done), you start off with getting money and acquiring some data; that's usually the first step. After that, you have some storage: you store your neuroimaging data sets, your neuropsychological scores, et cetera. You do some processing on the data, pre-processing and post-processing, for example with FreeSurfer or FSL. Then you take your processed data set, apply your shiny statistical model to it, and hopefully get some good results. Or you use Excel; I'm not going to judge. And then you get the answer to the most important question in the universe, and hopefully you get a Science paper out of it. This grey area is what I see as what neuroinformatics encompasses: everything, probably including acquisition, but starting from data storage, through processing, statistical modelling and analysis, and interpretation or modelling of results, up to your publication.

Now, the problem thus far, and you can get away with it if you've got small data sets or small studies, is that this middle part has usually been left to you: your boss usually cares about this bit and this bit, and the rest is your problem. What has happened is that this is all done on the desktop, on one computer. And this is my computer, by the way; this is how it used to be, not any more, thankfully. You've got a whole bunch of disks connected up, with various data sets that have probably been collected over five or ten years, and you probably don't even remember what they are or what you've done to them. You had one magic script, function or program, or a bunch of them, that went from there to there. In many cases, this was written by a PhD student or a postdoc who has since left the lab, and it might as well be in some archaic language, because you can't make heads or tails of what the script is doing, and you can't modify it. So this is something we really want to avoid. And when you get to large-scale studies, and there are a few of them now, such as AIBL, ASPREE, and the Human Connectome Project, this is a big no-no. We cannot work like that.

So with the large-scale studies, and hopefully we can take some of these tools back into the smaller studies, we really want to move away from doing things the old-fashioned way and towards having a proper informatics system in place, with automated, standardized workflows and reproducible results. These are some of the issues we need to overcome. One is reproducibility: can anyone else actually do what I did, given my data or someone else's data? Consistency and provenance: can I redo what I did five years from now? Do I know what the results mean and how I got there? And state of the art: why are you still using SPM2? So our motivation in our lab for getting some of this stuff done was the ASPREE-NEURO study.
So ASPREE-NEURO is a longitudinal imaging sub-study of the ASPREE main study. The ASPREE main study is a randomized clinical trial of low-dose aspirin use in a very large cohort. In our particular sub-study, the two main questions we're interested in are the effect of low-dose aspirin on the incidence of cerebral micro-haemorrhages, and its effect on imaging biomarkers of cerebrovascular and cognitive health and ageing. We've tried to develop a comprehensive, state-of-the-art protocol with three imaging time points, and at baseline we've already collected data on 550 subjects. That's a lot of data to process, analyse and share at the end of the trial. To give you an idea, we collect your standard T1-weighted images, FLAIR images, susceptibility-weighted images, diffusion, and ASL for cerebral perfusion. With the SWI data sets we're also doing QSM, which is another degree of post-processing that we need to do with our data, and we need to keep our phase data to be able to do this, plus resting state and some diffusion analysis as well.

One of our motivations for developing the set of tools that I'm going to describe is that you naturally end up with very heterogeneous environments in labs, due to the people who are working there and decisions that were made in the past. You have lots of compute facilities that may be on-site or off-site, and so on. Ideally, you want to be able to deal with this in a unified manner, and we wanted the tools, or the system, we developed to be as flexible as possible, to make use of all the available resources that we had. To give you an example of the sort of informatics set-up at MBI, we've got data coming in from three scanners, and there'll be a fourth one very soon, into our central data store, which is DaRIS. DaRIS is like XNAT: it's a centralized medical imaging database built on top of Mediaflux, which is a media asset management system. The data then goes off onto some disks and gets stored there. We also get data from off-site, so we've got some researchers at Monash, or even at other universities, who are using facilities off-site, and they want to store their data together with all the other studies that they're doing at Monash. They can access their data via DaRIS using a web portal, or get it onto the computational resources using web download or a command-line fetch; there are lots of ways of actually getting to your data.

So the use of these sorts of centralized informatics systems is becoming quite common, especially pushed forward by the large cohort studies I talked about before, like AIBL, ADNI, and the Human Connectome Project. But instead of just storing data, we can actually use them to improve the reproducibility and consistency of our processed data and related metadata; DaRIS and XNAT are examples. There are a few workflow engines available as well: there's Kepler and Taverna, and for medical imaging in particular we've got Nipype and the LONI Pipeline. There's also some workflow capability built into some of the informatics systems, such as XNAT. However, in many cases we found that the workflows can be a bit difficult to develop, or you're restricted in where you can run them: you need to have your database running on your compute system, and you can't really move too far away from that.
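To make the scripted data-access idea concrete, here is a minimal sketch of pulling data from an XNAT-style REST API with Python. The host name, project ID, and credentials are hypothetical, and this illustrates the general pattern rather than the actual DaRIS/Mediaflux interface used at MBI.

```python
# Minimal sketch, assuming an XNAT-style REST API; the host, project and
# credentials are hypothetical, and this is not the DaRIS/Mediaflux interface.
import requests

XNAT_HOST = 'https://xnat.example.org'   # hypothetical server
PROJECT = 'DEMO_PROJECT'                 # hypothetical project ID

session = requests.Session()
session.auth = ('username', 'password')  # or an existing session token

# List the subjects in a project (XNAT returns a JSON ResultSet)
resp = session.get(f'{XNAT_HOST}/data/projects/{PROJECT}/subjects',
                   params={'format': 'json'})
resp.raise_for_status()
subjects = [row['label'] for row in resp.json()['ResultSet']['Result']]
print(f'{len(subjects)} subjects in {PROJECT}')

# Find the imaging sessions for the first subject, then download all scan
# files from the first session as a single zip for local processing
exps = session.get(
    f'{XNAT_HOST}/data/projects/{PROJECT}/subjects/{subjects[0]}/experiments',
    params={'format': 'json'})
exp_id = exps.json()['ResultSet']['Result'][0]['ID']

archive = session.get(f'{XNAT_HOST}/data/experiments/{exp_id}/scans/ALL/files',
                      params={'format': 'zip'}, stream=True)
with open(f'{exp_id}_scans.zip', 'wb') as fh:
    for chunk in archive.iter_content(chunk_size=1 << 20):
        fh.write(chunk)
```

The same pattern, querying a central store and pulling data down to wherever the compute happens to be, is what the distributed-workflow idea described next builds on.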
So we came up with this idea of distributed workflows to increase the flexibility of how we can do the analysis. The idea, basically, is that we decouple how and where we store the data from the computational resource and from the description of the workflow itself. The data can be stored in a DaRIS database, in XNAT, or in the file system, and what we've developed is a set of unified ways of accessing these: get the raw data in, run your workflow, get your outputs, and then the processed data and metadata go back, which could include a description of what your workflow does. Something I picked up while I've been here at the conference is this BIDS project, which is going to be very useful; when I get a bit of time, I'm going to make sure that our file-system access mechanism starts conforming to the BIDS specification.

So what we have done is develop some of these Python-based wrapper libraries that I've talked about, and the workflow breaks down into three steps. First, data access: we use Python and REST, mainly Python, but you can use REST as well with DaRIS and XNAT. One really flexible piece of technology that DaRIS provides, and it's not available in XNAT yet as far as I'm aware, but hopefully will be in the future, is one-time tokens. This means that when you push your workflow, you can use your credentials to get a token from DaRIS that provides access to your data for that one time. After your workflow executes and uses the token to get your data, the token is no longer valid. This way, you don't need to send your credentials or stay signed in all the time with your workflows.

Second, workflow execution, which depends on open-source tools. In the majority of cases, 99% of the time, we're using Nipype to describe our workflow, and the Nipype workflow makes use of FSL, SPM and other tools that are available in the community, plus whatever job managers are available on the systems themselves: Torque, PBS, Sun Grid Engine, et cetera. Third, we take the results (the expected outputs are also described in the workflow), together with the metadata and some logs, and upload them back into DaRIS, XNAT, or whatever database you're using; again, this is Python, REST, and a little bit of Nipype.

This slide shows, not very clearly, the result of one of the workflows, with the outputs going back into our data set. You'll see a QSM image that is the result of this workflow, called QSM 1.1, and together with that you also get a description of what the workflow did or what it was about. What I don't show here, but you can come and talk to me about it, is that in the attachments you also have the provenance information generated by the Nipype pipeline for this particular process. So you've got everything together: some metadata, your results, and your provenance information.

We've been developing this tool for about a year or so, and we're using it for our ASPREE study. What we're doing as a next step is the MBI automated workflow environment, which is a sort of graphical user interface that allows other people to use these tools very easily, hopefully. The idea is that it's designed to be pluggable in nature, so you've got plugins for workflows, for informatics systems, and for execution systems. As long as your plugin provides this uniform interface, in theory it should be able to work together with the rest of the system.
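As a rough illustration of the workflow-execution step, here is a minimal Nipype sketch, not the actual ASPREE-NEURO QSM pipeline: a two-node workflow that skull-strips a T1-weighted image with FSL BET and then runs FSL FAST tissue segmentation, executed either locally or through a cluster-scheduler plugin. The file paths and the plugin choice are assumptions for illustration.

```python
# Minimal sketch of a Nipype workflow, not the ASPREE-NEURO pipeline itself.
# Paths and the execution plugin are assumptions for illustration.
from nipype import Node, Workflow
from nipype.interfaces import fsl

# Node 1: skull-strip a T1-weighted image with FSL BET
bet = Node(fsl.BET(frac=0.3), name='brain_extract')
bet.inputs.in_file = '/data/sub-01/anat/sub-01_T1w.nii.gz'  # hypothetical input

# Node 2: tissue segmentation with FSL FAST
fast = Node(fsl.FAST(), name='segment')

# Wire the nodes together; Nipype records provenance for every node it runs
wf = Workflow(name='t1_preproc', base_dir='/scratch/workflows')
wf.connect(bet, 'out_file', fast, 'in_files')

if __name__ == '__main__':
    # Run locally with multiprocessing, or hand each node to a job manager by
    # swapping the plugin, e.g. wf.run(plugin='SGE') or wf.run(plugin='PBS')
    wf.run(plugin='MultiProc')
```

Swapping the execution plugin is what lets the same workflow description run on a laptop or be submitted to a scheduler such as Sun Grid Engine or PBS, which is the kind of flexibility the distributed-workflow setup relies on.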
So yeah, in theory you could use it with any kind of workflow description, but in our case we try to use Nipype as much as possible. It's a great tool and provides provenance information for free, which is very useful. Just quickly, I'm going to scroll through a couple of slides showing what the prototype looks like. You've got all your workflows on the side; you click on one, and it gives you a brief description of the inputs it expects, the outputs it produces, and where it can run, plus the pipeline graph. You can pick your informatics and execution systems and put in some details. MASSIVE, which Wojtek will talk about a bit later on, is our HPC facility at Monash, so you enter details about accessing your HPC account and which particular data set you want to process, and off you go. Once it's done, you get an email back saying this completed successfully, or, if there's an error, you get the logs coming back to you.

Just for some future work: we want tighter integration with some of the community standards, so, as JB talked about earlier, NIDM and BIDS. We're getting the NIDM workflow part from the provenance information that we get out of our Nipype workflows. It would be really good to incorporate the NIDM results part with our standard processing pipelines for FreeSurfer and FSL. I don't know about the NIDM experiment part, because that means we're going to have to get users to think about this before they get started, but we'll try. We want to finish this prototype of the automated workflow environment, evaluate it, and make it available to users of our facility, so both large and small cohort studies, and hopefully, eventually, if people are interested, to the wider community. And Gary's ideal goal is to make it child's play to be able to do this. Thanks. These are some of our colleagues who have helped with this work: Michael, Brian and Tuan from Monash Biomedical Imaging, and Neil, Amanda and Wilson from Melbourne University. Thanks.

Thank you, Panesh. OK, I believe lunch calls. We've got time for one question.

So how many of you are working on this? I mean, is it just you? That's also where having a community of people that are developing this in some sort of shared way is critical, I mean, to...

Yeah, so we are trying to. I mean, there are a few more people interested in this now that we have some results. So we are trying to get them together in Australia, and then once we get a few things going, then possibly other people. But I'm happy to put it up on GitHub once I have ironed out some of the issues and bugs.

OK, thank you, Panesh. Look, I think it's been a great session. I think...