So, this is a talk about science gateways. I'm going to start with a quick introduction and definition, just to show you that this is not a niche market; there are actually quite a lot of people all around the globe looking into this.

So, what's a science gateway? A science gateway is a set of tools, data and applications integrated in a web environment, usually behind a web portal. And usually these web portals are connected to larger infrastructures to support big computations or large data storage. There is now an International Coalition on Science Gateways; it was launched just before the summer, actually. And the major funding agencies on all continents have structures for this: CANARIE in Canada, EGI in Europe, sciencegateways.org in the US, Nectar in Australia. So it's really a big thing at the moment.

In neuroinformatics, I guess you are familiar with some of these systems, and the number of science gateways is increasing every day. It started at the beginning of the 2000s, but now most projects in the different domains are developing their own science gateways, either for computing, for data, or for both. To name just a few, and this is a reduced list: the Agave platform in the US, the AMC science gateway in the Netherlands, CBRAIN in Canada, the Human Brain Project in Europe, the LONI Pipeline in the US, NeuGrid in Europe too. Commercial companies have also started to go into this business; I'm thinking of Flywheel in California, for instance. And we had, of course, a presentation during the previous session on the Neuroscience Gateway. Regarding data, the situation is more or less the same: there are COINS, Girder, LONI, LORIS, REDCap, and XNAT, just to mention the main ones.

The thing is that all these platforms are usually developed for specific purposes, either subdomains or specific features, and it's becoming more and more of a challenge to integrate them and make them interoperate. So, in the next few slides, I'm going to focus on VIP and CBRAIN, the two platforms on which the remainder of the talk is based.

VIP is a platform that we developed at CNRS in Lyon. It has a web portal offering people a way to start applications and monitor them. It's based on the European Grid Infrastructure, a big computing infrastructure with more than 100 sites distributed in Europe and beyond. We consume about 30 CPU years every month, and we have applications in a variety of domains related to medical imaging, including neuroimaging. The project started in 2009, and as of today there are more than 900 registered users who use the platform and publish with it on a regular basis.

CBRAIN is a project started in Alan Evans' lab at McGill. It's basically an integration platform for tools, data, visualization software, and high-performance computing sites. The service in Montreal leverages the Canadian computing infrastructure; you can see a snapshot of the various sites on the top right of the slide. It also includes a variety of different tools for data processing and simulation.
So, in case you're interested in systems architecture, this is how CBRAIN works. It's a very conventional three-tier architecture with APIs and clients at the top, so you can access CBRAIN either as a human, just by clicking in the portal, or as an application, by using the REST API. The middle layer has a variety of services: a central catalog to keep track of your data distributed on different sites, services to keep track of your tasks, and provenance information in case you want to see exactly what happened to a particular dataset, and so on. The bottom layer is the resource layer. One of the features of CBRAIN is its ability to flexibly adapt to different computing or data providers: there are plugins for the main batch computing systems like PBS, SGE, etc., for Amazon Web Services in case you want your analysis to run 100% on the cloud, and for the major data access protocols like SFTP, SCP, etc. It also has a plugin architecture, so if you need to install it on your infrastructure and there is no connector for it, you can easily develop one.

CBRAIN is now being used for the open science project at the Montreal Neurological Institute through a connection with the LORIS data platform: the data collected at the MNI is going to be stored in LORIS and will then be processable in CBRAIN through the REST API I was talking about. Finally, another feature of CBRAIN is that it's 100% open source, and not just open source in the sense that you can download the code and play with it; it's also pretty well documented. So if you have even a small infrastructure in your lab, with a few computing servers and data storage distributed here and there, CBRAIN can help you manage this infrastructure for your own needs. This is all available on GitHub. So, that's it for the advertisement page.
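As an illustration of the programmatic access mentioned above, here is a minimal sketch of what a scripted interaction with a CBRAIN-style REST API could look like. The endpoint names, the token field, and the payload shapes are assumptions following common REST conventions, not the exact CBRAIN routes; the actual API is documented with the code on GitHub.

```python
# Minimal sketch of scripted access to a CBRAIN-style REST API.
# Endpoints, the token field and payload shapes are illustrative
# assumptions, not the exact CBRAIN routes.
import requests

BASE = "https://portal.cbrain.example"  # hypothetical deployment URL

# 1. Authenticate and obtain an API token (assumed endpoint).
session = requests.post(f"{BASE}/session",
                        data={"login": "alice", "password": "secret"})
session.raise_for_status()
token = session.json()["cbrain_api_token"]

# 2. List the tools registered on the platform (assumed endpoint).
tools = requests.get(f"{BASE}/tools",
                     params={"cbrain_api_token": token}).json()
for tool in tools:
    print(tool["name"])

# 3. Submit a processing task on a registered file (assumed payload).
task = {"type": "MyTool", "interface_userfile_ids": [1234]}
resp = requests.post(f"{BASE}/tasks",
                     params={"cbrain_api_token": token},
                     json={"cbrain_task": task})
print(resp.json())
```

This is exactly the kind of interaction the LORIS-to-CBRAIN connection relies on: one platform driving another through its API rather than through the portal.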
Now I would like to focus a bit more on the big problems related to science gateways for neuroimaging at the moment. The first one I would like to talk about is reproducibility. Actually, a question that I think science gateway developers and users should ask themselves is: how do you avoid breaking reproducibility with a science gateway? We talk a lot about reproducibility, but what can we do at the science gateway level? I'm going to refer to Roger Peng's definitions, because we talk a lot about reproducibility, sometimes with different meanings behind it. Basically, what I'm aiming at in this presentation is to analyze the same data with the same software again and try to get exactly the same results. This is different from replication, where independent investigators may acquire another dataset to try to replicate the same scientific findings. Here we focus on exactly the same dataset, the same software, and we hope to get the same result.

So, how can you basically screw it up with a science gateway? The first thing you have to care about is anonymization, or de-identification, of data. This is a paper published last year in the social sciences, but I think most of it also applies to the neurosciences. It shows that the conclusions one can draw from a de-identified dataset are significantly different from those drawn from the original dataset. Why is that so? Well, when you de-identify data, you remove personal information, and then you typically do two things. You generalize, which in neurosciences means, for instance, replacing the date of birth with the year of birth, because you don't want the information to be too specific. And you suppress records that are too easily identifiable: if there is a single subject who is 102 years old in a dataset, then everybody will know who that subject is, because there is only one, so you'd better suppress the record before you make the data available to the public. So that's a risk: if you are using a science gateway that implements de-identification for you, you should wonder about it. I don't have a slide on it, but other anonymization methods could matter too; defacing in medical imaging, for instance, where you remove the face of the patient or subject so that they are not recognizable, may also create reproducibility issues.
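To make the two operations concrete, here is a toy sketch of generalization and suppression in Python. The field names and the age threshold are invented for the example; a real implementation would suppress based on uniqueness within the dataset, not a fixed cutoff.

```python
# Toy illustration of the two de-identification operations discussed
# above: generalization and suppression. Field names and the age
# threshold are invented for the example.
from datetime import date

subjects = [
    {"id": "s01", "birth_date": date(1985, 3, 14), "diagnosis": "MS"},
    {"id": "s02", "birth_date": date(1914, 7, 2),  "diagnosis": "MS"},
]

def deidentify(records, max_age=90):
    released = []
    for rec in records:
        age = date.today().year - rec["birth_date"].year
        # Suppression: drop records that are too easily re-identifiable,
        # e.g. a single very old subject.
        if age > max_age:
            continue
        # Generalization: keep only the year of birth, not the full date.
        released.append({"id": rec["id"],
                         "birth_year": rec["birth_date"].year,
                         "diagnosis": rec["diagnosis"]})
    return released

print(deidentify(subjects))  # s02 is suppressed, s01 is generalized
```

Both operations change the released dataset relative to the original, which is precisely why analyses run on the de-identified data can reach different conclusions.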
Another maybe trivial but still important point is format conversion. This is a paper published in May this year, I think, showing how the conversion from DICOM to NIfTI can create reproducibility issues and actually disturb a dataset. So again, if the science gateway promises to convert your DICOM data to NIfTI, you should at least do a visual inspection of the result to be sure that it's okay.

The next way for reproducibility to be disturbed by a science gateway is the handling of software versions. This may look like a geeky topic, but it's actually very important. This is a paper published in 2012 showing the effect of the FreeSurfer version on actual findings: depending on the version of the software you are using, you could get very different results in your study. The conclusion of the paper was that users are discouraged from updating to a new major release of FreeSurfer during a study; you should fix a version. I would add that you should make sure that the science gateway precisely identifies software versions and is not going to make updates without your consent. And I would add another message, this time to software developers: they should be encouraged to use version tracking and tagged releases as much as possible, because if we are trying to share code through a science gateway, and this echoes what JB was saying in the previous talk, and this code is not even properly version tracked, then we are really asking for trouble.

Of course, you're going to say: everybody uses version tracking, this is so obvious, we're all using Git or whatever other fancy version tracking system. Well, this is not quite the case. I recently worked with the France Life Imaging team on two MICCAI challenges related to the segmentation of multiple sclerosis images and PET images. We integrated 24 pipelines into VIP over the summer, and among these 24 pipelines, only four used version tracking and zero used release tags. So really, this is a message to developers: if you want your code to be integrated in a science gateway, use Git, use GitHub, and you can find even more practical tips in this presentation from Pierre Bellec. Science gateway developers and administrators also hold a responsibility here. The slide is not very nice, but this is the beginning of our application porting workflow for CBRAIN and VIP: step zero is to check exactly that, namely that your application is versioned, portable and properly documented, and if it's not the case, we basically advise people to improve their code first.
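As a toy illustration of what "properly versioned" buys you at the gateway level, here is a sketch of how a wrapper might record provenance before a run: the exact tool version and the Git commit of its source tree. The tool name "mytool" and the repository path are hypothetical.

```python
# Sketch of recording provenance before a pipeline run: capture the
# tool's reported version and the Git commit and tag of its source
# tree. "mytool" and the repository path are hypothetical.
import json
import subprocess
from datetime import datetime, timezone

def run(cmd, cwd=None):
    return subprocess.run(cmd, cwd=cwd, capture_output=True,
                          text=True, check=True).stdout.strip()

provenance = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # Most tools expose some --version flag; adjust per tool.
    "tool_version": run(["mytool", "--version"]),
    # Exact commit and nearest tag of the source tree, if Git-tracked.
    "git_commit": run(["git", "rev-parse", "HEAD"], cwd="/opt/mytool-src"),
    "git_tag": run(["git", "describe", "--tags", "--always"],
                   cwd="/opt/mytool-src"),
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

None of this works, of course, if the code is not version tracked in the first place, which is why step zero comes before any porting effort.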
Still on reproducibility, what else can go wrong? After anonymization, data conversion and software version tracking, there is also something to say about the operating system itself and its effect on the results. This is a study we published last year comparing the effect of the operating system on a variety of neuroimaging pipelines, using two of the main Linux distributions. What we can conclude from this study is that, basically, the shorter the pipeline, the less likely it is to show reproducibility issues, which is explained by considerations of numerical stability. If you take a very short pipeline like FSL FAST, which I'm showing here, where the goal is to classify brain images into four classes, we see some differences between operating systems, but they can mostly be considered noise. What you see here is a map showing the sum of binarized differences between the classifications obtained on two different operating systems, summed over 150 subjects. So a red dot of this color means that among the 150 subjects, one subject had a difference at this voxel. You can see it's really a noisy pattern.

If you take a longer pipeline, and the example here is FSL FIRST, so segmentation of subcortical structures, the differences can be more important. If you look at this yellow region, you can see that the segmentations are really different under the two operating systems. We measured Dice coefficients down to 0.6, which, given that the only variable was the operating system, is actually quite worrying. We did some detailed analyses to try to explain where these differences were coming from. And if you take an even longer pipeline, we took FreeSurfer in this case, and compare its execution on two different operating systems, we can even reach statistical significance in some regions. So this is also a bit worrying if a study is run on multiple operating systems. The message to science gateway users and developers is that the operating system should really be fixed when conducting a study.
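For reference, the Dice coefficient used above measures the overlap between two segmentations. Here is a minimal numpy sketch with toy data standing in for the same subcortical mask produced under two operating systems.

```python
# Dice coefficient between two binary segmentations, e.g. the same
# FSL FIRST run executed on two different operating systems.
# The arrays here are toy data for illustration.
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice = 2 * |A and B| / (|A| + |B|); 1.0 means identical masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

rng = np.random.default_rng(0)
seg_os1 = rng.random((64, 64, 64)) > 0.5        # toy mask from OS 1
seg_os2 = seg_os1.copy()
flip = rng.random(seg_os1.shape) < 0.25          # perturb 25% of voxels
seg_os2[flip] = ~seg_os2[flip]

print(f"Dice: {dice(seg_os1, seg_os2):.2f}")     # well below 1.0
```

A Dice of 0.6 between runs where the only variable was the operating system is the kind of discrepancy the study reports, and it is far from the 1.0 you would naively expect.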
Fixing the operating system is actually not so easy, especially for systems that use many high-performance computing sites, because the science gateway developer or administrator doesn't always control the operating systems deployed on the different sites; that's up to the system administrators of those sites. So how can we solve that? The current answer is containerization, which is a way to avoid these reproducibility issues, and I'm going to say a few words about it. Containers have actually been around since the 1990s, or even the 80s, I think; there were very early projects on this, but they recently emerged, especially with the Docker system that I'm sure you've heard of.

So why are containers so interesting and useful? Containers provide a virtualization framework at the level of the operating system, which means that, contrary to regular virtual machines, you are not booting a whole machine; you are reusing the already booted system, the Linux kernel in most cases, and just switching the context on top of it. This means you can deploy a complete operating system on top of an existing booted machine. Why is that nice? Well, containers boot much faster than traditional virtual machines, which means we can actually boot one container per task; it only takes a fraction of a second to boot a new container. This is interesting because you no longer have to wonder when and how to deploy your virtual machines: you just boot one container per task. The ecosystem, especially for Docker, is also very easy to use. It's available on most systems, well documented, and actively developed, which means that tool developers can package their own tools instead of science gateway administrators doing it for them. Containers are also easy to share and search. Many projects have emerged; Docker was one of the first, today we also talk about Singularity and others, and all these projects are federated in the Open Container Initiative.

However, containers are not perfect. If you don't like computers, maybe you should check your email now, because this slide is going to be a bit technical. This is an illustration of how things can still go wrong with containers. Here you can see a very simple Dockerfile where we build a container for the MRtrix application. It doesn't do anything fancy: it installs some packages, then downloads the MRtrix code and compiles it. These are the commands to actually build the container, and this is an example of the container running on one particular machine: the command is docker run, you run your application, everything goes well, you get your data at the end, and the files are there. Now, this shows the execution of the same container on another machine. We run exactly the same command, the container is transferred, and we get an "illegal instruction" error message. This actually breaks reproducibility, and we may wonder why the container behaves differently on these two machines. The answer is that the compilation step was architecture-specific, so it was very dependent on the hardware; and as I was saying before, the hardware is not virtualized by containers. So this is one of the cases where containers can go wrong.

Based on that, we started the Boutiques initiative a year and a half ago to help people share tools between different science gateways. Boutiques is a framework that aims at reducing tool porting time, enabling tool sharing across different platforms, and improving reproducibility in science gateways through containers. The principles of Boutiques are as follows. First, all the tools are supposed to be shipped in a container; we support Docker at the moment, but we are talking about supporting Singularity and other types of containers. So you ship your implementation as a container, but this is not enough, because you also need to describe how the executables in the container are to be invoked. This is the role of the second part of Boutiques, a JSON language to describe the parameters of your application, how they are supposed to be invoked, the dependencies between these parameters, and so on.

Such a JSON description could seem straightforward: you just describe the inputs and outputs of your application. The reality is a bit different, because tools can be very complex. Most of the tools we use in neuroimaging have dozens of parameters; these parameters have dependencies between them (if you use parameter A, you are not supposed to use parameter B unless parameter C is defined, for instance); they have types, including complex types like lists; and they are grouped into consistent sets related to particular aspects of the application. Our assumption with Boutiques is that we should support all of this extensively, because if an application is well described, with all the constraints on its parameters, we can do more validation at the science gateway level, which means fewer errors for the users at the end of the day.

This is a simple video snapshot showing how FSL BET was integrated in CBRAIN before Boutiques. You can see that, for instance, users could enter negative values for this parameter although that is not permitted. We had a reduced number of parameters, and there were no real consistency checks between the different flags, because it takes time and effort to implement all this, and there are a lot of tools where it is required. With Boutiques now, and I should say that all of this is automatically generated from the JSON descriptor of FSL BET, you can see that we have many more parameters and more checks as well: if we enter a negative number, it's automatically detected as wrong; we have parameter groups; and if we click on one flag, it automatically disables the conflicting ones, et cetera. So the message is: if we all share our tools and describe them consistently, we can have better validation at the science gateway level.
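To give a flavor of what drives this generated interface, here is a much-simplified, Boutiques-style descriptor, written as a Python dict for readability. Real descriptors are JSON documents validated against the Boutiques schema; the fields below approximate that schema but are not a complete, validated descriptor.

```python
# A much-simplified, Boutiques-style tool descriptor. Real descriptors
# are JSON documents following the Boutiques schema; the fields below
# are an approximation for illustration, not a validated descriptor.
import json

descriptor = {
    "name": "bet",
    "tool-version": "6.0.1",
    "description": "Brain extraction (FSL BET)",
    # Template of the command line; placeholders are substituted
    # with the values of the matching inputs.
    "command-line": "bet [INFILE] [OUTFILE] [FRACTION]",
    "container-image": {"type": "docker", "image": "org/fsl:6.0.1"},
    "inputs": [
        {"id": "infile", "type": "File", "value-key": "[INFILE]"},
        {"id": "outfile", "type": "String", "value-key": "[OUTFILE]"},
        {"id": "fraction", "type": "Number", "optional": True,
         "value-key": "[FRACTION]", "command-line-flag": "-f",
         # Constraints like these let the gateway reject bad values,
         # e.g. the negative intensity fraction mentioned above.
         "minimum": 0, "maximum": 1},
    ],
}

print(json.dumps(descriptor, indent=2))
```

Because the constraints live in the descriptor rather than in hand-written portal code, every gateway that imports the tool gets the same validation for free.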
I'm going to conclude the talk with a few words on interoperability, starting with the observation that, let's be realistic, not everybody is going to use the same science gateway. Software has an expiration date: at some point it becomes obsolete, it requires a lot of maintenance, and it just makes sense that different groups will keep developing their own science gateways for their own needs. So what can we do about that? One approach would be to adopt a centralized platform for reproducible science, and there are actually a few systems advocating that at the moment. Of course, it's easy: if we all use, say, Google Docs, a central system, it's very easy to share your data or your documents with others, because everybody is using the same system. You can easily share code, and it's easy to re-run analyses, because everybody is basically using the same platform. However, there are also important scalability issues, sustainability issues (what happens if this platform dies?), and privacy and governance issues: do we really want to give control to this central place? Maybe not.

So instead, the model we are after is similar to what the web is today: a decentralized network of science gateway platforms. We envision a common network where different science gateways could emerge, connect, and provide different kinds of services. The types of services provided in this network would differ depending on the domain and the type of data addressed: we could have data storage platforms, computing platforms, tool repositories; search engines, of course, are critical in this respect. In this model, the game becomes a little different, because everybody could start their own science gateway if they wanted and connect it to the network. It doesn't mean there wouldn't be central hubs with more power and importance, but things would be more open and decentralized, which I think is required in an open science environment.

So what do we need for this? Basically two things: common repositories of tools and data, and common interfaces, that is, common APIs to consistently and uniformly access all these different science gateways. There is, of course, a whole talk to be given about common repositories and common APIs. I'm just going to mention the Boutiques system again. Tool sharing, as I mentioned before, is a way to facilitate tool integration in science gateways, but it could also become a way to share tools between different science gateways. Today we have connectors from the NIAK and Nipype frameworks to Boutiques, so you can export tools from NIAK and Nipype to Boutiques, and once a tool is available in Boutiques, you can import it in VIP, in CBRAIN, and in the Pegasus workflow engine.

A final word about APIs. In the France Life Imaging infrastructure, we are designing a common API for the science gateways related to neuroimaging developed in France. This API is called CARMIN, a common web API for remote pipeline execution. It's still work in progress, but it already allows clients to consistently start and stop pipelines on different platforms, monitor their executions, access files and directories, access studies, and administer users. So having common APIs is useful for interoperability.
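To illustrate the idea, here is a sketch of a client talking to a CARMIN-style API to launch and monitor a pipeline execution on a remote platform. The base URL, authentication header, route names and payload fields are assumptions in the spirit of the API, not the exact CARMIN specification.

```python
# Sketch of a client driving a CARMIN-style API: discover pipelines,
# launch one, and poll its status. Routes and payload fields are
# assumptions for illustration, not the exact CARMIN specification.
import time
import requests

BASE = "https://vip.example/carmin"      # hypothetical endpoint
HEADERS = {"apikey": "MY_SECRET_KEY"}    # authentication scheme assumed

# Discover the pipelines the platform exposes.
pipelines = requests.get(f"{BASE}/pipelines", headers=HEADERS).json()

# Launch one of them on an input file already on the platform.
execution = requests.post(f"{BASE}/executions", headers=HEADERS, json={
    "pipelineIdentifier": pipelines[0]["identifier"],
    "inputValues": {"infile": "/data/subject01.nii.gz"},
}).json()

# Poll until the remote execution finishes.
while True:
    status = requests.get(f"{BASE}/executions/{execution['identifier']}",
                          headers=HEADERS).json()["status"]
    if status in ("Finished", "Failed"):
        break
    time.sleep(30)
print("Execution ended with status:", status)
```

The point is that the same client code would work against any platform implementing the common API, whether it is VIP, CBRAIN, or another gateway in the network.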
I asked this question to the speaker in the previous session, but I really think that INCF could play a role in standardizing APIs across the different science gateways. And it also opens up a whole new set of services. Once these science gateways are connected into a global network, we could think of common benchmarking services, for instance to benchmark the execution of a tool on different science gateways; we could think of credit services, for instance to know which tools are used across the network; et cetera. So it would really open a whole new set of services. I guess I'm going to stop here. I just want to thank all these people for their work, support and contributions, and of course you for your attention. Thanks a lot.

Thank you very much, Christophe, for this comprehensive overview of the issues facing us. Questions for Christophe?

Thanks for the talk, very interesting. I definitely agree with you that we need some kind of re-executable Docker container, or any other technology. But if we take a pragmatic approach right now and say: I have my tool packaged in a Docker container, and I want to run it on EGI, on biomed, or on an HPC cluster, there is no Docker engine running on these infrastructures. Do you have any solutions for that right now, or have you started thinking about this?

Yes, of course, thanks for the question. There are various aspects to it. First of all, there are solutions to run Docker on high-performance computing clusters; in Canada we have one cluster with, I think, 20,000 nodes supporting Docker. So one thing we can do is try to convince system administrators to install Docker. The second answer, if that doesn't work, is that we can still use clouds. Clouds are also emerging, including in EGI: there is the EGI Federated Cloud, where you can basically be root on any virtual machine, and therefore install Docker and run it if you want. The other aspect is that Docker is not the only container system. Singularity in particular has been developed especially for this, to simplify deployment on high-performance computing clusters, and there are other initiatives that let you deploy containers without administrative access. So the question now is what type of container system we should use today; it could be a bit worrying and, you know, stressful to make a decision now without knowing what's going to happen in a few years. The good news is that the Open Container Initiative really standardizes container formats so that they can be translated from one to another.

Thank you very much. Just to be a little bit controversial: we ran into the same problem when relaunching analyses, with differences due to the operating system. But compared to the other problems, I guess the Dice index differences are very small, compared to, say, people relaunching analyses with different parameters and all those things. Why is this really important compared to the other problems? I can only see it as a small part of the problem.

So, I definitely agree with you that this is a small part of the problem, but as infrastructure developers, I think we should get it right and solve it, because there will be, and this I can guarantee, biological problems where this could completely screw up the analysis. You know, if we are talking about Dice coefficients of 0.6, this is really not supposed to happen. If you are studying the amygdala, and you are using a science gateway that doesn't get this right, then your results are probably going to be completely wrong. So I'm not saying this is the whole problem; it's only one aspect of it, but we should get it right.