to present today what we've been doing on open and FAIR data, specifically for COVID-19. Within ELIXIR Belgium we've spent quite some time over the last year improving the FAIRness and openness of COVID-19 data, and we did this in the context of ELIXIR; I'm Head of Node for ELIXIR Belgium. Just to introduce this: ELIXIR is an intergovernmental organisation. It has nodes across Europe in 23 countries, and its aim is to bring together the bioinformatics resources available in all of these countries and make one coherent ecosystem out of them. This spans databases, tools, training materials, standards and compute resources: basically all the infrastructure you need to run bioinformatics analyses. The way ELIXIR is organised is that there are five platforms. These combine the technical experts across all these nodes around, for example, tools or interoperability. And we have communities, which ground us in actual research and make the link with researchers: the technical solutions developed or agreed upon in the platforms are tested and grounded in real use cases in these communities. These communities are very broad, ranging from plant sciences, proteomics and metagenomics to different aspects of human data. To actually make data open and FAIR, ELIXIR has many different mechanisms; doing this is really the reason why ELIXIR exists. There is a set of Core Data Resources defined. These are services, such as UniProt or Europe PMC, whose disappearance would severely affect the majority of life science researchers. Next, we have the data deposition databases. This is the mechanism we use to make data publicly available for all researchers in Europe and beyond. The European Nucleotide Archive will feature later in the presentation and is one of many deposition databases. Interoperability is also a really important aspect. We have a list of Recommended Interoperability Resources.
These are services that provide added value, or registries that improve interoperability across all of these bioinformatics resources. FAIRsharing.org is one prominent example. And there are a number of other registries that collect software in bio.tools, workflows in WorkflowHub.eu (we'll talk about that later on), but also containers, training materials and the standards I already mentioned. So ELIXIR really tries to bring all of this together and make working with it as fluent and seamless as possible for researchers.

Then, switching gears to what the COVID-19 Data Platform is. About a year ago, the European Commission, together with EMBL-EBI, the European Bioinformatics Institute, started this project, the European COVID-19 Data Platform, to gather relevant data on COVID-19, to facilitate research on those data, and to try to get us out of this pandemic as soon as possible. This has grown into an elaborate collaboration across Europe and beyond. It works together with the national infrastructures, because not all data can be shared publicly in a central repository. It also works closely with the European research infrastructures; ELIXIR is coordinating this, but many other life science research infrastructures are involved as well, for example on clinical trials or biobanking. We are now expanding this to other disciplines too, such as the social sciences or epidemiological and healthcare data. And all of this links to the global initiatives: things are happening everywhere in the world, so we are also making sure that this aligns with what is happening there. All of this together allows us to accelerate COVID-19 research by making data available and accessible to researchers. We do this in a number of projects; we are specifically involved in ELIXIR-CONVERGE on data management and in EOSC-Life around the analysis. The COVID-19 Data Portal holds a lot of different data types. It started mainly with the molecular ones.
We are now also expanding into the social sciences, for example. Today we'll mainly be talking about viral sequences, as that is what we've been focusing on here in Belgium and within ELIXIR Belgium. The focus of the presentation today will be, on the one hand, how we can make data from Belgian research available in the COVID-19 Data Portal, and specifically the ENA, and on the other hand, how we can use the data available there and analyse it through Galaxy, with workflows that are publicly shared. And with that, I give the word to my colleague, Miguel, to take you into the submission of viral data.

Thanks, Frédéric. So, like he said, we're going to focus on sequence data, and specifically on sequence data submission. Right now, if you have SARS-CoV-2 sequence data, there are two places where it can be deposited. The first is GISAID, the Global Initiative on Sharing All Influenza Data. And there is also the European Nucleotide Archive, the ENA, which is part of the INSDC, a global collaboration between nucleotide repositories. GISAID takes consensus sequences of SARS-CoV-2 and is tied to some neat visualisation tools like Nextstrain. Access to GISAID, both for getting data and for submitting data, is via registration by mail, and it also has quite a few restrictions on data reuse. The ENA, on the other hand, is fully open. It has a more sophisticated metadata model that goes along with the data, and it takes very different types of sequence formats: it will take the consensus, but also the raw reads, and it will take annotations and genes. And like Frédéric mentioned, it is also integrated with the COVID-19 Data Portal, along with other platforms like Galaxy. What we observed was that in Belgium, most researchers were submitting their SARS-CoV-2 sequences to GISAID and none to the ENA. So our objective was to change this, that is, to increase the number of SARS-CoV-2 submissions to the ENA.
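Because the ENA is fully open, its holdings can also be queried programmatically. As a minimal sketch, the snippet below builds a query URL for the public ENA Portal API (a real EMBL-EBI service) to list SARS-CoV-2 sequencing runs; the selected result fields are illustrative, not exhaustive.

```python
from urllib.parse import urlencode

# Public search endpoint of the ENA Portal API at EMBL-EBI.
ENA_PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def sars_cov_2_read_run_query(fields=("run_accession", "country", "collection_date"),
                              limit=100):
    """Build a search URL for SARS-CoV-2 (NCBI taxon 2697049) raw sequencing runs."""
    params = {
        "result": "read_run",        # raw sequencing runs, as deposited in the ENA
        "query": "tax_eq(2697049)",  # restrict to the SARS-CoV-2 taxon ID
        "fields": ",".join(fields),  # columns to return
        "format": "tsv",
        "limit": str(limit),
    }
    return f"{ENA_PORTAL_SEARCH}?{urlencode(params)}"

url = sars_cov_2_read_run_query()
print(url)
```

Fetching that URL (for example with `curl`) returns a tab-separated table of matching runs, which is the same openness that lets the COVID-19 Data Portal and Galaxy pull these data in automatically.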
For this, we decided to lower the technical barrier for bulk submissions, that is, for large numbers of sequences. We did this by developing a tool and featuring it in Galaxy, which is a user-friendly environment, and also by changing and simplifying the metadata input, which can be a bit challenging for users. In parallel, using this tool, we ran a brokerage service for Belgian researchers, in which we identified labs that were submitting sequences to GISAID but not to the ENA and offered to broker their submissions to the public repository, to the ENA. We did this for the Institute of Tropical Medicine as well as for the ULB, and as things stand, these are the only Belgian raw sequence data you can find in the ENA. So what does the submission tool look like? Well, there is a command-line interface tool, the ENA upload tool, which was developed in collaboration with Björn Grüning's group in Freiburg. This tool takes the data, so the sequences, and the metadata, that is, data describing the sequences, such as location, age of the patient, and so on. And it has a one-step submission process, which is already much simpler than the current submission process to the ENA. To simplify the process further, this tool was wrapped into a Galaxy tool, since Galaxy is an environment that is much more user-friendly. We'll discuss next what Galaxy is and how it makes data analysis and submissions like this easier for the user.

So what is Galaxy? Well, it's a data analysis platform. It is web-based and very easy to use. It is free, it is open source, and it has many, many tools: over 8,000 tools at this moment. And as you can see from the graph, it is increasingly popular, as reflected by citations of the use of Galaxy in different papers. On top of all this, there is support for users in the form of tutorials, trainings, and so on. Like I said before, Galaxy is easy: it is web-based, so there is no installation needed.
You don't need any bioinformatics or programming background to be able to run complex analysis tools and generate visualisations of data. It is also reproducible: Galaxy keeps track of all the analyses done on the data. And it is reusable and transparent, in the sense that users can share all these steps, which are called a history; they can also share the workflows they develop and the visualisations of their data. On the technical side it is also scalable: Galaxy can be run from your laptop, from your institution's servers, or from the cloud. And as Frédéric mentioned, it is cross-domain: it started out with bioinformatics tools, but it has now expanded to include tools for chemistry, ecology, climate science, you name it. You can do analyses in all sorts of disciplines with Galaxy. So this is what it looks like. There are three panels. The one on the left has all the available tools, with a little search engine to find the one you're interested in. The panel on the right is called the history panel and holds all your data objects as well as the results of your analyses. And the middle panel is where the tool you're running sits; here you can choose what data to analyse, change the parameters of the tool, and execute it. So which tools are available? There were 8,000 tools as of February this year. Here I made a little top-10 list. The most popular ones are still bioinformatics tools, but you can see that even statistics, visualisation and text manipulation tools made it into the top 10. Galaxy also has something called the IUC, a group of people who maintain high-quality tools and support tool developers with best practices. Tools can be concatenated, or combined, to make workflows, and in Galaxy this is very simple because you can do it via a graphical user interface. Ignacio will discuss workflows in more detail next.
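To give a flavour of what such a chained-together workflow looks like under the hood: Galaxy stores workflows as JSON (".ga" files). The sketch below builds a heavily reduced two-step example, an input dataset wired into one tool; real exports carry many more fields (positions, tool state, versions), and the tool id shown is a made-up placeholder.

```python
import json

# Simplified sketch of Galaxy's native workflow format (a ".ga" JSON file).
# Fields are reduced for illustration; "fastqc" is a placeholder tool id.
workflow = {
    "a_galaxy_workflow": "true",
    "format-version": "0.1",
    "name": "Minimal two-step example",
    "steps": {
        "0": {
            "id": 0,
            "type": "data_input",  # the uploaded dataset
            "name": "Input dataset",
            "input_connections": {},
        },
        "1": {
            "id": 1,
            "type": "tool",        # a tool consuming step 0's output
            "name": "FastQC",
            "tool_id": "fastqc",
            "input_connections": {
                # wire step 0's output into this tool's input
                "input_file": {"id": 0, "output_name": "output"}
            },
        },
    },
}

ga_text = json.dumps(workflow, indent=2)
print(ga_text)
```

The point is that the graphical editor writes this structure for you; users never need to touch the JSON, but because it is a plain file it can be exported, versioned and shared, which is what makes the workflow registries discussed later possible.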
Like I mentioned before, the Galaxy community spans very many different disciplines, and some of these actually have their own dedicated Galaxy instance, populated with the tools relevant to that particular discipline. If you would like to use Galaxy, there are three main public servers you can access. Ignacio also looks after the Belgian Galaxy instance, usegalaxy.be, and there are many other domain-specific Galaxy instances you can try out. The last thing is that there is support for users: there is something called the Galaxy Training Network, which has collected and developed tutorials and trainings covering, well, here, 21 topics, both for users and for Galaxy developers and administrators. One final note is that VIB and ELIXIR Belgium are organising the next Galaxy Community Conference in Ghent, but of course, given the circumstances, it is going to be held online. And now I pass on to Ignacio, who will discuss the use of Galaxy for COVID-19.

Thank you, Miguel. So I take over from here: I'm going to talk about Galaxy-specific analyses for COVID-19 data. The first thing I want to discuss is why we chose Galaxy to analyse COVID-19 data. For us it was quite an easy decision, because we have been using Galaxy for a long time. But if you look at it in an unbiased way, you can see that Galaxy really is a platform oriented towards analysing open data. It has a lot of tools and connections to open data repositories, like the ENA and everything Frédéric already mentioned. It is also quite user-friendly, aimed at biologists and people from the wet lab who have no idea how to install it or its dependencies, but who can really generate biological knowledge out of open data. It is also deployed in many countries, on many nodes, especially within ELIXIR, so as a user you can simply get the resources required to do the analysis.
So you don't really need to have those resources available in your lab, and you don't need to use your local resources. It is also a production-grade service: it has been used for a long time, as I said, not just by us but by a large community. So there was nothing that really needed to be developed in terms of the analysis platform. We just need to encourage people to submit and publish high-quality data to the open platforms, through the submission tools Miguel mentioned, and then we take care of the analysis. As part of the community effort that Miguel mentioned before, and since there are quite different types of communities in the Galaxy environment, as soon as the data started to flow at the start of the pandemic last year, the repositories started being populated with open data, and the community started developing different kinds of workflows to analyse it. This was relatively easy to do, because we created a single repository that describes each of these workflows and acts as a repository for the workflows themselves. Anyone who goes to covid19.galaxyproject.org can access the description of what each of these workflows does. They are highly curated, they have many versions evolving over time, and you can simply export them and start using them with your own data or with all the public data. This is just an example of the things deposited there; as you can see, it goes all the way from public data to actual reporting of the knowledge you extract from the analysis. So what's the idea of this? The idea is that, shortly after the open data started to flow into the repositories, we quickly developed a lot of workflows that are publicly available, we put them in a repository, and everyone with some biological knowledge who can take care of this can run them on those public instances. The key point here is that we are trying to demonstrate one of the big aspects of open data.
And that is that open data can really accelerate the development of knowledge. Because we are using an open platform for data analysis, anyone can run analyses with it. This is basically the idea explained in this paper. OK, so I mentioned that we put everything in one repository and that we share these workflows with the community. This is simple because we are focusing on one type of workflow, related to COVID-19 data analysis. But what if a researcher wants to do some other kind of analysis, on other datasets? This is a problem we already knew about beforehand, but it really accelerated and became visible during the pandemic, because people started wanting to do more and more analyses, and they wanted to search for workflows, for workflows of really high quality that they could reuse. With this in mind, of course, the idea of creating a single repository is not going to scale to a repository for every kind of workflow and every kind of situation. So we collaborated with other research groups to develop what I'm going to talk about next, which is WorkflowHub. So, what is important about workflows? If you look at publications, or even just go to a conference and talk with other people about what they mean by a workflow and how they actually produced a published biological analysis, you will see that a workflow can be anything. It can be anything from a bash script to a Galaxy workflow, and it can be put in different kinds of repositories. Some people keep them on GitHub, which at least is normally open; sometimes they put them in specialised or local repositories. So it is really difficult to get an idea of what a workflow is.
Nevertheless, for us, and especially given what I mentioned before, it is clear that a workflow contains a lot of knowledge in itself: it represents all the knowledge that the researchers obtained and collected to actually do the analysis. They certainly tried different tools and different versions of the workflow, and in the end they produced the analysis using one specific version of it, one specific instance of a workflow. So for us, workflows are first-class objects. We want to treat them like that, and we want to create an environment to use and reuse them. Now, workflows do have complex definitions. Sometimes the definition and description exist only in the context of a publication, in natural language, and are difficult to extract unless you read the whole publication in all its detail; sometimes the details are not even really described in the publication. So we need a few things to actually treat workflows as FAIR objects, and for this we decided we need two specific pieces. The first is to create, or use, a standard method that lets us package the workflow data, that is, the datasets that represent the workflow, together with the metadata, in a standard language we can extract information from. The second is to have a kind of registry that also works as a search engine, so that someone without knowledge of any particular repository can search for workflows based on what they want to do, and actually find them. First, then, let me talk about RO-Crate. RO-Crate is a really lightweight approach for packaging research data, in this case a workflow instance, a workflow definition, together with its metadata. For this it uses two things. First, schema.org, which describes different types of assets.
In this case the asset will be a workflow instance, but it could be any kind of dataset, it could be external resources, it could be anything: schema.org describes a structure for this and provides a language to describe it. And second, it uses JSON-LD, which is a well-established method for making sense of linked data on the semantic web. So, to be clear, what RO-Crate does is put into a package, which could be just a directory folder or a zip file, the files you want to include, in this case workflow instances, together with a metadata file in this JSON-LD format. In that file you describe all the elements you put in the package, using schema.org. As I mentioned, we want to package a workflow, so we need a schema type that represents a workflow. For that, through the Bioschemas.org project, we extended the schema with a specific type that represents a computational workflow. If you go to this link, you can see all the details you need to fill in to represent the metadata of a workflow: you have to define the inputs, the outputs, the creator, and so on. Once you have all these values and keywords for the workflow, you have metadata describing it. The other thing I mentioned we would need is, of course, a registry. A registry collects all this metadata around workflows and lets you search it. The key aspect is that we are now treating workflows as FAIR objects, so we need identifiers and persistence. We also, as I mentioned, have much richer metadata and packaging following community standards, so the description of the workflow is standardised. And of course, as I said before, researchers develop workflows over time: the tool versions evolve, and the workflow itself can evolve.
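To make the packaging just described concrete, here is a much-reduced sketch of the `ro-crate-metadata.json` file such a package carries. The RO-Crate 1.1 context and profile URLs are the published ones; the workflow file name and its descriptive name below are made-up placeholders.

```python
import json

# Minimal sketch of an RO-Crate metadata file describing one workflow file.
# "consensus-calling.ga" and its name are illustrative placeholders.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata descriptor, pointing at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root dataset: the package (folder or zip) as a whole
            "@id": "./",
            "@type": "Dataset",
            "mainEntity": {"@id": "consensus-calling.ga"},
            "hasPart": [{"@id": "consensus-calling.ga"}],
        },
        {   # the workflow instance, typed via the Bioschemas profile
            "@id": "consensus-calling.ga",
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
            "name": "SARS-CoV-2 consensus calling (example)",
        },
    ],
}

crate_json = json.dumps(crate, indent=2)
print(crate_json)
```

Because the whole description is schema.org terms serialised as JSON-LD, any tool or registry that understands those vocabularies can read the inputs, outputs and creator of the packaged workflow without knowing anything about the workflow system that produced it.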
So all this evolution, the steps in the life cycle of the workflow, is also registered in this registry, together with the provenance. What is also important is that we have a machine-readable way to discover the workflows: no matter what kind of workflow it is, you can always query for it using the standard metadata, independently of whether it is a Galaxy workflow, just a bash script, or built with some other workflow management system. The idea is that the makers of the workflows submit the metadata associated with a workflow and keep track of it. It is important to note that this is not an execution platform; the portal just keeps track of workflow instances and the metadata associated with them. This is pretty much what you see if you enter the web page; the most important parts are, of course, how to search for workflows and how to contribute to the repository. And you can certainly contribute: it is open to anyone. Do take into account that it is a beta release that was somewhat accelerated last year, so any input is welcome. If you search for a workflow using any kind of keyword, an entry looks something like this. You have metadata, for example the number of steps, the inputs and the outputs, and you have usage metrics for the entry. As I mentioned before, the packaging method and standard we use to store a workflow instance is RO-Crate, so you can always export the RO-Crate from the website, and there you will see the details of what kind of metadata an RO-Crate includes. As I said, it is not an execution platform, but since it collects a lot of metadata, you can add services on top of it. For example, knowing that a workflow is of the Galaxy type, you can link it to an execution platform, in this case usegalaxy.eu.
So you just click on that button, you go to the execution platform, and you can run it. Now it's Miguel's turn again.

Right. So the last thing we want to say about the SARS-CoV-2 sequence submission tool concerns its packaging. I mentioned that the command-line tool has been wrapped in a Galaxy tool to make it more accessible to researchers who don't necessarily have a bioinformatics background. On top of that, we packaged Galaxy itself, with the upload tool, the human-reads cleaning tool and some of the other objects you need to do a submission, into a Docker container. So now there are different ways of deploying and using the tool. First, you can download and deploy the container wherever you choose: on your laptop, on a server at your institution, or on a cloud service. Second, we have an instance of our own: we have deployed this submission container at covid19.usegalaxy.be, and it is running with our brokering credentials. This facilitates submissions for Belgian researchers, who can submit under our brokering credentials. And the last way of using the submission tool is to go to one of the public Galaxy servers, where the tool should be findable. A little bit more about our brokering: this is a dedicated instance which, as I mentioned, carries our brokering credentials for the ENA. It runs on the Flemish Supercomputing Center cloud, so it has the resources to do analyses and the bandwidth to upload large amounts of data to the ENA. If you choose to use this submission container, this is what you will get. As you can see, there are not as many tools, because it is dedicated to submission only, and there is some documentation on how to use the tool. For the full documentation, you can go to our Research Data Management pages, where there is thorough documentation on the use of the submission tool.
On top of that, we also have a walkthrough: a screencast you can watch if you prefer video to reading the documentation. I think that's the last point on the submission tool, and now I turn it over to Frédéric for concluding remarks.

Yes, indeed. Thank you, Miguel. So what we have tried to do here is use the infrastructure that was already available before the whole pandemic started, from the Supercomputing Center to Galaxy to all the tools that were already integrated, and the connection we had with EMBL-EBI and specifically the European Nucleotide Archive, and really package this all together, as user-friendly as possible, to be able to take data within Belgium, submit it to the COVID-19 Data Portal to make it publicly available for everybody globally, and then use all of this data and analyse it within Galaxy. And as you've seen, we use WorkflowHub to make the workflows being developed available with all the nitty-gritty details. This is being put forward as an exemplar implementation of the European Open Science Cloud. We are doing this in a global context, building on existing and open infrastructure, and I cannot stress hard enough that this existing open infrastructure was crucial to making this happen, because we could leverage it and hit the ground running, so to speak. What do we plan for future work? We want to integrate a few more workflows into this submission container so we can accommodate the usual, the standard, cases. We want to also submit the consensus sequences, the assembled sequences, to the ENA, and we have started exploring simultaneous submission to the ENA and GISAID, as this is a practice many labs already follow. Looking somewhat further ahead, variant data is becoming more and more important, and we are working together with EMBL-EBI on integrating that aspect too, so you can more easily compare which variant is found in which samples and how that evolves across Europe.
We are also looking to integrate these systems into data management platforms such as FAIRDOM-SEEK. Ultimately, all this work around COVID-19 fits into the broader picture of a platform for future pandemics of infectious diseases. Everything we learn here, everything we build here, is meant to build an integrated, sustainable health research and healthcare ecosystem. We have included here some references to all of the work we have been doing, and finally I want to thank our funders. ELIXIR Belgium is funded by the FWO in Flanders, with support from the Department of Economy, Science and Innovation, and specifically for the analysis part we are involved in the EOSC-Life project, which is a Horizon 2020 project. Because this funding was already in place, we were able to switch the majority of the ELIXIR Belgium team very rapidly to focus on COVID-19. I also want to stress that this is a truly collaborative effort, and you have seen some of the people involved talking today. We had a very nice collaboration with the Supercomputing Center, which provides all the resources for Galaxy. We are working closely with the Galaxy team in Freiburg and with the people at EMBL-EBI to make all these links as efficient as possible. And beyond that, we are really working in a global community, and that has been a fantastic resource and a fantastic collaboration to make all this happen. We have done our part, as we have shown: data submission and WorkflowHub were our main focus, but all the other pieces of the puzzle are being provided in collaboration with many partners. With that, I want to thank you for your attention. If you have any questions, you should be able to speak up through the system or put them in the chat box; of course, you can also reach out to us later if you want.
Yes, Frédéric, it's correct that the people in the audience should be able to ask their questions out loud, but if they came into the session in listen-only mode they will have to reconnect to the audio. There is a question from Isabel in the chat: I think the slides will also be shared by Open Belgium. We can also provide the link to the Google Slides, because that's how we made them; I can put the link in the chat now, and later on they will be available on the Open Belgium platform.

Yes, correct. And thanks, Ignacio, for putting in the link. Then the second question: how do you envision collaboration between ELIXIR at the EU level and the emerging European Health Data Space? Yes, so this is what we are currently planning. There are a number of upcoming Horizon Europe calls, I'm not sure whether they have all been publicly announced yet, but they will be very soon if not, precisely to make that connection: to go beyond the viral data, beyond the molecular data, and move into healthcare data and specifically the European Health Data Space. These projects are being written as we speak, involving all the relevant partners across Europe, to make sure this data is interlinked. The European Health Data Space is in a relatively early phase, and I don't have the details yet because the projects are still being developed, but that is basically the next thing on the agenda. Are there any more questions or suggestions? Also, if you have data related to COVID-19 that you want to make available, do not hesitate to contact us; we can see how we can help you, either ourselves or by bringing you into contact with the right people at the European level. I don't see more questions at this point. So with that, I want to thank you for your attention, thank Open Belgium for organising this, and have a nice day and a nice week. Thank you.

Thank you very much for the interesting session. I also wish you a very nice day.