Okay, so welcome everybody to the third day of the nf-core hackathon. Today we have an invited speaker, Elin Kronander from SciLifeLab and the National Bioinformatics Infrastructure in Sweden, and she will tell us about data management solutions. So welcome, Elin. Thank you for giving a talk here, and the stage is yours.

Thank you, I'm really happy to be here today and talk to you about data management solutions. Just a brief background: I have a background in biomedicine and did a PhD in neuroscience at EPFL in Lausanne, Switzerland. I then took a little break from academia and joined a startup working with e-learning for kids, but since earlier this year I'm back in Sweden, and I work as a data steward at NBIS. NBIS is the National Bioinformatics Infrastructure Sweden, a consortium between the major universities in Sweden, distributed over six different locations. We also make up the bioinformatics platform of SciLifeLab, and we have three main activities: a lot of analysis experts provide support and engage in different projects, both short- and long-term; we meet infrastructure needs by developing tools, both as a support service and in European and Nordic collaborations; and NBIS offers training and workshops within the area of bioinformatics for the life science community across Sweden. NBIS is also coordinating the Swedish ELIXIR node, where ELIXIR is the European infrastructure for bioinformatics, and we are a small but growing data management team working across all of these activities. Both NBIS and ELIXIR are committed to the principles of FAIR research data, meaning that we promote that data should be easily accessed, understood, exchanged and reused. The FAIR principles are a fundamental part of open science and describe some of the central guidelines for good data management, so I want to talk a little bit more about that first.
So FAIR stands for findable, accessible, interoperable and reusable, and for data to be useful for others it should be FAIR. In 2016, 15 principles were published, and they have since received worldwide recognition and been adopted by many large funding agencies and organizations, like the European Union, for example. These 15 principles map onto the four FAIR categories, and the idea is that by following the principles you ensure that your data is FAIR. This is a nice picture from the Australian National Data Service where you can see all the principles. I'm not going to go through all of them, but I want us to look at a few.

In the first row, to make your data findable, persistent identifiers are very useful. A persistent identifier (PID) is a unique identifier that should ideally be globally unique. An example of a PID is a DOI, and we heard that each version of each nf-core pipeline has its own DOI, so it's easy to find the right one. The principles also say that you should provide rich metadata. Metadata is the data about data, and the more metadata you can provide, the easier it will be to both find and reuse the data. All the principles that somehow touch upon metadata are labeled in yellow here, and you can see that the yellow squares show up in many places, so metadata is something that's really important.

For accessible, I want to point out that accessible is not the same as open. We want people to have access to the data if possible: as open as possible, but as closed or restricted as necessary. There are sometimes good reasons why data cannot be made open, such as privacy concerns or commercial interests, but even if it cannot be open, we want it to be possible to find it, and for people to know how to access it, how to ask permission to access it, and so on.
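To make these ideas concrete, here is a minimal sketch in Python of a machine-readable dataset record that carries a persistent identifier, rich metadata and an explicit usage license. It is purely illustrative: the field names are loosely inspired by DataCite-style metadata rather than any exact schema, and the DOI shown is a made-up placeholder.

```python
import json

# Illustrative dataset record; field names loosely follow DataCite-style
# metadata, not an exact schema. The DOI below is a placeholder.
record = {
    "identifier": "10.5281/zenodo.0000000",   # hypothetical DOI (a PID)
    "title": "RNA-seq of mouse liver, pilot study",
    "creators": ["Doe, Jane"],
    "publication_year": 2021,
    "resource_type": "Dataset",
    "license": "CC-BY-4.0",                   # explicit license -> Reusable
    "keywords": ["RNA-seq", "Mus musculus", "liver"],
}

def doi_url(doi: str) -> str:
    """A DOI resolves globally through the doi.org resolver."""
    return f"https://doi.org/{doi}"

# Serializing to JSON keeps the metadata machine-readable (Interoperable).
print(doi_url(record["identifier"]))
print(json.dumps(record, indent=2))
```

The point is simply that a record like this makes the data findable (globally unique PID), reusable (explicit license) and machine-actionable, which is what the yellow metadata principles keep coming back to.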
The last thing I want to point out here is usage licenses, because I think this is important and sometimes overlooked. Say you have found data, you have accessed it, and it's interoperable, so you could do a new analysis with it, for example. But if there's no license, you don't know what you're allowed to do with the data, and then you can't really reuse it anyway. So it's very important to also attach a license to your data.

When you look at all these principles, it's a little bit overwhelming: how can you make sure that your data is FAIR? There is a shortcut, or at least a way to get toward FAIRness relatively easily, and that is to publish your data in public, international repositories. Many large international repositories strive to be FAIR and will handle some of the things for you: they will make you provide domain-specific metadata following standards, they will take care of some of the technical things like access, maybe provide a persistent identifier, and so on. So submitting your data to one of the international repositories is a good start, and this is just a sample of them.

When we talk about data management, we also need to talk about what research data is. Research data is of course all the raw data files you produce when you do experiments, like images, sequences or other raw data files, but it's also the metadata from your lab books or from the instruments, it's code, it's processed data, it's legal and ethical permits. Basically, it's any information that you use in your research, and we want to adopt a broad definition of research data when we talk about data management, because we don't want to miss any aspects.

Research data typically goes through a life cycle, and here it's depicted as a very nice circle, with arrows going clockwise from one phase to the next. In my experience, at least, that is not the reality: the arrows will go back and forth and a little bit all over the place. But let's start at the top anyway. Usually a project starts with a planning phase, where you think about what data you need to use; maybe you can reuse existing data, your own or already published, or you go directly to the phase of data generation. When you have the data you need, you move on to study it and do some analysis. Maybe you collaborate with others, so you have some file sharing to do, and you have to store your data in the short term. When you're no longer actively working with your data, it moves into long-term storage, and hopefully it will then also reach the phase of data publishing and reuse.

At each phase of the life cycle, data needs to be handled in different ways, and this is where data management comes in. It's needed throughout the life cycle, but the needs might differ. Data management is really an umbrella term covering all aspects of the data life cycle, and to me at least, the purpose of data management is to make the research process as efficient as possible. It covers topics for dealing with data on a day-to-day basis, like organizing the data, structuring it, storing it, backing it up and so on, but also long-term issues of preserving and sharing your data. We should also be compliant and consider ethical and legal aspects. I'm sure all of you are already working with data management in one way or another, even if you might not think about it. One of the keys to good data management is planning ahead. It's a little bit of an investment: it takes some time at the beginning, but it will save a lot of time in the long run. There are quite a few things to think about, and to make sure that you actually cover all the necessary aspects, it's a good idea to make use of a data management plan. A data management plan is a revisable document that describes what researchers plan to do with both new and
existing data, and it should explain all the steps before, during and after a project. I want to come back to the first point, that it should be a revisable document. By that I mean that during the project you should go back to it, update it if new data is added to your project, and use it as a guide. For many people, the first time they create a data management plan is probably when it's mandated by someone, and this is becoming more and more common: funding agencies, institutions or maybe computing infrastructures ask you to provide one. People might see it as just another administrative burden, but if you actually use it, it can be very useful.

A data management plan is usually created by answering questions, and these questions might come from a template that someone prepared, or from a data management plan tool. If it's used correctly and asks you relevant questions, it helps you be prepared during the research project, and if you provide detailed, practically focused answers and come back to it from time to time, it's going to be very useful. So I'm going to show you an example of what a data management plan can entail. There are usually different sections, and it looks a little bit different depending on which stakeholder asks for the data management plan, or who prepared it, but these are some general chapters. First there is the description of data: what type of data will be generated, will you use existing data, and from which samples is this data generated? Then there are questions about documentation and data quality: how will analyses be documented, do you have a versioning strategy, which metadata standards will be used? There are also questions about storage and backup: what backup systems are in place, what is the estimated size of the data, do you need to restrict access to the data? Which leads us to the next section, about legal and ethical aspects. This will be particularly important if you work with sensitive human data, but it might also ask whether you have authorizations to work with animals, or whether there are agreements in place with other stakeholders, and so on. There is also a chapter about accessibility and long-term storage, where you answer questions about how data will be shared and where it will be stored in the long term.

As I mentioned, there are tools that can help you create the data management plan, and I have two examples here: one is DMPonline, which is commonly used, and the one we use at SciLifeLab is Data Stewardship Wizard. What is great with these tools is that we, from the data steward's side, can go in and create customized templates or questionnaires for the researchers, where we select the questions they should answer, and we can also provide integrated guidelines and information to make it much easier for the researchers to answer the questions. Because the difficult part here is not to find questions to answer, but to find answers to the questions. Part of my job is to provide these guidelines in different places; one is this tool, and there are also other forums for providing guidelines. Earlier this year I was involved in creating some minimal guidelines about data management for the nf-core community. They can be found under the usage menu on the nf-core website, so have a look there if you haven't done so already. They focus a little bit on the aspects of computational resources and data sharing.

Data sharing is, as I already said, a good way to start making your data FAIR. It's also something that I'm passionate about and like to promote, and therefore I'm going to talk a little bit more about it right now. It's also a great entry point into working more actively with data management. When I prepared this presentation, I wasn't sure where to start
when talking about data sharing, so I actually did a little data sharing exercise that I thought you could do later on, and we will go through what I think would be useful steps. First, select a data set or project that you're working on; this could also be a pipeline, where you think about the output from your favorite nf-core pipeline and what data types it would generate. Once you know which data set you're working with and want to publish, you identify a suitable repository, and then you check the repository guidelines for different things. File formats: which formats do they want you to submit, both for raw data and for processed data? Then you look at the metadata they require or recommend: do they also recommend vocabularies for providing values for their metadata fields? Do they have any guidelines about quality measures, or any file validation tools? It's also good to look at licensing guidelines: do you have to choose a license yourself, or is there a license that is automatically attached if you use that repository?

When you have found all the guidelines, you need to prepare your data to conform to them. For example, maybe you have aligned BAM files but they want you to provide FASTQ files, and then you have to convert them; or for the processed data, maybe you want to publish some sort of matrix file which is in Excel format, but you need to export it to a CSV file, and so on. Once your data is ready and prepared, it's time to submit it. I actually included this step in the exercise because I think you should do it even if it's just an exercise, and in general I recommend publishing your data as early as possible in the project. There are several benefits of doing so. It's much easier to provide metadata when you're closer to the data: you remember what you did; if you have collaborators, they might still be around; and if someone else produced your data, they might not have provided you with all the metadata needed, so you need to go back to them and ask, for example, exactly which machine was used to generate the data. It's really nice if this is done early, and not when the final version of your paper is accepted, because then it can get a little bit stressful and delay your publication, especially if the repository is a curated one, in which case it might take several weeks before the data can be made public. On top of this, you also get a backup in a different location, which is something we recommend anyway, and publishing your data will increase your citations; if you do it early, people have more time to cite you. A counterargument might be: how can I submit my data this early, when I haven't analyzed it? But you can submit your raw data, and processed data can be linked later. Another concern might be that you could get scooped; if this is a concern for you, most repositories actually allow you to use an embargo if necessary. So maybe not this week, but at some point when you feel motivated, I think you should go through these steps.

To help you get started, I want to show you a flow chart of how you can think when you want to identify a suitable repository. It's also in the data management guidelines on the nf-core website, where there are also links to wizards that help you choose, and other, more specific guidelines. The first thing to ask yourself is whether the data contains personal or sensitive information, because then you need to choose a controlled-access repository; one example of this is the European Genome-phenome Archive, the EGA. If not, we want to choose a discipline-specific repository, if one exists, simply because it's easier for people to find the data there. One thing to think about,
though, is that the repository also needs to be sustainable, so it might be worthwhile avoiding very small repositories maintained by a single research group or a small consortium, because it's quite likely that they will run out of funding or stop being maintained for another reason, and your data will disappear, at least from the public domain. So discipline-specific repositories are highly recommended, but if there are none, then maybe your institution has a repository you can use, and your last resort is to go to a general repository like Figshare or Zenodo.

Now let's say we have picked our repository, and for this example we will look at the ENA. We want to look at their guidelines, and we're not going to go through all of them, but I have an example here of the metadata fields for submitting raw reads to the ENA. There are different metadata objects or data objects that you have to submit to the ENA, but this is just one example. They have listed eleven fields: what each field is called, a short description, and for some of them a list of permitted values. The reason for that is that they don't want you to just write whatever you happen to call the platform or the instrument you used, but to actually stick to a list of values, to make the data interoperable and easier to read by machines. For the library source, for example, you can choose between GENOMIC, GENOMIC SINGLE CELL, TRANSCRIPTOMIC, and so on. This is good, because knowing which values you can choose from makes it easier to fill in the metadata fields.

So I hope that at least some of you will go through this data sharing exercise later on. I also included a slide from my colleague Niclas, who has longer lectures about life science data management that can be found at this link; you can listen to them for more in-depth information, but I wanted to share the take-home messages here. Consider doing a data management plan for your project; I hope I have already explained that enough. Plan for submitting raw data to public repositories as early as possible. Organize project metadata from the start: for example, if you have to provide the species in your metadata, it's good not to just call it "mouse", but to use the scientific name, or whatever vocabulary the repository you're going to submit to requires. Pick a well-thought-through file and folder structure for your computational analyses; setting up a structure and then sticking to it really helps you in the long run, and there are many ways of doing this and many recommendations out there. Strive for reproducibility, both for data and code. Be aware that there are legal aspects to processing human data; you can read more about that and find links on the nf-core website. And ask for help if you need it: you are not the only person thinking about these problems, and different institutions might handle this in different ways. It might be the library that deals with data management questions, it can be a data office, or maybe it's even the IT department, so look around and ask for help if you need it. You're also very welcome to reach out to me and to my colleagues.

I want to thank you for listening, and before we move on to the Q&A session, I hope I've inspired you a little bit to dig deeper into data management. I want to share a final motivational quote by Rachel Ainsworth, an astrophysicist at the University of Manchester: "Your primary collaborator is yourself six months from now, and your past self doesn't answer emails." I think that is quite to the point. Thank you.

Thank you very much, Elin, for this interesting talk. As a bioinformatics core facility here at QBiC,
this topic really touches us closely. We do need to share our data with public repositories and do data management plans, not only for our own data but also for collaborators, so it was very interesting to see your take on that. We do have one question from the audience, and it goes more in the direction of what we can do with nf-core pipelines. Phil asks: are there any data management plan tools or formats that nf-core pipelines could somehow adhere to?

Sorry, I lost my connection somehow. Oh, don't worry about it; if you want, I can repeat the question. Yeah, yeah. So Phil asks if there are any data management plan tools or formats that the nf-core pipelines could adhere to, to make it easier to follow the FAIR principles.

Maybe not tools, but one thing that I was thinking about: I think it would be really awesome if for each pipeline there were a recommendation about repositories, for example, and then there could also be recommendations about metadata, standards, file formats and so on. It's probably already the case that the pipelines produce the right file formats, I would guess, but that is something one could look at.

Yes, thank you, that's something that we will really consider for our pipelines.
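The repository-selection flow chart described in the talk can be sketched as a small decision function. This is a simplification for illustration only: the example repositories are the ones named in the talk, and a real choice would also weigh data type, repository sustainability and legal constraints.

```python
def suggest_repository(sensitive_data: bool,
                       discipline_repo_exists: bool,
                       institutional_repo_exists: bool) -> str:
    """Sketch of the repository-selection flow from the talk (simplified)."""
    if sensitive_data:
        # Personal or sensitive human data needs controlled access.
        return "controlled-access repository, e.g. EGA"
    if discipline_repo_exists:
        # Preferred: easiest for others to find, if sustainably maintained.
        return "discipline-specific repository, e.g. ENA"
    if institutional_repo_exists:
        return "institutional repository"
    # Last resort: a general-purpose repository.
    return "general repository, e.g. Zenodo or Figshare"

print(suggest_repository(False, True, False))
# prints "discipline-specific repository, e.g. ENA"
```

Encoding the flow like this also hints at what per-pipeline recommendations could look like: for a given nf-core pipeline's outputs, the discipline-specific branch could name a concrete repository along with its expected file formats and metadata.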