Welcome to this presentation on sharing multi-omics data on sleep regulation. My name is Nastasia. I'm a PhD student. I will discuss how to put the FAIR principles into action. I will present our recently published paper explaining how we shared a multi-omics project on sleep regulation in a reproducible and realistic manner, with its challenges and its solutions.

Open data and open sharing are words often used in the scientific community, and they are requirements of many funding agencies. However, it is less well known how, concretely, to share complex data sets so that they can be reused by others, and how much work such a task requires. Indeed, it is not enough to make available only the raw data or the derived data; sharing should also include the code to perform the analysis and the metadata, which are descriptive information about the data. For example, for RNA sequencing data, the raw data are reads, which are sequences of nucleotides. The derived data can be gene counts, which quantify the level of expression of each gene, and the code details all the steps to transform the raw data into derived data. The metadata could contain the date of the experiment, whether the samples were pooled or sequenced separately, and the sequencer used. All these elements help in the analysis and the interpretation. For example, knowing which sequencer was used gives information about the accuracy of the measurement.

To guide researchers in sharing their data, the FAIR principles were created. This initiative helps to consider the different aspects, as a sort of theoretical checklist for data sharing. The data and metadata should be findable, which means they are indexed, you can search for them with keywords, and they have a universal identifier, for example a DOI, to refer to. They should be accessible: the data can be retrieved either freely or through an authorization procedure if necessary, and the metadata are available even if the data are not.
They should be interoperable: they are in a format and a language that are broadly used, and they follow standard vocabularies where these exist. And they should be reusable, which has to do with the legal license that clarifies who can use the data and whether or not it can be modified or used for commercial applications.

However, these principles are not always easy to apply to a project with multiple types of data. This is particularly the case in the field of systems genetics, where projects have to combine data from different biological levels to gain a better understanding of the molecular pathways underlying certain phenotypes or behaviors. Especially in this field, data sharing and reuse are of great interest, because these large-scale assays are costly and complementary datasets can often be combined for exploratory analysis, even if they were initially generated for different purposes.

In our case, the project aims to better understand the regulation of sleep with its genetic component, and in particular the effect of sleep deprivation as an environmental perturbation. This work was done in the BXD mouse panel, which is derived from two parental strains, Black 6 and D2, which are commonly used by many labs and which have different phenotypes, including their sleep behavior. These strains were crossed and then inbred for many generations to fix the genotypes. So each BXD line is a different mixture of the parental strains, and these lines allow different groups to do research on replicates with the same genetic background.

These mice were characterized at different biological levels. We have more than 10,000 genetic markers, and most of them are SNPs. We have the expression levels of thousands of genes in the liver and in the cortex. For the proteome, we have predicted protein variation based on the genetic variation: if there is a SNP in a coding region, then you can predict the variation that will be in the protein. This is qualitative data. We don't have protein levels.
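To make the last point concrete, here is a minimal Python sketch of how a SNP in a coding region can be turned into a predicted protein change. It is a toy illustration, not the pipeline used in the project: the function name `classify_coding_snp` and the example sequence are made up, and it assumes the coding sequence is given on the coding strand, in frame, and read with the standard genetic code.

```python
# Standard genetic code, built from the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_coding_snp(cds, position, alt_base):
    """Predict the protein-level effect of a single-base change.

    cds       -- in-frame coding sequence (hypothetical example input)
    position  -- 0-based index of the changed base within cds
    alt_base  -- the alternative base at that position
    Returns (reference amino acid, alternative amino acid, effect label).
    """
    cds = cds.upper()
    alt_base = alt_base.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    if not 0 <= position < usable_length:
        raise ValueError("position is outside the complete codons of the CDS")
    if alt_base not in BASES:
        raise ValueError("alt_base must be one of A, C, G, T")

    codon_start = position - position % 3
    offset = position - codon_start
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "nonsense"
    elif ref_aa == "*":
        effect = "stop_lost"
    else:
        effect = "missense"
    return ref_aa, alt_aa, effect


if __name__ == "__main__":
    example_cds = "ATGGAAGTT"  # made-up sequence: Met-Glu-Val
    print(classify_coding_snp(example_cds, 4, "T"))  # GAA -> GTA
    print(classify_coding_snp(example_cds, 5, "G"))  # GAA -> GAG
    print(classify_coding_snp(example_cds, 3, "T"))  # GAA -> TAA
```

The output is a label per variant, which matches the qualitative nature of the data described above: it says what kind of change is predicted in the protein, not how much protein is present.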
For the metabolome, we have levels of targeted metabolites in the blood. And we have more than 300 phenotypes, which are derived from the locomotor activity and the electrical brain and muscle recordings, and which describe, for example, the amount of time spent asleep or awake, the sleep quality, or the sleep distribution over 24 hours.

To share data, they have to be deposited somewhere. The problem with multi-omics data is that the best database is different for each omics. Specialized repositories, such as GEO for the RNA data, are preferable because they have specific tools to search, explore and retrieve data. This requires that the data follow a precise format before sharing. However, not all omics have a dedicated database. Generalist repositories, such as Figshare, allow more freedom in the type of data and formatting. But this makes it harder to explore and compare data between datasets, because they may capture a similar type of data without necessarily using the same vocabularies.

For gene names, this is relatively well organized. The Gene Ontology can be used, and different packages allow you to convert gene names to identifiers. But, for example, to standardize the description of our sleep phenotypes, the Human Phenotype Ontology was the closest ontology available. However, it was made more to describe disease-linked phenotypes in humans, whereas we have healthy phenotypes in mice. Therefore, the correspondence is not one-to-one, and it had to be done manually. But this situation is not unique to sleep. It can be found wherever there is no consortium, actually.

To really test whether the analyses that we were doing were reproducible, we needed to actually try to rerun them. When I started my PhD, I was new to this project, and my first role was to try to rerun some of the steps of the previous analysis, to check if I could find the information that I needed and if I could verify that I would reach the same conclusions.
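As an illustration of what such a rerun check can look like in practice, here is a small Python sketch that compares a rerun result table with the original one. It is only a sketch under assumptions: the function name `compare_results`, the gene labels and the values are hypothetical, and each result is assumed to be stored as a mapping from an identifier to a numeric value. A tolerance is used because rerunning an analysis with other software versions may change values at the level of floating-point noise.

```python
import math


def compare_results(original, rerun, rel_tol=1e-6, abs_tol=1e-12):
    """Compare two {identifier: numeric value} tables.

    Returns a report with identifiers missing from the rerun, identifiers
    that only appear in the rerun, and identifiers whose values differ by
    more than the given tolerances.
    """
    original_ids = set(original)
    rerun_ids = set(rerun)
    report = {
        "missing": sorted(original_ids - rerun_ids),
        "extra": sorted(rerun_ids - original_ids),
        "changed": [],
    }
    for identifier in sorted(original_ids & rerun_ids):
        old_value = original[identifier]
        new_value = rerun[identifier]
        if not math.isclose(old_value, new_value, rel_tol=rel_tol, abs_tol=abs_tol):
            report["changed"].append((identifier, old_value, new_value))
    return report


def is_reproduced(report):
    """True when the rerun matches the original within tolerance."""
    return not (report["missing"] or report["extra"] or report["changed"])


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    original = {"geneA": 1.0, "geneB": 2.0, "geneC": 3.0}
    rerun = {"geneA": 1.0, "geneB": 2.5, "geneD": 4.0}
    report = compare_results(original, rerun)
    print(report)
    print("reproduced:", is_reproduced(report))
```

A report like this does not replace reading the code, but it points to where the rerun and the original disagree, which is where the documentation or the code most likely needs attention.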
And this process took about two months, which includes the time to understand precisely what the code is doing. Then, together with Maxim, we assessed how robust the results were to updating the mouse reference genome that we used, going from the version mm9, which was available at the beginning of the project, to mm10, which is more recent. On the graph, this is going from the yellow to the red, and these are the steps that are affected by this change. And this took about one week. Overall, this process helped to improve the documentation and to catch and correct small mistakes in the code.

Our general strategy to organize the analysis was to use three layers, or levels of abstraction. The low layer is to reduce and transform the raw data. The intermediate layer is for single-omics analysis and different types of correlations. The high layer is for combining all the different omics. This helps to structure the code and the analysis. And the different layers have different characteristics. For example, low-layer steps are often computationally demanding and are coded in different programming languages depending on the assays. However, analyses in the intermediate and the high layers can more easily be performed in a single programming language. We chose R because it is open source and widely used, with lots of packages available for data analysis. The R Markdown format can be used to generate reports combining code, plots and comments in one single document. You can easily show a big data table, and you can choose to show or hide the code depending on your needs. It's also possible to use the sessionInfo function to keep track of which packages were used and in which versions.

It is also important to document the workflow. And we have this general scheme that orients us.
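One way to picture such a scheme is as a list of steps, each recorded with the file that performs it, its layer, and its inputs and outputs. The Python sketch below is an assumption-laden illustration, not the project's actual scheme: every step name, script name and file name in it is hypothetical. It also shows how such a description can answer the question raised by the reference genome update, namely which steps are affected when one input changes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str                 # what the step does
    script: str               # file that performs the step
    layer: str                # "low", "intermediate" or "high"
    inputs: Tuple[str, ...]   # files or resources the step reads
    outputs: Tuple[str, ...]  # files the step writes


# Hypothetical workflow, for illustration only.
WORKFLOW = [
    Step("align_reads", "align.sh", "low",
         ("reads.fastq", "reference_genome"), ("alignments.bam",)),
    Step("count_genes", "count.sh", "low",
         ("alignments.bam",), ("gene_counts.tsv",)),
    Step("differential_expression", "differential_expression.Rmd", "intermediate",
         ("gene_counts.tsv", "sample_metadata.tsv"), ("de_genes.tsv",)),
    Step("phenotype_summary", "phenotypes.Rmd", "intermediate",
         ("sleep_recordings",), ("phenotypes.tsv",)),
    Step("integrate_omics", "integration.Rmd", "high",
         ("de_genes.tsv", "phenotypes.tsv"), ("integrated_results.tsv",)),
]


def affected_steps(workflow: List[Step], changed_input: str) -> List[str]:
    """Names of the steps that depend, directly or indirectly, on changed_input.

    The result follows the order in which the steps are listed in workflow.
    """
    changed_files = {changed_input}
    affected = set()
    progress = True
    while progress:  # repeat until no new step is reached
        progress = False
        for step in workflow:
            if step.name not in affected and changed_files.intersection(step.inputs):
                affected.add(step.name)
                changed_files.update(step.outputs)
                progress = True
    return [step.name for step in workflow if step.name in affected]


if __name__ == "__main__":
    print(affected_steps(WORKFLOW, "reference_genome"))
    print(affected_steps(WORKFLOW, "sleep_recordings"))
```

Keeping the description as data rather than only as a drawing means the same record can be used to render a diagram, to look up where a file comes from, and to list what needs rerunning after a change.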
So if we take, for example, the differential expression analysis, we can see that this step is performed by an R Markdown file, and we can find general information about what this file is doing, where it is located, and which are the inputs and outputs of this analysis. And these files are themselves also described, and so on. We also have a website to mine the data. The idea is that anyone can explore the results, even without having programming skills.

To conclude, I have explained the challenges and some solutions to concretely put the FAIR principles into action for a multi-omics project. The entire sharing strategy of the project was spread over more than one year. And I think it is important to highlight that making the data more reproducible is a complex process that takes time and effort and cannot be achieved with guidelines alone. However, higher reproducibility improves the quality of the science and opens the door to more reusability, and thus contributes to the valorization of the data.

I want to thank all the people who helped in this process, in particular Ioannis, Paul and Maxine. And thank you for listening, and I wish you all the best for your data sharing.