Okay, good afternoon. So this session today, the rest of today, will sit in between the scientific methods sessions: it will be about data management and reproducibility. I will start with the first part and talk a little bit about issues of research data management, and also present some solutions that we at G-Node, the German Neuroinformatics Node, have developed. Later, Michael Denker will present and focus more on reproducibility, data analysis workflows, and the application of the different tools and methods.

As background on why we are talking about this, why this course is organised by the German Neuroinformatics Node, and why there is such a thing as a German Neuroinformatics Node in the first place, a little bit of history. G-Node is part of the INCF network. The INCF is the International Neuroinformatics Coordinating Facility, an international organisation that came out of the OECD Global Science Forum activities of the late 90s and early 2000s. There were working groups on neuroinformatics which recommended establishing an organisation to coordinate neuroinformatics globally, and this led to the INCF being established in 2005. The INCF is a globally acting organisation, but it has a central secretariat in Stockholm and, at the moment, 17 national nodes, that is, countries that have joined the organisation. Germany was one of the founding members. The idea is that globally coordinated efforts are needed to develop neuroinformatics, because it doesn't make sense for every country, with its own funding agencies and support, to work on these methods in isolation. Science is a global effort, essentially, and the development of tools, methods, and standards for science should also happen at a global scale. Neuroscience was seen as a field very much in need of such global coordination. This is reflected in the mission of the INCF: to be a global, independent coordinator of neuroinformatics developments.

As mentioned, Germany was a founding member, and at the same time as the INCF was established, the German government started a very large funding initiative on computational neuroscience. It was called the National Network for Computational Neuroscience; now it's known as the Bernstein Network, the result of more than ten years of funding for computational neuroscience. Computational neuroscience here is meant not just, as you might assume, as funding theorists who make models; the initiative specifically funded the exchange between experimentalists and theorists, because it was realised, of course, that progress in neuroscience depends very much on experimentalists and the people who build models and do data analysis working together and joining forces. This was one of the core goals of this network and of this funding. And now we have a fairly large network with more than 200 research groups all over Germany. We also have several infrastructure facilities in this network, and the German Neuroinformatics Node is one of these. It's hosted by the LMU in Munich, and its goal is to develop solutions that make data management and data exchange easier for scientists. The Bernstein Network has also created the Bernstein Conference, an annual conference; probably many of you have heard about it, maybe even attended. It is by now the largest European conference on computational neuroscience.
So this is the setting we are in with G-Node: the link between this national Bernstein Network and the global INCF community. So what do we do, what is our goal? The German node was specifically established to target cellular and systems neurophysiology, because this field was seen as very complex: there is such a large diversity of methods, data structures, and science questions that it's fairly hard for scientists to collaborate. If you think of an experimentalist and a theorist, for example, who want to exchange data and work on models together, it's very hard to explain to the person who was not involved in the experiment what the data mean, how to read them, and how to analyze them. So this was one of the main targets, also because the number of groups working on neurophysiology in the Bernstein Network is very large.

Over the years we've been working on solutions at several levels. We've created data conversion tools to make it easier for scientists to read data in different formats or convert data between formats. We've created methods for data and metadata management to help scientists in the lab organize their data and metadata. We provide data sharing services; you've already seen the GIN service, and the local installation you used is basically a replica of what we are running in Munich, where everyone can sign up and share their data. We also provide custom solutions for data exchange, and we're involved in teaching and training activities. One of these is where you are right now, the Advanced Neural Data Analysis course. We've also been running a shorter, one-week course on neural data analysis, which provided a low-barrier entry for neuroscientists and PhD students to start working on their code and improving their data analysis in an easy-going way, while getting in touch with the code. You are doing the same here, but at a somewhat more advanced level, since this course is longer, involves further exercises and tasks, and of course goes more in depth. Those are our courses on data analysis. We've also been running, for several years now, the Advanced Scientific Programming in Python school, a summer school that introduces young scientists to good scientific programming practices. This is not a course for learning Python; it's a course for learning to code better in Python, because you learn more about the background of the language and of computers as such, so you can write code that is more maintainable and can be shared, so others can use it. This year it will be held in September in Italy, and the application period is currently open, so if you're interested, please apply now. We also run, at irregular intervals, workshops and tutorials on data management.

So let's get back to the issue of research data management; this is what I want to talk about in more depth now, one of the focus areas we have at G-Node. The question is: why do we care about data management? I've already alluded to this a little, but why should you as a scientist care about data management? One reason might be that research data is typically at the core of science.
Science depends on being grounded in data. We typically focus on the publication as the output of science, but there are other notions that even sort of demote the scientific publication to something which is just an advertisement for the actual science, for the scholarship. And what is the actual scholarship? It's in this quote paraphrasing Jon Claerbout, from a paper by Buckheit and Donoho. It refers to the field of computational science, and it says that the actual scholarship is the complete software development environment and the complete set of instructions: all the code that went into a scientific result. You can easily translate this into experimental science when you add the data. So what is the substance of your science, the scholarship? It's not the publication that comes out of it, but the entire experiment: the data acquisition and data analysis, including all the code that you used, because this is something that, in principle, other scientists should be able to inspect to see how you got to your results and your conclusions. So this is one argument for why it is useful and important to think about data management: you are supposed to be able to have your data available.

This also relates to something that has been the topic of many talks in the past years, the so-called replication crisis or reproducibility crisis. It has been found, starting with papers in psychology, but this also applies to other fields, and I'm sure to neuroscience as well, that many published studies are not really reproducible. And one problem is that when you find, or suspect, that something is not reproducible, you would like to go and check how the original scientists got to their results, right? And this requires having the data, the analysis code, and so forth all available.

Another reason why you may want to think about data management in your lab is that there is a lot of external incentive and external pressure coming up; it exists already to some degree, but more is coming. The funders require it, and have been for several years. I don't know how many of you have already written a grant proposal, but when your supervisors write their grant proposals, they typically have to include a few sentences about a data management plan for the project. So far this is not really strictly checked or reviewed, but since the entire scientific landscape is becoming more open, this will certainly be checked more rigorously in the future, and it's good to prepare for that. For example, at the EU level, the Horizon 2020 programme started a couple of years ago with a so-called open data pilot, where scientists receiving EU grants were encouraged to make their data open and provide a plan for how they would do that. Over the course of three or four years, this has now changed. At first it was an opt-in, and it was explicitly stated that this would not affect the review of your proposal; the review would be on a purely scientific basis.
So it was completely voluntary to say: okay, I want to make my data open after the study. But now it's the other way around, now it's an opt-out. The EU now requires, as the default, that you provide a plan for how to make your data open, and if you don't want to, you have to opt out. So things are turning; we are arriving at a general understanding that science should be open, and that this is in fact the core of science, that everything is open for discussion. Besides the funders, there are also several publishers and journals now that require you to provide your data with a publication, with the same idea in mind: when you have a publication, people should be able to check in more detail how you arrived at the results you are presenting. At the moment this is typically also not very heavily checked or reviewed. Maybe one of you has already published a paper where it was required to also provide the data; I'm pretty sure none of you has gotten any comments from the reviewers about the data. They typically review the paper, and nobody looks at the data. But I'm pretty sure this will also change in the coming years, as the publishers learn; they also have to recruit reviewers for looking at the data. Some journals have started trial runs to do that, and I think this is evolving, and it will become more and more mandatory to provide not just some data, something that is there, but data that are accessible and readable by other scientists. So we're going into a phase where we will see increasing openness, and this will also be reflected in the expectations of other scientists, who will come and say: okay, here's your publication, where are your data? We should prepare for that.

And the last argument, which might even be the best one, is selfishness. I think it's in our own interest to have good data management in our lab, because it saves us time. There are studies, or rather informal studies, but it's repeatedly quoted that scientists spend a lot of time on what's called data munging: trying to find data, trying to read data, even within their own lab. Trying to read the data of students after they've left the lab, that's a typical situation: a student does a PhD, maybe the data even get published, then the student of course leaves, and maybe two years later there's an interest in going back to this data, and it's very hard for anyone in the lab to access it. So it takes a lot of time. One estimate is that scientists typically spend about 60% of their scientific working time on data munging. I'm not sure that number is really correct, but it is a substantial part; I think everyone knows that from their own experience. And the question is: wouldn't it be really nice to spend only half of that time, and have the other half free for really thinking about the science? That should be the goal, and I think this should be the largest incentive for us to have a good and consistent organization of data in our lab, so that we make life easier for ourselves. So why is it so hard? Why do we have to spend so much time on data munging?
One reason, at least in the neurosciences, is that we typically have very complex experiments: the research questions are complex, the data are complex, and this complexity alone makes it hard to organize the data, to find data, to document data, and so forth. We also very often rely on collaborations with others, so we have to make sure that data is available and can be exchanged. And the volume of the data we acquire is increasing; probably all of you are experiencing this with all the new recording methods. Data sets are growing massively, which also creates a problem in data handling. All of this together contributes to the situation that when we want to access our data, data that we are not immediately working with but generated in the past and want to go back to, it typically takes time and effort. Having a good data organization in the lab reduces this effort, and since the effort is reduced, we can get to our data more easily, which enhances reproducibility, because we can better understand what we have been doing, and it also facilitates the ability to share data with others. Or to share data with ourselves.

So that's the first situation when we talk about data sharing: it doesn't only mean sharing with others, we also share with ourselves. That's the most important condition we should try to meet, and it starts in the lab. In the lab we want to make sure all data is recorded and available, and all the information is recorded. Typically in labs there's a lot of hidden knowledge that kind of everybody knows and that is passed on from PhD student to PhD student. If we want to be able to understand our data, say, a few years in the future, it's important that we also document this. Getting into the practice of documenting all the information is really important. Yes? Well, hidden knowledge is the knowledge that's in the lab but is only passed on verbally: when a PhD student comes in and learns how to handle the setups and everything, how to do experiments, there's a lot of information that they learn by being instructed, for example, but at no point is that information written down anywhere or provided with the final data set. But it may be important. For example: we always set this amplifier to these settings, which will be important for the data analysis afterwards, to know what kind of filtering was applied. Because the setup is always the same, it's not documented. This is what I mean by hidden knowledge.

The next step after sharing with yourself is sharing with a collaborator, where you have an interaction, where you can provide a lot of clarification and answer questions and so forth, and where you select specific data sets. And the next level after that is sharing with the world, which is exactly the situation when you're done with your study: you're publishing your paper, and maybe you want, or are asked, to provide the data with your publication, or you decide to do a data publication. This is also a possibility now: some journals accept an article type called a data publication, where you do not describe a new scientific finding, but describe your data set in detail and make it available for others to use and reuse. This is the situation where you want to share with people you don't even know.
Someone else might take your data set and reuse it, and here it is specifically challenging, because there will be no interaction. You want to make sure that the scientists who access your data have a chance to understand it, because otherwise they might do things with the data that are not appropriate. It should also be in your interest that the data are readable at all, and not some artifact sitting there, some files that nobody can understand and nobody will ever be able to reuse. In that case you have technically shared your data, it is there, but semantically nobody can make sense of it. This is a situation we want to avoid, and the way to start, I think, is to improve the situation in our own lab. This will benefit ourselves, because it makes it easier to do our research, but it will also make it easier to provide data in a form that is understandable for everyone else. And by the way, we've already had questions; if there are more questions, just ask me anytime.

Regarding the situation of sharing with the world, I want to briefly introduce a set of guidelines called the FAIR principles, which some of you may have heard about already. They came up a few years ago and have been taken up a lot by scientists, by providers, and also by funders. The FAIR principles are an attempt to provide guidelines for how to make data better reusable in the end. This means, first of all, the data need to be findable. Once they are found, they need to be accessible, meaning you have a way of actually getting to the data. They need to be interoperable, that is, there needs to be a realistic chance to use them and work with them. And they have to be reusable, that is, legally reusable but also understandable. These guidelines are thought of as a coarse framework of goals to achieve. Primarily, I would say, the situation that was considered was that the data are already there, and the question was how to provide services that achieve these goals. But a lot of this depends on the actual data that the scientists provide being in a form that enables meeting all these criteria: for example, all the documentation, all the metadata information about the experiment and how the data were acquired. This is all important to be able to reuse the data in the end. On the more general side, the recommendation for making data findable is, for example, to use globally unique and persistent identifiers. You all probably know the DOI; this is a persistent digital identifier that is used to identify resources, typically papers or data sets. The goal would also be to use machine-readable descriptions of the data, so that automated services are able to find your data and know what's in your data.
Accessibility means that it must be clear how the data can be accessed. It doesn't necessarily mean that the data are fully open, but there must be a way of finding out how you can actually access them. Interoperability again puts some demands on the scientists: what kinds of formats do you use to store and provide your data, what kinds of terms do you use for annotating them? And reusability also has the aspect of legal reusability, that is, licenses. When you provide data, you should always attach a license so that others know how they can use the data.

This is another topic to consider. At G-Node we provide data publishing services, and users and scientists who want to make a data set available are required to choose a license. We often get the questions: what license should I choose, what does it mean, what difference does it make? Software developers, the open source community for example, have long experience with licenses, and there are a lot of licenses for software. For research data this is not so common, and there are not so many different licenses that are appropriate for research data. One very common set are the Creative Commons licenses; there's a website where you can get information on them. They specify different levels of reuse that you allow, or different conditions for reusing your data. The most open one is the public domain dedication. Strictly speaking this is not a license; it's a waiving of all the rights that you have as the author or creator. You are waiving your copyright, and everyone can do whatever they want with what you provide. Then there's the CC BY license, which requires that someone who uses your data gives you credit: they cite you, or cite the source where they got the data. This is a way of getting credit for your data, for your work; you don't have to just dump it somewhere, you can always ask that people who make use of it also acknowledge you. And then there are further constraints you can add, where you say, for example: any work done with this, if it's published, must be published under the same conditions; or, I don't want my data to be used in any commercial application. As you see here, for a start, CC BY is the license most scientists typically choose, because it is very easy and also very open.

Another thing I want to mention at this point, where we're talking about more general services and general issues in data management, is something to look out for: a funding program by the German government that is currently coming up. The government has also recognized that improving the data management situation is really important for making progress in science, not only in neuroscience but across the entire scientific landscape. This is a funding program that specifically targets issues of data management, data provision, standardization, and so forth. It is intended to create a network of experts, bringing scientists together with infrastructure providers, to work together and create better solutions for data management, connecting all the pillars of the science system: the universities, the research institutes, the IT centers, and so forth. The idea is, of course, that this is not isolated within Germany; all the groups that come up with proposals must also make clear how they connect to international initiatives.
This is of course a very important requirement. There's also a lot of infrastructure at the European level. The European Open Science Cloud has officially started, providing a lot of services that most scientists don't even know about, let alone know how to use. The idea is really to work closely with the scientists, bring them together with service providers, and of course to connect internationally and provide solutions that are interoperable with international solutions. We're currently preparing a proposal for a specific neuroscience research data infrastructure group that will interact with all the relevant other groups in this program, but will also make sure that solutions developed here are developed in coordination and in compliance with international standards and international solutions.

So let's get back to data management in the lab. I wanted to make clear that I think the best way to start is really in our own labs, trying to improve the situation for ourselves. And what is the situation? Well, for example: you did an experiment a year ago and have worked on something else since, and now you want to go back to your experiment, because maybe you read a paper that presented a result that might touch on what you did back then. But in order to compare, you need to find out what the exact stimulus parameters were that you used, what kind of frequencies. The question is: how do you find out? Say this is a parameter that was not really the control parameter of your study, but a side parameter. Did you even record it? Where could it be? There are lots of places where this can typically be in an experimental lab: it could be in the source code of the software that you used, it could be in a file that was written automatically, it could be part of a file name, because you have some convention for how you name your files, it could be in your notes, in the spreadsheet that you use, or in notes you wrote by hand, or somewhere else, right? So how do you find out? Do you have a way to find out? And how can you make sure that, in the future, all the information that is important, or might be important, for a data set is available and stays available?

This brings us to the overall notion of metadata. Besides the actual data that you are recording from your recording equipment, there is a lot of metadata. Metadata means data about data. So metadata are also data, and partly very important data, because, for example, it's very important to know what stimuli you were using while you were recording. A lot of metadata can in principle be recorded in a way that is persistent, so that you can be sure you can access it even years from now. Besides the recorded data, in order to understand the recorded data, which are typically just numbers in a binary file, you also need to know how to translate these numbers into the scientific quantity that you were recording. Was it a voltage, was it a current? You need conversion factors, you need units, you need sampling rates. These are very obvious; otherwise you cannot do anything with the data. But the question is: is it always recorded? Then you have a lot of data that you could call hard metadata, because it is data that you can record as, say, key-value pairs, where you say the temperature was so-and-so many degrees, and the sampling rate was so-and-so many samples per second.
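To make this concrete, here is a minimal sketch (my own illustration, not a G-Node tool) of capturing such hard metadata as key-value pairs with explicit units in a machine-readable file. All names, values, and the file naming scheme are invented for the example:

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical example: record "hard" metadata as key-value pairs,
# each value stored together with its unit, right after a session.
session = {
    "experimenter": "jdoe",
    "date": date.today().isoformat(),
    "sampling_rate": {"value": 20000, "unit": "Hz"},
    "amplifier_highpass": {"value": 300, "unit": "Hz"},
    "temperature": {"value": 23.5, "unit": "degC"},
}

# A consistent naming scheme keeps files findable later:
# <date>_<experimenter>_metadata.json
outfile = Path(f"{session['date']}_{session['experimenter']}_metadata.json")
outfile.write_text(json.dumps(session, indent=2))
```

Run after every session, a script like this guarantees that the same keys are always present and spelled the same way, which is exactly the consistency that is so hard to maintain by hand.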
All of this you could record in some electronic, machine-readable form, so that you can have tools that make use of it and make sure that it's all collected. You can automate a lot of that, and automation is a way of ensuring that the information is there. And then there is only a small part of the metadata which is very hard to really put down, because it requires explaining a lot: why did you do the study, what was the question you were after? This is typically what you provide in a publication. But all these hard parameters you could in principle save and store and make sure that they are available, and if we try to do that, we will make our lives easier, and the lives of our colleague scientists.

So how could we do that? It's not something that comes for free; we have to work a little bit on it, we have to prepare. We have to think about how to organize our metadata collection. We should try to be clear, to write down or record things unambiguously, so that when we think of ourselves two years from now: if I read this name for this parameter, would I know what it is, would I understand it? We should try to save all the information, even things we now think are not important, because in the future, or for other purposes, for other reuses, they might be important. And probably the hardest thing is to be consistent, to store things in the same way. You may have seen the typical situation in the little cartoon a few slides ago. By the way, these illustrations were made by Lyuba Zehl, who is at the Forschungszentrum Jülich; she did them as part of her PhD thesis, and I use them with her permission. So for example, you have some files belonging to your dataset, and the naming conventions are different for each file: every day, after every experiment, you make up some names. That's something that will be very hard to understand in the future. If we try to be consistent, and that requires thinking ahead, even before we start the experiment, about how we are going to do this, that already goes a long way in ensuring that the data stays available and accessible for us.

So how can we make this easier for ourselves? Planning ahead: just as much as we think about the experiment and the experimental question, how we design a task for the animal, for example, we should also think about what kind of data we will acquire, what kind of metadata, how we collect them, and how we store them. And once we have a good idea about that, automating things is very important, because in the end that saves us time, and ideally we don't even have to think about it anymore. It also avoids errors. For automating, it's of course good if you know how to write scripts, in whatever scripting language you prefer; acquiring some skills that help you automate things on your computer is very helpful. And then, of course, you can use tools, and you should use conventions that other scientists use, and tools that are available to help with this.

The situation you mention is really common: every lab has its conventions, but the conventions differ between labs. That's a typical situation, and the question is whether there is an organization taking care of that. Well, in principle, one of the goals of the INCF is to promote common conventions and standards.
Currently they have started an activity where they identify tools and best practices in the community and endorse them, meaning that they are reviewed, and then the INCF says: this is an endorsed standard, it's good to use it, you can rely on it. This will also lead scientists to use the same tools and conventions more and more. But it is a real problem, and there is a long way to go, simply because the space of problems to solve with tools and standards is so large. But we're trying to make progress.

So I'd like to introduce a couple of the tools and formats we've been working on in the past years, always having in mind how to improve this situation: the typical situation that we have recorded data on the one hand, and on the other hand a lot of metadata, a lot of information that is acquired during an experiment, typically from different sources: different machines and different computers that might produce metadata, other measurement devices, some recordings you do by hand. So you have all this different information in different formats, and the question is how you bring them together. Typically the metadata are also remote from the data: you have the data in some files and the metadata in other files. How do you relate them? Can you get to a situation where, ideally, you link the metadata to the recorded data, so that you are able to find data based on the recording conditions, for example the stimulation conditions and so forth?

Over the years we've been trying to come up with tools and solutions for these kinds of problems. The challenge we face, especially in neurophysiology, is that the field is so diverse: we have a lot of different techniques, methods, and preparations; different species are used, different experimental paradigms, and so forth. And there are different file formats: every data acquisition system produces data in a different file format, and there are no common standards to converge to, which makes the situation really hard. So our approach a few years ago was to start with a well-defined data model for the actual recorded data, to have a representation that is accessible and understandable for scientists, and to have very flexible methods for data annotation, because there we have the large variety of approaches: every experiment creates different types of metadata, and therefore data annotation methods must be flexible and adaptable to the situation in the lab or in a specific experiment. And then we try to come up with solutions for integrating the metadata with the data, so that you can specifically link metadata with data and know which metadata belong to which recordings.

Starting with the representation of the neurophysiology data: we were involved in the development of the Neo model, the Neo Python package; many of you may have heard of it. It's a package that defines objects for electrophysiology data, which gives you the ability to represent whatever ephys data you have in the same form, to work with electrophysiology data in a common way. It defines objects for analog signals, like voltage traces, and for spike trains, and provides ways of grouping them, according to time or space or electrode. So this is a very useful, and deliberately minimal, way of achieving a common representation. It's fairly popular; it has been adopted by many labs, and it's easy to adopt. It is Python, yes, but if you are working in Python, this is basically the de facto standard for representing electrophysiological data.
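As a small illustration of this common representation (my own sketch using the public Neo and quantities APIs; the names and numbers are invented), this is how a voltage trace and a spike train look as Neo objects:

```python
import numpy as np
import quantities as pq
from neo import Block, Segment, AnalogSignal, SpikeTrain

# Containers for grouping: a Block holds Segments (e.g. trials).
block = Block(name="session-001")
segment = Segment(name="trial-01")
block.segments.append(segment)

# An analog signal: the values carry units and a sampling rate.
signal = AnalogSignal(np.random.randn(20000, 1) * pq.mV,
                      sampling_rate=20 * pq.kHz,
                      name="membrane potential")
segment.analogsignals.append(signal)

# A spike train: spike times plus the end of the observation window.
spikes = SpikeTrain([12.3, 45.1, 378.9] * pq.ms, t_stop=1000 * pq.ms)
segment.spiketrains.append(spikes)
```

In practice you would rarely build these objects by hand; one of the I/O modules mentioned next reads them directly from the files of your acquisition system.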
Another component of the Neo package is its set of I/O modules, which enable reading a lot of different electrophysiology formats. This gives you the ability to read, with the Neo I/O modules, whatever files your recording or data acquisition system produces, and that makes Neo very versatile and very useful as a way to represent the recorded data.

Now, we still haven't touched the metadata. We have a way here to represent the recorded data, but what about the metadata? The metadata, as I mentioned, have to be handled in a very flexible, very general way. So we developed a very general model and format for recording metadata in terms of key-value pairs: odML. All you need to provide is the name of the property, for example sampling rate, and then the value together with the unit, so that you have a scientific representation: you don't just have a number, you know exactly what the quantity is. The content you want to store is separate from the format. odML provides a format where you can organize these kinds of key-value pairs hierarchically: you can group them, which gives you an organization that reflects the structure of the information you want to store. Typically you would group the different metadata items for your recording equipment under one section, those for the stimulation under another section, and so forth, in a way that is suitable for your experiment. But you will nevertheless end up with a file that is machine-readable, that can be read by your scripts or by tools, to process the metadata and analyze them meaningfully.

So the idea is a fairly simple format, and besides the actual file format, we also provide terminologies. The idea is that you put in everything that you know, and then, at the time when something new becomes known, you can use the tools for this odML format to add further items at some place in your metadata tree; it's a hierarchical tree. So you wouldn't have a dummy entry or anything that could be misleading, which I think, if I get it right, is what you are referring to, but you would add it afterwards. This is something that can be done dynamically. And the idea is also that, since the metadata are acquired in different formats, you have a common format where you can merge all this information together in a unified way. The format per se does not impose constraints of that kind; what some of the tools do is what you describe, that you basically have a form and you are supposed to fill it out. So this is at the tool level; there you need to make sure that nobody fills out just something, to have it done, without it necessarily being correct.

And by the way, besides the URL, another way to reference these kinds of tools is by the so-called RRID. This is a fairly new initiative to provide persistent identifiers for resources, for digital resources that you use in your science, like software tools, but also substances or whatever else. There is already a huge database of those, and the idea is that you use these identifiers in your methods section, so that there can be automated services that find literature based on the methods used.
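Coming back to odML, here is a rough sketch of how such a hierarchical metadata document is built with the odml Python library (section, property, and value names are invented, and the exact API may differ slightly between odml versions):

```python
import odml

# A document holds a tree of sections; each section holds
# key-value properties, each with an optional unit.
doc = odml.Document(author="J. Doe", version="1.0")

recording = odml.Section(name="Recording", type="recording")
doc.append(recording)
recording.append(odml.Property(name="SamplingRate",
                               values=20000, unit="Hz"))

stimulus = odml.Section(name="Stimulus", type="stimulus")
doc.append(stimulus)
stimulus.append(odml.Property(name="Frequencies",
                              values=[5, 10, 20], unit="Hz"))

# Serialized to XML (JSON and YAML backends also exist),
# machine-readable for scripts and the odML tools alike.
odml.save(doc, "session.odml")
```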
And the other way around, it enables tracking the use of methods through their mention in publications. For machines, this is a much less ambiguous way of referring to something than a free-text description of what you were using. So I would encourage everyone who is writing a paper to go through the methods section, check for each tool you use whether there is an RRID for it, and then just enter it.

Okay. So I said that the method is really flexible in terms of what you can store, because you need to be able to adapt it to your specific experiment. But at the same time we want to have conventions, we want to use what others are using, so we also provide terminologies that you can use. The format itself doesn't enforce any of that, but if you are looking for a term, for how to name some item you want to store, you can make use of these terminologies; and the other way around, if you are using specific terms for your metadata, you can also contribute them to the terminologies. The idea is to have a kind of community collection of terms that are used for certain metadata in neurophysiology. And as I mentioned, there are several tools: there are libraries for different languages, and we have a visual editor for when you have a file and want to make additional changes or additions manually. My view with odML is that a lot of the metadata collection should happen automatically, by scripts that collect the metadata from the different sources and bring them into the common format; but when you also want to make some edits manually, you can use the visual editor, or a very nice spreadsheet front-end that turns odML files into an Excel sheet, where you can make edits and then convert back to the odML file. This was written by people at the Forschungszentrum Jülich.

So how would that look? If you are interested in a case study of using odML to collect all these different sources of metadata in a lab, in a monkey experiment, I can refer you to this paper by Lyuba Zehl. She is the one who did the research, and she has been using odML to collect the metadata in an experiment where recordings were done from monkey visual and motor cortex during a behavioral task. You see here a schematic of the data and metadata flow in this experiment, and you see that there are a lot of traces for the different parts of the metadata; at the end they all have to be collected together in order to enable meaningful analysis. The paper describes some of the procedures and methods she came up with to facilitate that. What this then provides is not only data sets which are annotated, in the sense that you have all the relevant metadata, but it also facilitates the data analysis and the subsequent work with the data. For example, what they could do is analyze just the metadata, without having to go to the data, to understand how far along the project is: how did the animal perform, how many trials were correct, and so forth. This is just an illustration of the statistics of performed sessions, or even trials over a week, different trials with different conditions; you get a very nice overview automatically. You can write a script that does that, based on the collected metadata, and this helps you keep track of what is going on and how far along you are with the data collection.
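As a toy version of such a metadata-only analysis (my own sketch; it assumes a hypothetical odML document in which each trial was recorded as a section of type "trial" with an "Outcome" property):

```python
import odml

# Compute session statistics purely from the metadata tree,
# without opening any of the recorded data files.
doc = odml.load("session.odml")

n_total = 0
n_correct = 0
for sec in doc.itersections():
    if sec.type == "trial":          # hypothetical per-trial sections
        n_total += 1
        outcome = sec.properties["Outcome"].values[0]
        n_correct += (outcome == "correct")

print(f"{n_correct} of {n_total} trials correct")
```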
So, I've presented the Neo data model for representing electrophysiology data, and I've presented a format for metadata collection. How do we bring these together? We've been thinking about this and came up with a data format that enables storing data in files in a way where you have a data model for the data, inspired by the Neo data model, but you also have the other metadata, and you can make meaningful links between them. This lets you keep both metadata and data together, and go from the metadata to the data and from the data to the metadata. For the data part, we created a data model that is similar to the Neo model in that it describes generic scientific data: you have an n-dimensional data array, you have information about the dimensionality, about the units, about what kind of data it is, and you also have a way to link between data arrays, so you can establish relations between the data. All of this can be specified in this data format, which makes it a format where the data are discoverable and understandable, and where the structure in the file reflects the structure of the underlying experimental data. The format, which is called NIX, is defined by an internal data model and an API, and it writes the data to a file in the HDF5 format, which is a common standard for scientific data. So any tool that can read HDF5 files can be used to read data in the NIX format, and we also provide libraries for different languages.

The benefit of having data and metadata in a common format, with meaningful links, is that you can find the specific data traces that correspond to certain metadata. If you say: I want all the spike trains from a certain unit that I recorded under certain conditions, you can find the data arrays in your file that correspond to that, because you have the links between the data and the metadata. And you can also go the other way around: if you find a data trace that looks interesting, what were the experimental conditions for it? This is something that not only you, who created the data set, can do, but in principle someone else too, because the links are there and everyone can use them. It facilitates data sharing, because it reduces the information that has to be given in addition to the actual data in order to make the data reusable and understandable.

Here's an example which shows how you can make use of this. In a NIX file you can store not only the raw data, you can also store derived data. In this example, the data were high-density MEA recordings from a retinal preparation: electrical signals were recorded on a multi-electrode array, and spikes were detected. This was a study where a spike sorting method was developed, so you have the raw voltage data and you also have the detected spikes. And then, for example, with a few lines of code you can pick out one channel, plot the recorded traces, plot the detected spikes with their amplitudes, and also plot, in the same figure, the stimulus, a light stimulus that went on and off periodically. You can do this with a few lines of code because all the data are in the same common format, in the same file.
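As a rough sketch of what those few lines might look like with the nixio Python library (the file name and array names are invented, and API details may vary between nixio versions):

```python
import matplotlib.pyplot as plt
import nixio

# Hypothetical layout: raw trace and detected spike times of one
# channel stored side by side in the same NIX file.
nf = nixio.File.open("retina_mea.nix", nixio.FileMode.ReadOnly)
block = nf.blocks["session-001"]
voltage = block.data_arrays["V-ch042"]
spike_times = block.data_arrays["spikes-ch042"]

# The sampled dimension knows label, unit, and sampling interval,
# so the time axis and plot labels come from the file itself.
dim = voltage.dimensions[0]
t = dim.axis(len(voltage))

plt.plot(t, voltage[:], lw=0.5)
plt.scatter(spike_times[:], [0] * len(spike_times), c="r", marker="|")
plt.xlabel(f"{dim.label} ({dim.unit})")
plt.ylabel(f"{voltage.name} ({voltage.unit})")
plt.show()
nf.close()
```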
For the NIX format, for example, we also have a data viewer. It makes use of the fact that a NIX file stores not just numbers but really scientific data, with quantities, so the viewer can automatically make a meaningful scientific plot with appropriate labels on the axes; and you can of course also browse the metadata and then find the data corresponding to certain metadata. So this is for manual exploration of data sets. You can also use the NIX libraries in software; for example, the relacs data acquisition software uses the NIX libraries to write its data, so the data, and the metadata that this data acquisition system knows about, are written together in a common format. With this approach, with the metadata format and the data format, which are compatible because NIX uses the odML-style metadata model, we have a way to support data and metadata collection in the lab, basically from data acquisition to data sharing, with access from many different languages. It also makes it easy to provide the data later, without much further annotation effort, at the point when you want to share them.

As the last point, I want to talk a little bit about another service we have been developing over the past years: the GIN service, for data collaboration and data sharing. You have already used it; you have used the local installation, and we are also running a server in Munich where people can use this for data management and data sharing. The idea is to provide something that supports data organization, working with the data, and version control of data. For that, the system uses Git and git-annex. The web interface is a little bit inspired by GitHub; in fact, it is based on an open-source GitHub-like implementation (Gogs), to which we added support for git-annex in order to version-control large files, which are hard to version with Git alone. It provides secure access and data sharing, so you can share your data with colleagues, for example, who are not in your lab; but at the same time, as you have seen, you can also have a local installation in the lab and just use it for your lab.

The idea is to support data management along the entire data life cycle, basically from acquisition to publication. If you use the system, you can start version-controlling your data sets right from the beginning. Typically you start an experiment with raw data, but then you begin preprocessing, for example spike sorting, and this creates new data files. With this system you can keep track of how your data change: you keep track of what file was added when, you can work together and keep track of who made which changes, and you avoid data transfers, because large volumes are only transferred when necessary. Then, when you go on to data analysis, again you want to work together on it and keep track of who does what and when. You can go back to previous versions: if you want to compare, or if something doesn't work in your data analysis, you can go back to the state of yesterday or two days ago and find out what the difference is. All these things that software developers are used to when using GitHub, and probably many of you are also using GitHub, you can basically do with your data using GIN. And if you are using the public server, you also have a way to access your data from outside the lab. Often, when you are at a conference or visiting another lab, it's hard to get to the data at home; but with the GIN server you have an access point through the web, from wherever you are.
You can again give remote access to collaborators who are not in the lab, while still keeping track of who makes changes. And then, when it comes to publication, when you have the data set with your final results and you want to publish it, you can easily make the data set available, make it public, and you can also get a DOI for persistent identification and citation.

So, to summarize the points I wanted to make: efficient data management is something we all benefit from ourselves, so it should be in our own interest to have efficient data management in the lab. It helps us keep track of what is going on, all the data, the metadata, and the analyses, and it reduces the risk of losing important information, which could make a data set unusable. For that, open and machine-readable formats are very useful, because they help to automate things. An integrated organization of data and metadata is also very useful, because it provides the information about which metadata belong to which data; it helps in data analysis, it helps in achieving reproducibility, and of course it also helps in sharing the data. And with that, I would like to thank the developers at G-Node who have contributed to the tools and methods I have presented, and I would also like to thank the many people who over the years have contributed and collaborated with us.