So, welcome to "Helping Others Use Your Data: Lessons from the Field of Neuroscience." My name is Crystal Steltenpohl; I'm the training and education manager here at the Center for Open Science, and I'm joined by some folks from the Max Planck Institute in Frankfurt and also from the Neuroimaging Data Model project. I think maybe we can start with a couple of introductions. For folks that are interested, feel free to use the chat to talk amongst yourselves. If you have a question for the panelists, please use the Q&A function; I'll help to monitor that. But first, let's do some quick introductions. Maybe the Max Planck folks can start.

Maybe I can go first. Hello everyone, thanks for joining in for this very nice seminar today, and thanks, Crystal, for introducing and starting us off. I'm Praveen, working at the Max Planck Institute for Empirical Aesthetics in Frankfurt, Germany. I'm a research software engineer involved in a project that tries to build a structured metadata framework called Genymede. You're going to be hearing more about Genymede very shortly.

Hello everyone. First I'd like to thank COS for the invitation and for allowing us to present our work on metadata templates here. I'm Zafan Chau, and I'm doing my PhD at the moment at the Max Planck Institute for Empirical Aesthetics under the supervision of Lucia Melloni. My scientific project focuses on understanding how to distill consciousness from cognitive functions. At the same time, my PhD is sponsored and supported by the FAIR Research Workflow project, in which we aim to develop research workflows that are not only FAIR for data but also fair for scientists and fair for science. Today, with the metadata part, we want to show our work on the FAIRness-for-data side. Thank you.

And the NIDM folks?
I'm Karl Helmer; I'm at Massachusetts General Hospital. My group builds data management tools for large-scale studies, but also just for sharing of data, and I'm currently working on the ontology part of NIDM.

Hello, I'm David Keator, a research professor at the University of California doing neuroimaging research and neuroinformatics research. For a bit of history: Karl and I were involved in the Biomedical Informatics Research Network (BIRN) project around 2005. Out of that came a working group on data sharing and how to model derived data, and lots of people joined in, from MIT, Satrajit Ghosh, JB Poline, and so forth. Over the years, the Neuroimaging Data Model is what came out of that working group. So nice to meet you. Thank you.

Awesome, thanks everyone. To get us started, in case there are folks in the audience who are new to the concept of metadata, I was curious if you would be willing to share: what is metadata, why is it important or beneficial, and what are some important considerations when thinking about metadata?

I can go. The usual definition of metadata is "data about data," so it's normally the terms that are used to describe the data you're collecting or acquiring in an experiment. The goal is really to have a complete enough set of metadata that the data can be reused in a reasonable fashion, so that you can tell, as a data consumer, that this data is the data you're interested in, and that it can potentially be combined with other data you might have collected elsewhere.

Just to chime in there: I think one of the simplest forms of metadata is what people call data dictionaries. That's just simple definitions of your variables: what they mean, what their data types are, what the ranges are. That's the simplest form of metadata that we'd like to encourage everybody to collect and share.
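A data dictionary of the kind Karl describes can be as simple as a small JSON document. The variables and fields below are hypothetical examples, not from any particular standard:

```python
import json

# A minimal, hypothetical data dictionary: one entry per variable,
# giving a definition, data type, units, and allowed range or values.
data_dictionary = {
    "age": {
        "description": "Participant age at time of scan",
        "type": "integer",
        "units": "years",
        "range": [18, 90],
    },
    "handedness": {
        "description": "Self-reported handedness",
        "type": "string",
        "allowed_values": ["left", "right", "ambidextrous"],
    },
}

# Stored alongside the data as a plain-text file, this travels with
# the dataset and stays both human- and machine-readable.
print(json.dumps(data_dictionary, indent=2))
```

Even something this small answers the basic reuse questions: what a column means, what type it is, and what values are legitimate.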
And one of the main uses of metadata is to actually help reuse the data. I think this is one of the key ideas in the acronym FAIR that is going around a lot, which stands for findable, accessible, interoperable, and reusable. All of these are very important, not just for data but for any research object, and metadata helps make research objects FAIR.

If it's my turn: just as the three of you have described very beautifully, having metadata makes the data more findable. If you have the metadata of a dataset, then, just as with a dictionary, it becomes very easy to find the word you are looking for.

I don't know if we have more time on this question, because in my view we can go a little farther than simply describing the dataset you're sharing. If you did processing on the data, the metadata might include: what software did I use, what version was it, what platform was it on? All these details become exceedingly important. I can't see who's out there, but depending on who you are: if you're doing brain structural volume research and you run the same software on different platforms with different operating systems, you get slightly different answers. So if we're going to reuse the derived structural volumes, we need to know what platform you ran them on, down to that level of detail. That's what most of us are shooting for in the tools that we develop: helping you, as users, capture this information so that it can be shared with the data itself.

That's a really good point, actually: capture the context in which the data was created. You're getting a bit into the next question I was going to ask: what need does
metadata fill within the neuroscience community? It sounds like under some circumstances it can be really important: the context in which the data was collected, and what software was used to process it. Are there other needs that metadata helps to fill within the neuroscience community?

Well, as you say, when I give talks about this topic, one of the things I usually start out with is a folder labeled with the word "Thursday." The idea there is that the postdoc has left, you've got a folder that has the results in it, and it's labeled "Thursday," so you really don't have any idea in what context this was generated, what point they were trying to make, what hypothesis they were testing. So metadata is really something you're trying to collect and store with the data, so that you have some sense, as Dave was saying, of the context in which derived data was generated, but also of the acquisition parameters with which the data was acquired. You can go even further and tag the data with metadata about what it is supposed to be used for, what hypothesis it is testing, because often that can be an important piece of information so that people can understand why the experimental paradigm was what it was.

Okay, I always have things to say. In the context of neuroscience there are also atlases. If you're doing functional neuroimaging, you might identify regions of difference between whatever the groups in your comparison are, and usually an atlas is used to localize where those functional signals are. Those atlases are all different, as you might have found in your own work. So understanding what atlas was used to identify the area, or, if you're identifying cell types, what atlas was used to derive the cell type, is important metadata when we want to combine data. This is like when you want to search:
say we go out and want to search for and get all structural volumes of the amygdala. Well, the amygdala may be defined differently in different atlases, so once we get the data back we want to see the metadata to know what atlases were used here and there, and we might be able to put in some covariates for that, or transform things from one atlas to another.

So it sounds like it's also really helpful for standardizing across different types of studies, or at least acknowledging where there may be differences in definitions.

Yeah, and that's a good point; that's something I was talking about earlier. There are many neuroimaging datasets available at this point, which is what we work with, but if you want to combine those datasets into an even larger dataset, so that you have more statistical power for whatever you're looking at, it's important that you know, for example, the acquisition parameters, how the data was collected, so that you can figure out: is this data combinable, or am I just introducing more variance into the final result that I get?

Going slightly away from the neuroscience community, I think it totally depends on the particular scientific community what they think should go along with a dataset. If they find themselves using one particular aspect of the data, then that specific distinguishing marker should be included with the data as metadata. So it's extremely important that the community itself identifies what nicely describes its own datasets and provides that as the accompanying metadata.

And that leads to another important point; this is what I work on: coming up with a common language, a set of terms that the community has agreed mean a certain thing and can be used to describe the data. A simple example in neuroimaging: someone's done registration on the data. Well, have you done linear
registration or nonlinear registration? Twelve degrees of freedom, six degrees of freedom? What does "registration" mean in this context? So it's very important to have a set of terms that people have agreed on: this is what the registration process is, and here are all the different kinds of registration and what we're calling them; a common set of terms to describe these things, so that when you look at the metadata someone has generated, you actually understand what those terms mean and how that person used them. That's really the heart of how metadata can be used and be useful: everyone understands that it has the same meaning.

Yeah, so... oh, go ahead, Zafan.

Yeah, I think Karl's point speaks beautifully to the importance of ontology in making a metadata template. Everybody could have a different definition of registration, but once the community adopts a metadata template that is based on a standard, existing, well-defined ontology, that will make the metadata template much more accessible, because it's based on community consensus.

So, to verify my understanding, if I were to take an example from education research: people might define the population that was involved in a study by saying something like "middle school." It would be important to define what middle school means, right? Because in some areas it could be grades six through eight, in some areas grades seven through eight, and in some regions it might not exist as a concept at all. So it's important for everybody within that community to understand: if we're using this term, this is what we mean.

Yeah. One simple... oh, sorry, go ahead. One simple example is age. In China, when we talk about age, it's calculated differently: for example, in the West my age is 27, but in China my age is 28,
because people calculate age in different ways.

Yeah, that's a great point, in that it doesn't mean everyone has to use the same definition; it's just that you need to know what definition was used for that particular data. So if he says "I'm 28," you need to know what calendar was used, or how that number was arrived at.

Thank you. To talk a little bit about each of the projects you've been working on: Zafan, I was wondering if you could tell us a bit about the cognitive neuroscience metadata template, and how this tool maximizes the FAIRness, that is, the findability, accessibility, interoperability, and reusability, of neuroimaging data. You can answer verbally, or if there's something you'd like to share you can; it's up to you, but maybe just a few minutes about that template.

Well, yeah. The main point of my metadata template is slightly different from the metadata template Praveen and I are putting together; my metadata template is mainly dedicated to enhancing the findability and discoverability of datasets. Right now, a scientist might have, not necessarily in neuroimaging, say, a hypothesis that boys are taller than girls. How can I actually find the data out there that enables me to test this hypothesis? To test it, I need a measure of height and also a measure of gender; a dataset of this kind. But without a metadata template, the scientist can only brainstorm in their mind: which papers out there actually recorded these two pieces of information and also opened their data? Then the scientist has to go through the papers they brainstormed, and sometimes a paper that should have the dataset, but
it's not openly accessible; the others are mostly not openly accessible, and some are accessible, but this process is very inefficient. So I'm trying to put out a template in which we describe the experiments and also the state of the dataset itself, so that based on this metadata template people can easily search for the conditions under which an experiment was conducted, what characteristics the data has, what the data standard is, and so on.

Thank you. Praveen, can you tell us a bit about Genymede and how this framework can help researchers who are working with large open datasets?

Sure. Genymede started off initially when we tried to create an electronic lab notebook for neuroimaging. As Karl mentioned previously, lots of neuroimaging data, and all of the necessary metadata along with that data, were written down in standard lab books, with a lot of scribbling all around the pages related to various things, lots of noise here and there. When you had to find or search for a particular scan or a particular run, you had to go through a lot of pages of logbooks from years ago. The initial approach was to try to digitize this, and as we were trying to digitize it, it became important to gather not just the metadata related to the measurement itself but a lot of information related to the context of the measurement: what project was this data measured as part of, and who were the people involved? There are always the researchers who are involved, and there are always the subjects. When we're talking about researchers: were they right-handed or left-handed? Could this have an impact on the measurement? Maybe, maybe not, but once we start to collect large amounts of data, it helps to have metadata like this, which may answer future scientific questions at some point in the future.
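Praveen's idea of capturing the project, people, and subject context in a structured, machine-readable record rather than a lab book can be sketched roughly as follows. This is a hypothetical illustration, not Genymede's actual schema; all field names here are invented:

```python
# A simplified sketch of a structured measurement-context record.
# Field names are illustrative only, not Genymede's real data model.
from dataclasses import dataclass, field, asdict

@dataclass
class Person:
    role: str        # e.g. "researcher" or "subject"
    handedness: str  # contextual detail that may matter for future questions

@dataclass
class MeasurementContext:
    project: str
    ethics_approval: str
    people: list = field(default_factory=list)

ctx = MeasurementContext(
    project="Pilot auditory MEG study",
    ethics_approval="local ethics committee, protocol 2021-07",
    people=[Person("researcher", "right"), Person("subject", "left")],
)

# Converting to a plain dict makes the record serializable and searchable,
# unlike scribbles in a paper logbook.
record = asdict(ctx)
print(record["project"])
```

The point of the structure is that a question like "find all measurements where the subject was left-handed" becomes a trivial filter instead of a trawl through years of logbooks.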
So: generic neuro metadata descriptors, that's what Genymede stands for. It essentially tries to capture all of this context of a neuroimaging experiment, starting from a very high level, the project: what was the project, who was the ethics committee involved, who were the people involved, down to a very low level of describing things in space and time. If there is a measurement going on in the lab, what were the devices present in the lab? Typically the main measurement device, the MR or the CT or the MEG machine, is what is noted and registered, but there may be other devices out there that are interfering with or causing noise in this particular device, and typically in a neuroimaging experiment there are lots and lots of devices scattered about the room for generating stimuli, getting responses back, tracking the movement of the eyes, and so on. Each of these devices may have an impact on the resulting data, so it becomes very important to try to systematically capture all of this information in a very structured manner, and that's Genymede's approach. When we're talking about devices, it's not just the device itself but also how the devices are connected to each other. This brings us to the wiring-diagram part of it: if we have ten devices in the room, we would also like to capture in a systematic way how the devices were connected during this particular experiment. So we've covered the context, of sorts: the project, the people, the research subjects, the devices. Then finally we come to gathering information about what was actually happening in the experiment itself: what was presented on the screen, what responses were provided, was it auditory, was it visual, and how do we capture all of this? In this process we are trying to reuse as many terms and ontologies as are out there, but wherever
we find missing pieces, Genymede provides a placeholder. So, for a new researcher starting off in neuroimaging, say a PhD student performing their first neuroimaging experiment, Genymede would tell them: okay, you want to make your dataset findable, accessible, interoperable, and reusable? Start off with this. There's a list of metadata fields you can start filling in, which captures, or tries to capture, everything around the experiment. And that pretty much summarizes Genymede.

Awesome, thank you. And Karl and Dave, can you tell us a bit about NIDM and how it helps researchers find and reuse datasets?

Yeah, so Karl and I are going to tag-team this. I'm going to start and then hand it off to Karl, if that's okay, and I wouldn't mind sharing a couple of slides; it might give people some context and something to look at. Can you see the slide? Okay. So, the basics of NIDM. NIDM was built off of the PROV specifications, a family of documents used to describe provenance in any domain, really, and it's built on the Resource Description Framework (RDF) and semantic web technologies. It allows you to make statements about data; that's basically what the semantic web and PROV allow you to do. You make statements about data: this is that; Bill's age is 24; something like that. And if you use the right vocabularies, then everything sort of self-documents, if you will, or provides you with metadata about what's going on, because you can dereference all the terms used in these RDF documents; they dereference to landing pages or websites where the description is. The Neuroimaging Data Model was set up as a model using these core technologies to describe neuroimaging data, in the service of being findable, retrievable, and reusable. The Neuroimaging Data Model has a simple hierarchy: it's just got a project, a session, and these acquisition entities and activities, which allow you to essentially put together very
complex graphs of information about your data and about derived data. Oh, sorry, one more thing before we get to that: you can then take these RDF documents and serialize them into text in a variety of formats. There's JSON-LD, the linked-data version of JSON, which allows you to take the keys in JSON and link them together with properties and relationships and so forth. So RDF is a very powerful thing, starting to be used by Google and others to facilitate search on the web, and we decided: hey, can you use this for neuroimaging? But what one needs in order to use this for neuroimaging are complex vocabularies and proper data dictionaries, so that we know what data is available, what metadata is there, what the variables are, and so forth. These efforts over ten years have been funded by a variety of groups, including the NIMH, the Wellcome Trust, and ReproNim. I saw David Kennedy on the list of people attending; ReproNim, the Reproducibility in Neuroimaging grant, is one of the centers and one of the best places to go for reproducibility tools and techniques in neuroimaging. So what we did was basically start down this path: we tried to create the NIDM data model and tried to model some information. In the neuroimaging community there's the Brain Imaging Data Structure (BIDS) standard, which organizes your data, I have a picture here, in a defined directory structure with file-naming conventions and places where some metadata would sit. That was very useful for the neuroimaging community for sharing data, because now you know where to find the structural MRI for subject 21 in a directory when somebody sends you a dataset. But there's not a lot of metadata built into the BIDS standard, and you couldn't search across BIDS datasets. So what we said was: well, NIDM is good for searching. RDF
documents have a variety of search engines and query formats, like SPARQL and whatnot; you can do distributed search very easily with RDF. So what we'll do is take this BIDS structure, model it with the NIDM data model, and put a little document, this RDF document, at the root level. Then we can search across these datasets using vocabularies and relationships. I'm not going to get into the terminologies and ontologies relating to NIDM; I'll let Karl talk about that, because it's largely his work. But I just want to give a shout-out to the NIDM-Terms grant, funded through the BRAIN Initiative and NIMH, to create such vocabularies and ontologies for neuroimaging data. Because what we found when we went out to the ontologies and terminologies that were out there was that either the terms weren't properly or well defined, they just had a description but no units or no extra information, or the terms we needed were just not available. So that's where Karl comes in with his expertise, helping us in this NIDM grant to develop these neuroimaging terminologies so that we could use them within NIDM, so that we could convert datasets to these metadata documents. These metadata documents then go live in a metadata or graph-based database that can be distributed, and now we can do searches across datasets: find me all datasets that measured age, measured depression, and had patients of a particular type, say those with a major depressive disorder diagnosis. At this level we don't care how you measured depression, and we don't care how you got your diagnosis; we just want to find out how many datasets are out there that also had structural neuroimaging. Let's get those datasets back. Now we can look at the detailed metadata, and this is why it's so important, and find out: how did they
measure depression? Oh, they used the BDI over here; oh, they used something else over there. How did you make your diagnosis of major depression? And then, for my experiment, is that sufficient, or do I want to reshuffle the subjects? Okay, that's what I had to say. I can't find the stop-sharing button... there it is. All right, thank you. Karl?

Yeah, so the other thing that Dave mentioned is that there are collections of terms out there, but sometimes there is no definition; that was another problem we ran into. The other thing that's important in the work we've been doing is really to make it possible for people to annotate their data to some arbitrary level of detail, with the flexibility so that if, say, the exact diagnosis method you used for depression is important for what you're doing, you have the ability to put that information in there. In some of the work I've been doing, we started out using PROV, which is a very simple ontology with which you can easily model workflows, and essentially that's a lot of what scientific data is: one giant workflow from acquisition to data use. So that seemed appropriate. The issue with it is that there are many more terms you would like to model that are not modeled within PROV, and it becomes a bit difficult to reuse some terms because they're not built on the same philosophical underpinning. So my current work is really to take the initial NIDM-Experiment terms that were based on PROV and move them over to the Basic Formal Ontology (BFO), build a scaffold based on BFO and the Ontology for Biomedical Investigations, and put our domain-specific terms on that scaffolding. Because of that, if you have this scaffolding that people can use, the community can contribute terms and grow the scaffold for their particular domain. That's essentially
what we're trying to do: allow other people in different domains to build on this as well.

Thank you. And we did get a request: if folks have links to their projects, would they be willing to post them in the chat, and make sure they go to everyone, not just the hosts and panelists? We can also include those in the email we send out after the webinar along with the recording; folks are really interested in digging deeper. So, thanks for sharing that. I'm curious, in all four of your projects, what are some of the challenges that exist, or that you've specifically run into, in coming up with field-wide schemas or templates for metadata?

If I may start first: actually, the biggest problem is that we want to make sure the whole metadata template is machine-readable and based on existing ontologies. For example, in cognitive neuroscience and cognitive psychology, when you say "backward masking," people might sometimes call it "sandwich masking," or just "masking." So when we want to describe experiments using, for example, the terminology of paradigms, we have to rely on existing ontologies that define what we mean by a given paradigm. But the problem is that currently, in my field, the ontologies out there are very sparse; there has been very little effort devoted to developing ontologies, and where there have been extensive, systematic efforts to develop an ontology, it was not well maintained afterwards. So the biggest problem is the ontology problem. As soon as we have a solid ontology that describes experiments thoroughly, that will make the work of building metadata templates much easier.

Yeah, thanks, Zafan. On those points: the maintenance of the ontology is extremely important, and that's one thing we found. There are ontologies that talk about imaging, but they
may not have been updated since the year 2000, and technology moves along, right? The way people understand cognitive tests changes, and the field moves on, so it's very important that the ontologies be maintained. But it's also important that we have some kind of understanding of the subdomains and subcultures within a domain that you're trying to model. Different subcultures, if I can use that term, have different ways of using terms that you have to be cognizant of before you can make something that everyone will use. The other thing is that there's a difference between what we sometimes call personal data elements, which is what you call things in your own lab, and common data elements, which is what everybody agrees we should call things. A lot of times you're building spreadsheets in the lab, and you've got your own little system for how things are named there, but then someone needs to do the mapping from those local variables to the ontology, to something that everyone can understand, and that can be a challenge for the researcher. So it's really important to make people aware of the ontologies so that they can use those terms right from the beginning; that would be optimal, because then everyone understands right off the bat what these things are. Or, at least, to build tools that make it possible for them to map their existing data to an ontology that everyone agrees on.

This has led to a couple of questions from our audience members. One is: are there official ontologies, and are these ontologies versioned at all?

There are official ontologies, and there are repositories of ontologies; there's one for all the ontologies that are based on the Basic Formal Ontology, for example. You can find those easily on the web, and these are versioned. One thing
to look for in reusing ontologies, or terms from ontologies, is whether they're being maintained. A lot of ontologies are moving to GitHub so that issues with terms, and discussions of terms, can be worked through in the open. The great thing for the user is that those discussions are then public, and you can see why certain decisions were made, which really helps you understand the ontology.

Using that team-science approach, exactly. We've also got a question; I'll just read it: "A lot of discussion, rightly, is about why metadata capture is hard, particularly for the end user. What are some current ideas about how this can be made easier?"

Right, I totally understand the question. I always like to think of putting in metadata as a spectrum: on one end is easy, human-friendly metadata; on the other end is machine-friendly metadata, which is and should be the ultimate goal, and the more we tend towards machine-friendly metadata, the harder it gets. A simple logbook is very human-friendly; people can just write things down. But creating a complex, ontology-based metadata structure makes it very difficult to put in the metadata, and scientists typically don't have the time, and don't think it important enough, to put in the time and effort to enter all of this metadata. So I would say it's important to find a sweet spot somewhere in between, and tools, particularly software, can help make this easier: abstract away all the complexities involved in creating RDF files, for example, or JSON-LD, and try to make it as easy as possible for people to put in the important metadata that describes the data, at least the minimum set. This is still work in progress; every community tries to approach it in a different way, and it is a difficult thing
to actually get all of this metadata in there.

Yeah, that's a great point. There are two things that are important there. One is that the field comes up with a minimum information standard, so that when datasets are submitted to a repository or attached to a paper, there's a set of metadata that accompanies them, and people generally agree that that's what you would need to understand the data. The other part, which my group works on too, is building tools that work as closely as possible with the acquisition computer, so that, as Praveen said, you acquire data and all the acquisition parameters just get written into a metadata file, and you don't actually have to do anything. We've found that having people not actually have to do anything is the best measure of success, and gets the greatest uptake in the community. Those two things I think are really important.

Yeah, I was going to say, we've spent a good amount of effort in the neuroimaging community building tools to help users do this. That's also why the Brain Imaging Data Structure, BIDS, was created, because that's the easy entry point for humans to organize their data in such a way that we can then build tools on top of it to let you annotate your metadata or help you create data dictionaries. Through ReproNim and the NIDM work, we've even got a grant under review on building user interfaces that let you create data dictionaries for your data, whether it be in comma-separated-values files; it doesn't have to be neuroimaging in that sense. It allows you to create data dictionaries that have sufficient properties to be reusable. And then there are other tools to take BIDS datasets and create NIDM documents, and to take data that comes out of analysis software, grab the metadata, and in our case put
it into NIDM. But another aspect of this is somehow getting the tool developers to write out sufficient metadata in a structured form that can then be reused by all the groups here on the panel and everybody else, whether you're using NIDM or some other metadata representation. So we've put some effort into how you get tool developers to do that. One way is to create libraries for them, to give them a working library they can incorporate in their software, but hopefully the software developers will eventually see the value of structured metadata and output it from their tools so that we can use it. So I think it's all about the tooling, and then it's about the carrot and the stick: how do you get users to actually do this? Users who have tried to reuse data in the past will do it, because they're like, yeah, I had a CSV file with columns I had no idea about, and I could never find the person who created it, so it was totally worthless. And for the younger students, through ReproNim I'm doing a lot of reproducibility training seminars, to show them from the get-go here's what we think is proper research: documenting your data sets and so forth so they are reusable in the future. So maybe for the next generation it will be much better than for the current one. Yeah, it's great that you mentioned CSV files, because we've got a couple of questions about software and how to capture metadata. Do you always need software to capture this, and what software would you recommend for capturing metadata? What are some of the best ways to do that so that you can then share it with others? Yeah, in our field that's a little bit of a complicated question. It helps if the field has a standard data format, and there are
two standard data formats in neuroimaging: the acquisition format, which is DICOM, and the format that everyone then converts the DICOMs into to do the processing, which is NIfTI. So it helps in general if those things exist, because then people can write tools. As for whether there are tools that consistently pull metadata, I think the answer now is no. There are a lot of tools out there, but one tool that everyone can use in all fields just doesn't exist yet, mostly because the uses of the metadata are so varied. That's kind of the holy grail of metadata, I think. Yeah, with the PyNIDM tools, there's a link to the API in the chat, there is a tool, csv2nidm. It's low-tech, these are all research projects still in development, and as I mentioned we have a grant under review to develop a proper web-enabled user interface, but this is a command-line version, csv2nidm. What it does is take any arbitrary CSV file, pull the variable names out of the header, iterate over those variables, and let you define the properties for your data dictionary, for example. Through some complicated mechanism it will also go out to some ontologies or terminologies and let you attach concepts to certain variables. These are higher-level descriptions, so that age is a measure of, I don't know, time since birth, and it doesn't matter whether you recorded it in years or months, prenatal or postnatal; at that level it's just about being able to query across the data set. This kind of tool will then output a NIDM file, but even if you don't care about that, it will also output a JSON data dictionary file, which can be edited in any text editor. So it gets you to a place where you have some kind of data dictionary with properties like units
and ranges and things, and potentially, if you have categorical variables, it lets you define the categories. So that's a simple way of getting into the game and at least getting a data dictionary for your CSV file that you can attach. There are other tools: one could even use REDCap, the database, and put your form in there, and you get a data dictionary out because you have to describe all the variables. There are some for Excel too, I can't remember the tool from the open science crowd, but they're not general across all domains; they're usually very specific, as Karl said. Right now in our team we are actually developing a metadata template to use in CEDAR. CEDAR is a platform, and we are building this metadata template on it, basically as a form, because it enables us to connect the entries with existing ontologies out there; that is why we chose CEDAR. Our plan is that this form will in the end sit at a data repository: we are now collaborating with Dryad, and potentially with OSF as well if possible, so that every time people upload a data set, they can fill in this metadata template manually and describe what actually happened in the experiment. So I'm curious what recommendations you would have for somebody, perhaps in a different field, who is working on improving metadata-related practices in their own field. What recommendations would you have for them in getting started with that work? In my view, I think a lot of this is sociological, and it's about consensus among the people in your field. So I would encourage you to get
together folks that are interested in that topic in your field. Put together monthly webinars, monthly calls, where you get together like a working group, and you sit down and decide what you think you need in your field, and then you go off and create the different components, whether that's tools or vocabularies or whatnot. That's a really good way to go, I think, because you have friends, you have a team, you're not doing things in a vacuum, and because this is about data sharing, you really need it to be open and you need to get buy-in from other people in your domain. So I'd start there, and as a pre-step I'd say look around and see what's out there, because there may be something you can already reuse, or a starting point that you could build upon. And just to pitch in: once there is a starting point, or a common set of terms agreed upon, it helps to publish a guideline or a template, so that people can refer to it and point to it as a stepping stone for future work in the community. It always helps to have a published guideline that people look to when they want to release their data sets. Yeah, and ReproNim, I just put the web link for ReproNim in the chat window, I forgot to do it before, is a good resource. You can ask the ReproNim center for guidance on these things; they do webinars and training for folks all the time, and they have lots of good tutorials on metadata, so I think they could be a resource for you, even if you're in a different domain, for ideas on how to start and where to go. So I'd encourage you to reach out. We also have a question coming in from the Q&A: is it possible to share too much metadata? Yes, especially when you're dealing with human subjects. There are definitely personal health information rules, HIPAA compliance rules, and EU standards, and so
one does have to be very careful about what one shares, and those things are all documented on the NIH website, for example. Yeah, and as I'm developing this metadata template, it actually requires the potential users to fill in the template by themselves, manually, so to some extent that is extra work. One of the things we're trying to be cautious about is not making the metadata template too long while keeping it as informative as possible, and that's kind of a struggle. I'm also curious how working with metadata has affected your personal research workflow, how you conduct research. How has centering metadata affected that, if at all? Well, data used to be... data would sit somewhere that might or might not be accessible to other people; someone might know that you have the data, and you'd give it to them with some general instructions. But now, because of these sharing requirements, I think most people are starting to realize that it's much easier to share the data if metadata has been attached to it from the beginning. This is what Praveen was saying: as early as possible, get the metadata into some sort of system so that it follows the data around, because doing it at the end can be very painful, especially if someone is going back through a lab book trying to figure out what it was that they did six months ago. So what we've really worked on is thinking, at the very beginning of the experiment, about how you're going to share the data and what metadata you need to provide to make it shareable. What I like to tell my students is: think about writing the methods section. Always keep that in mind, because that will help you always make sure you
have enough metadata to describe the data you're taking to other people. If you've done that at the beginning, then sharing will be a lot easier at the end, and that's one of the reasons we're doing what we're doing. That's assuming your metadata, or your methods section, has all the details; there are a lot of papers that don't have all the details you need. All the details, right, exactly. My dream is that the methods sections of future publications are just JSON files, or some other structured data, so there needs to be no text at all, just information. In my personal experience, my degree is focused on cognitive neuroscience, but right now I'm also conducting a lot of analysis of behavioral data, and there's no existing metadata template out there to describe the experiments, so mainly my experience is just feeling the pain. In my field there has basically never been a metadata template. But recently one researcher who studies confidence collected all the data sets, writing to the authors of papers that had gathered confidence ratings, and put all these data sets, with all the notes, together in one CSV file. There's no proper metadata in there, but recently I found a research question and felt that some data sets in his database might be usable, and going through it, in the end I indeed managed to replicate my results on 13 independent data sets in the database. That gave me the feeling: oh my god, when every data set published out there has a proper metadata template, with the methods information describing the experiments, I will be able to find out so much more, and we can save so much money on collecting new data. If I wanted to collect the data of these 13 independent data sets within my PhD, I would need
at least five years. I love that, because I think that in our field, neuroimaging, anyway, we've moved beyond what can be accomplished with one lab's money and data. You really need larger data sets to ask questions about subtle effects now, and that requires us to reuse data. And data should be reused, right? All the people in the world who pay money to the governments that fund us to collect it... all this money is being used to collect this data, and it should be reusable for science into the future, and combinable, and that relies on us annotating it properly at the beginning, when we're collecting it, so that it's usable. The burden is on us, but that's for a good reason. I think that leads nicely into my last question for you all today. I was hoping we could end by hearing from each of you about something regarding metadata that you would like to see happen within your field within, say, the next five years, and feel free to go pie in the sky if you'd like. I think I've already mentioned it: a methods section that is not text but structured metadata instead of descriptive prose. Yeah, I think the push to make data availability a precondition for getting published has been a really great thing for our field, and now the burden is on those of us who have been pushing for this to develop tools so that the average user in the lab can more easily attach metadata to their data, because it's a learning process for people, you know: oh, what do you mean I have to use standard terms, I've always called it this, and now you're telling me I have to do that. So if we're going to ask people to share their data, and we've demonstrated the usefulness of shared data, then it's really up to us to develop tools that make it easier for the end user. I'm not a very optimistic person, but I
envision that in the next five years, people will actually start to realize the importance of maintaining and developing ontologies. Once the ontology is there, metadata templates are an easy problem. I think maintaining is the keyword, because there are currently lots of ontologies out there, and eventually they fall into disuse and disappear. Yeah, just like how people maintain Wikipedia. So I guess my add-on to the five-year wish would be that funding agencies also realize the importance of ontologies and of maintaining them, so that they don't just fund the development of an ontology but recognize its importance to the field and to the sharing of the data they have required. I agree with what everyone said. You need journals to require that data be shared and open with the publication, you need structured methods sections, like some of our panelists said and others commented in the chat window, and we need funding for tool development in this domain, and for data science tool development generally. But also, like Karl says, you need additional funding for maintenance. We all have to write these sections about how we're going to keep our tools going into the future without funding from our various agencies, and I don't know how you do that. If you want to make good software that we can reuse, and good ontologies that we can reuse, the funding has got to be there to keep them maintained into the future, and I don't really know how you do that. We try to do what we do by collaborating with people, finding funding over here, little bits over there, but it's not how things are done in the commercial world. We have good software tools from commercial companies because there's money behind them, to maintain them and upgrade them. I mean, one library
changes, and it breaks everything if you use that library, right? So yeah, we need money for maintenance. Anyway, those are my comments. Awesome. Well, thank you, everyone in the audience, for coming today; we hope that you enjoyed this discussion. Thank you to our panelists for sharing your expertise on metadata and your advice for folks who are hoping to enter and continue this work, and thanks to the Center for Open Science for hosting. We're very happy that you all were able to join us today.