Good morning, hi. I'm David Millman. I'm from New York University. I think this microphone doesn't work. I will just talk louder. Can you hear now? Sorry. There's something [unclear] in the middle. So hi, I'm David Millman. I'm here with Rick Gilmore and Dylan Simon to talk about the Databrary project. Databrary is a digital library for sharing video. Our current focus is on the behavioral sciences and on a national-scale library. We've had inquiries from other disciplines and from international colleagues, because a lot of the infrastructure for doing this kind of data sharing applies in other settings, and you'll see that a lot of the infrastructure for the policy and process also applies in other settings. So we're not sure where this might lead us. I'm going to talk for a minute or two just to give you a broad outline, and then I'm going to turn it over to Rick to talk about how we're implementing the data sharing process and the protocol. Then Dylan will walk us through the data model, which is pretty flexible, and it'll show you how we're organizing a collection that varies really widely. We are not quite open for business yet, so just a word on our status. We plan to have the first release available in February, so we're a couple of months from opening for business, and we have a number of contributors who are already lined up and ready to deposit materials, so we're almost launched. We're small. There are only a few of us on the team so far, but I want to recognize a couple of people. First, the principal investigator of the project is Karen Adolph, professor of psychology at NYU, and I'd also like to recognize Lisa Steiger, the project coordinator. Karen can't be with us today, but Lisa is here. Can you wave hi? And Lisa will probably bail us out within the next 30 or 40 minutes.
So these are the high-level goals of the project, and today we're going to focus especially on the first one, the repository building, and also the fourth one, about the policies that enable data sharing. That's a fair bit of material and we want to leave time for questions, so we'll see where we are when we get through those, and then maybe we can backtrack and talk about the other stuff too. We're showing you all of these goals because these were actually the outline of our funding proposals. We're funded by the National Science Foundation and also by NIH. The oldest of those is the NSF award, and that's a little more than a year old, about 14 months, something like that. So our intentions are not going to be surprising to you all, but I want to emphasize just a couple of points. We're about sharing and not just about compliance. There are a lot of repositories where, if you deposit there, you're compliant with the NIH guidelines and then you can move along, but those repositories don't necessarily encourage sharing or make their contents easy to share, and we're all about that. So we're thinking about the use cases, what kinds of things people would be able to do with our collections, and we really want people to use it. The other thing I want to emphasize about these use cases is that a lot of them come back to haunt us as consent agreements, policy, and release kinds of language, because most of this is about what kinds of things you can do with the data, and therefore about what kinds of permissions we're going to need to get from people to be able to do that. There's a bunch of these trade-offs. Every time we enable something new, we find that there are challenges implied by doing something new, so just a couple of them.
One thing that's really neat about these collections is that the data is video of people participating in experiments, and so you don't have to do anything with the data to use it in another domain. That's just common sense, but it's not the case in a lot of other research data sets. For instance, if your research is collecting video for motion studies, then somebody who's doing language research can use that same video immediately without doing anything to it. They're just interested in a different aspect of it. You don't have to recode anything; you don't have to touch it at all. The flip side is that this data is pretty identifiable too, for the same reason. It's video, so it's very hard to de-identify or anonymize anything. It's pictures of people, and so you can tell who they are. Another challenge and opportunity at the same time is that the structure of the data varies really widely, because researchers in this field, and researchers in most fields, have really different ways of organizing their material, and it depends on the purpose of their original research. That's in direct tension with trying to create a consistent metadata schema so that people who want to consume material can find it easily, so we find ourselves walking that line all the time too. Another double-edged sword is that we've tried to encourage open data sharing, but how open can it be when we have all of these privacy issues around it? Rick is going to talk in much more depth about what we need to put in place so that we can encourage the appropriate level of sharing. We're getting into different kinds of territory with our institutional review boards, for instance, and with sponsored programs offices, in a way that they're not used to doing business with us, whereas the project otherwise relies on infrastructure in ways that are probably more familiar. It uses a lot of central IT.
It uses a lot of the library for things like metadata management and preservation, but it also is in touch with the IRB, with the sponsored projects office, and with University Counsel to a much greater extent than this kind of project normally would. So I'm just going to stop there, and Rick and Dylan will give you much more depth on the rest of it. Rick's going to talk about how we're enabling the sharing through policies and permissions.
While we're in transition, are there any preliminary questions that came up through David's section? No? All right. So our task is to create a set of relationships that allow Databrary to share identifiable data in a way that preserves the promises given to research participants to protect their privacy. That seems like a pretty tough nut to crack, and it was one of the nuts that we knew we were going to have to crack for this to work. So before I tell you how we think that's going to work, let me refresh your memory about how research with human participants now proceeds in a typical context. In a typical context there are relationships between an investigator and an institution that are bound together by an IRB protocol and ethics certification. All researchers who conduct human participants research have to demonstrate ethics training. That training is supervised or provided by an ethics board or an institutional review board, and then when an investigator chooses to conduct research, which has a specific definition, an application is made to an institutional review board and approval is sought. There are situations where approval is waived, or where there are different levels of approval, but there are formal relationships that bind an institution and investigator together in order to make sure that research is carried out in a way that is ethical and does not harm human participants.
Of course, the other side of this equation are the people who are participating in a variety of research studies, and the fundamental notion there is one of informed consent. So how many of you have participated in a research study that involves informed consent procedures? Okay, many people have, not all. In many of our cases, in my laboratory and in Karen's laboratory and other laboratories, what's involved is actually bringing children, or in Karen's case babies, and their parents or caregivers into the laboratory, and in that situation parents or guardians give consent for their minor children. The notion of informed participation in research comes along with a promise of confidentiality, so I promise research participants that I will not reveal their identities, even if the research study involves nothing harmful at all. Those important promises are part of what we have to find a way to maintain as we move forward with data sharing, and you might think that seems to be antithetical to the notion of sharing, but in fact we think it's not. So here comes Databrary. We're the new kids on the block. How do we change the situation? Again, our insight was not to reinvent a wheel of any sort but rather to build upon the existing infrastructure and strengthen it. So our notion is to ask participants who are engaged in research whether they give permission for their session that's been recorded to be shared. It's a very simple extension of the notion of informed consent, and many of our colleagues in our field actually ask research participants now whether the recorded session can be shared in educational or scientific contexts with others for illustrative purposes. So there's already a foundation and a framework for that.
Then we couple that with an investigator agreement between the institution, the investigator, and Databrary; I'll have more to say about that. That enables that research protocol's information to be shared with Databrary. So when a new investigator arrives and wants access to Databrary resources, all that new investigator will need to demonstrate is that they also have ethics certification and that they're with an institution that is supervising their activities. They have to promise to treat the data ethically, to maintain the confidentiality of the research participants, and sign an agreement that binds them to Databrary in that way. We envision this in the context of one of the use cases that David suggested. Say that a hot new study came across your email yesterday, about babies understanding social relationships in a way that we didn't think they did before, and you'd like to see the result yourself. Perhaps the finding was published in an outlet that allows supplemental material, but supplemental materials are often limited. What if the full data set, the full videos and coded videos, were actually available on Databrary? As an investigator I could go on and look at the videos and decide for myself whether I believe the finding or not.
Perhaps I could even look at the methods and evaluate them critically. That's not really research, right? It's what many institutional review boards think of as pre-research. It's scientific activity that nonetheless must be carried out in an ethical way that respects the privacy of participants, but it's not exactly research and therefore may not require a formal research protocol. Our belief is that many use cases for Databrary will not involve research. They'll involve pre-research activities, scientific activities that may not require institutional review board approval. Now of course IRBs were designed to be the local review of research activities in the diverse set of communities in our country, and so IRB standards vary somewhat. What NYU allows Karen, Dylan, Lisa, and David to do may be different from what Penn State would permit me to do. But we think that in many situations access to Databrary's sensitive resources can be limited to those individuals who have this sort of ethics certification and not require IRB certification. However, when the activity that the investigator chooses to do involves research, the next step would be to seek approval from an IRB, and the relationship between the investigator and Databrary would already be established. So just a review for you. Again, our insights here, and perhaps insight is too strong, because what we're essentially doing is building upon the existing infrastructure, were to seek permission to share from people who are depicted in recordings, which extends the idea of informed consent, and to restrict access to recordings for which the individuals depicted have given permission, limiting it to authorized researchers who have ethics training and who agree to maintain privacy. Now, as we will see shortly, there will be aspects of the database that will be open to the public, but that requires its own
range of sharing. So how do we enable this process? Well, the first mechanism is to create template language, embodying these principles, that other investigators can adopt and add to their existing research schemes. So we've created template documents, and they're on our website, and we also have a GitHub repository, which I will give the URL for shortly. We won't go through the template line by line, but essentially it points out to the research participants that permission to share is not the same as participating in research. That was important for our research community, who were concerned that asking participants to do one more thing might in fact be the straw that broke the camel's back and make them refuse to participate. It talks about data privacy concerns: who has access, and for how long. Our goal is to store the information indefinitely, and that actually poses a challenge for some researchers. For example, my own protocol, which I have modified recently, has a data destruction clause: I've promised to destroy recordings five years after the project deadline. Many IRBs encourage people to destroy data, which of course is antithetical to the goals of this project. The template also says that there's no compensation, has special provisions for obtaining assent from minors, and describes different levels of sharing. So here are the levels of sharing that we're asking people to agree to. The first would be none, so no sharing goes forward, though that particular level of permission, as with all of them, then has to be maintained with those records throughout the lifetime of the information. Then there is shared, and again, the shared data would be shared only with authorized researchers who've met those ethics and other criteria. These are individuals known to Databrary. Their identities are known. There are no anonymous identities within the system, so it's not YouTube or Reddit or Tumblr. There's another level of sharing
which is excerptable. What that means is that individuals can give permission to share videos, and they may also give permission for sections of those videos, what we're calling excerpts, to be created and shown by authorized researchers to the public. So this would allow, in the right circumstance, for the use case that I mentioned before: perhaps a video that illustrates a given finding might be associated with the publication of a particular paper, or it may not be particularly notable but illustrates the range of findings. And then finally there's an open level of sharing, which would be sharing with the public, that we at least contemplate as being something we will support, although we're not certain how far that will go. All right, we need to record those levels of permission accurately. Right now this is a heavily paper-based system in laboratories, so there are paper consent forms, there are paper video release forms, and so forth. We've spent a lot of time developing a template form that records very clearly what those permissions are on paper, and, this is different from a research study, it includes anyone depicted in a recording. That's important because some of our colleagues record in natural settings, home settings, where the focus might be the infant or child, but in fact other individuals are depicted incidentally, and their permission must be sought as well. In order to share we need to essentially follow the lowest common denominator: if any one of those individuals depicted says no, we have to follow that choice. Explicit yes/no boxes and differences for minors and adults are part of our system. We have to get this right. Getting permissions right is fundamental to the enterprise, and so we're in the process of planning, and we'll be moving very quickly to, a system that electronically records permissions and that's specifically linked to session- or participant-level
metadata, and we think this is critical for avoiding data entry errors. Some of these cautions have come from our initial efforts to curate some data sets at NYU, where we realized how easy it is for permission to be recorded inaccurately. Now of course our default will be none, but we need to get this right, and so we think that an electronic system is on the roadmap to doing so, possibly through some sort of spreadsheet template, possibly something slightly more sophisticated such as a web-based permission system. This is a small part of what we think is going to become an increasingly important aspect of the project, and that is a system or set of systems for laboratory data management and the self-curation of scientific data that will make it possible and easier for researchers to share. The more we build on the library, the more we realize how important it's going to be to have those scientific products organized from the very beginning in a way that would enable them to be shared. So we think that this system is better, and I want to say why. We think it's better than the existing video releases because the language about the level of risk is clear and unambiguous. We make it clear that the consent to participate in research is different from the permission to share data; sometimes those are commingled in other contexts. We think it's easier for participants. We think that we communicate a more realistic conceptualization of risk, and that we standardize this across contributors via templates. And I should say the realistic conceptualization of risk stems from our realization that many individuals these days in scientific or educational settings carry with them a digital video recording device. How many of you have a video-capable phone or device in your pockets? Right, most of you do. So in fact, if I were showing private information here in this talk, you could just hold up
your phone and you could record this session, and that information is already in the public. The same thing is true in any class where I teach, or at scientific meetings. Once it's been shown outside of a laboratory setting, it's in the public, and any guarantees otherwise are very difficult if not impossible to maintain. So that's why we made that simple distinction: no sharing, sharing, sharing plus excerpts. We think it conveys the risk in a clear way, and at the same time we have confidence, based on our experience with video release forms, that many participants will agree to sharing plus excerpts. So we're not talking about a small fraction; certainly not a hundred percent, but a significant number. In order for this to work we have to build a user community. That means that our colleagues need to become authorized investigators, and we need to design a registration process. We have an investigator agreement, which I will be talking about shortly, that covers data contributions and these non-research and research uses. We also have been encouraged to work closely with our institutions, as David mentioned, to get institutional sign-off on this process, so that involves sponsored projects offices. At Penn State the legal office is reviewing it, as is the risk management office. We want to cover all of our bases to make sure that investigators who might be participating in Databrary are known to their institutions and that there's a level of mutual trust. So with respect to this investigator agreement, where research conduct is concerned, what we've essentially done on the other side is create a document, an institutional agreement, which we're now calling a data use contribution agreement, that essentially describes who does what at different stages and for different levels of use: whether it's access to data, what the responsibilities of the investigator are, such as applying for training, what the institution's responsibilities
are, what Databrary's responsibilities are, and what these components are. I'll just point out that Databrary's main contribution is to maintain the sharing permissions that have been provided to us by the investigators. Where browsing or viewing data is concerned, everyone has to maintain the privacy of the data, Databrary and the institution included. Databrary will be keeping information about who's accessing what kinds of data, and we may share that with institutions. Importantly, based upon advice from some of our advisors who have been pioneers in data archiving, we've been encouraged to ask institutions to take responsibility in the event that there are ethics breaches. So our institutions will be promising to treat violations of Databrary policies as research ethics violations at their home institution, and to take that responsibility on. Sometimes, informally among the group, we talk about extending a web of trust, which currently exists between an investigator and research participants, and we're extending that web or network to include Databrary and an associated set of investigators who have agreed to certain ethical principles and agreed to abide by them with the permission of their institutions. This project has been an open source project from the get-go, since before we even got the grants for Databrary. Our video coding tool, which we will not cover here, is free and open source, and you can find it here, and all of our policy documents are available under Databrary on the GitHub site and also on our website. We welcome your thoughts, contributions, or suggestions; that's how this thing builds. That includes not only the two documents that I talked about but also term definitions, our philosophy about data sharing and investigator rights, and best practices. So now I'm going to turn the presentation over to Dylan, who will talk about the data model for the kinds of diverse data sets we expect to be curating.
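The sharing levels and the "lowest common denominator" rule described in this part of the talk can be illustrated with a minimal Python sketch. This is not Databrary's implementation: the names `Release`, `effective_release`, and `can_view_full_recording` are hypothetical, and the numeric ordering of the levels is an assumption made so that the most restrictive answer can be found with `min`.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Release(IntEnum):
    """Sharing levels named in the talk (hypothetical encoding, most restrictive first)."""
    NONE = 0      # no sharing
    SHARED = 1    # authorized researchers only
    EXCERPTS = 2  # shared, plus excerpts may be shown to the public
    OPEN = 3      # shared with the public


def effective_release(answers: Iterable[Optional[Release]]) -> Release:
    """Combine the answers of everyone depicted in one recording.

    The most restrictive answer wins. A missing answer (None) or an empty
    list counts as NONE, matching the stated default of no sharing.
    """
    levels = [Release.NONE if a is None else a for a in answers]
    return min(levels, default=Release.NONE)


def can_view_full_recording(level: Release, authorized: bool) -> bool:
    """Authorized researchers may view anything shared; the public only OPEN."""
    if level == Release.NONE:
        return False
    return authorized or level == Release.OPEN
```

For example, if a parent agrees to excerpts but a sibling who appears in the background agrees only to sharing, `effective_release([Release.EXCERPTS, Release.SHARED])` gives `Release.SHARED`, so no public excerpts could be made from that recording.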
So before I start, I just want to say that in building a data model for Databrary we really wanted to start with a very simple, lightweight model that would be enough to represent the kinds of data that researchers have and also to allow discovery. Obviously a big part of this website will be allowing researchers to find videos that will be useful for them, and so that's really where we're starting here. At the same time, we realize that we don't want to impose any strict ontology on the kinds of data that people have, and also there is, as far as we know, nothing in the field that researchers are actually following in terms of standards for storing their data or for representing it. In fact, lots of people are still using pen and paper for a lot of this data. When we started, we started thinking about this concept of a study, and this is a very common concept among researchers for representing a particular research enterprise within a lab. But as we followed along with this, we realized that different people mean different things by the word study. For some people it could be a single statistical analysis; for some people it could be a multi-year project that could generate many research papers. The other problem we found is that the definition of a study can change over time. If we decide, okay, this package of data is going to represent a study, the researcher could later decide, actually, I want to split this into two studies, or add more data to this study. So this study concept has to be very flexible. Instead, we focus at the beginning on the raw data themselves. A participant comes into a lab, you have a video recording, and that video recording is going to stay pretty much as it is for its lifetime. Then you can do further kinds of analysis and research on this video, but the video is really what's important for us, and so this creates this concept
that we're calling a session. A session is simply a period of time in which you've collected some amount of data. It's usually a single video, but it could be multiple videos collected at the same time or overlapping. There could be other kinds of data that you've acquired, audio data, EEG; any raw data that was generated in that session will go into this concept of a session. So for us a session is really defined by the date, the exact time at which those session data were collected, and also, very importantly, as Rick mentioned, the release level that all of the participants have agreed to, and that will affect the permissions applied to the study. But also of course we'll have all the files and other kinds of metadata, which I'll talk about soon. This is probably pretty straightforward for an audience like this, but to represent this to our users we say a session is kind of like a folder, and you can think of this as having a timeline. You can add your video files and specify where in time they were taken. If you have overlapping files, or if you do other kinds of data collection, say you have a document that you scan in, you can add that to this container. Then these last two rows on the bottom are different analyses you might later go back and add. You can do what's called a coding pass of a video, which specifies some sort of quantitative analysis of this video, and add that back on top of this session later. So for each file in a session we store pretty basic, straightforward stuff: just a name and description, which is something for the researcher to identify how this is different from any other files that might be in the session. Of course we keep track of file formats, and for now we're only allowing particular types of files that we understand. In the case of video or time series data, we also store of course the length of the video and
the position in time, as you saw on the previous slide. Most importantly, for every file we also need to know whether it contains identifiable information. Videos with faces do contain identifiable information, whereas something like motion tracking information is not necessarily identifiable, and even a video that just includes somebody's hands is not counted as identifiable in terms of HIPAA requirements. That then also affects the permissions on this file. Once you have a set of sessions, we can collect them into a data set. Often people are collecting the same kind of data from different participants: different people visit the lab, they do the same thing, and so you collect all those sessions into a single data set. But of course the next very important thing is how you actually organize the sessions. There are various kinds of metadata associated with these sessions that are very important, but the data set as a whole is first associated with various types of descriptions that are common across all of these sessions. So we ask researchers for a title and a short description of the data set. We ask them to specify and maintain permissions on this data set: users and groups of users who should have access, and whether or not they wish at this point to share it with Databrary. Of course these are all things they can change at any point. We also require them to explicitly specify and select excerpts, if excerpts are allowed, because we need to know which sections of video can potentially be released to the public. There are also other kinds of documents, files that just describe the procedures, blank forms they used, anything like that, and various other information that could be interesting to researchers. Then we ask them to specify metadata on the sessions, and to do this we've come up with a kind of annotation scheme on the sessions. For users we describe these as groups, so you can apply annotations or, equivalently, put your sessions
into groups, and these groups could represent various different things. They can represent a single participant. They could represent a condition, if you have different kinds of procedures with different subjects. If you have people come back for different visits, they could represent that. And of course, as an annotation system, each session can be placed in arbitrarily many groups, and all of these groups are contained within that data set. So for example, in the case of a group representing a participant, we ask for particular kinds of information that are relevant to researchers, and these are what we found from talking to people are actually the most important pieces of information. For a participant we ask for a participant ID; usually people just have unique IDs that they generate themselves, usually just numbers. We ask for birth date, gender, and race/ethnicity; those are sort of the most important three things. Once we have the birth date here, and of course we have the session date on the session, we can calculate ages, and ages are one of the very important things that inform researchers' work. We can ask for various other things. All of these things of course are optional, and in fact this is a very generic kind of measurement system. You can add your own measurements, so if you have other pieces of information that you collect and you want to store with your sessions, you can add those on. This is our way of allowing people to create their own ontology while not enforcing any. If somebody adds a measure to a particular participant, or to some other kind of grouping, a condition or something like that, other researchers will see that this is now available to them and they can reuse this measurement. So just to make this very concrete, and this should be pretty straightforward: if you have all these sessions, you can group
them by participant, to say which participants were in each of these sessions. You could also add a grouping for visits, if it's actually important which day they came or something like that. So for example this might represent something like a longitudinal study where people come back every year, and maybe you have a partial data set like this. What having this general grouping system allows us to do is be very flexible in terms of representing, and both importing and exporting, data. What we found is that people often organize their data already in a folder structure that kind of represents these groups, but people have different folder structures depending on what they're thinking about. Some people might think the participants are the most important thing and have participants as their top-level folders, by subject ID, and then have separate visits inside those. Other people might focus on visits in longitudinal studies: year one, year two, year three. In this case we can allow people to export their data in either of these forms, depending on what they want, or, equivalently, import it. So I started by talking about studies, and studies, as we see them, are now this layer that you can place on top of data sets. Data sets provide a very good organization for what people are doing in their own labs, to keep track of data as they're collecting it and to keep track of analyses they're doing, but often they want to present their data to other researchers in a very different way. So studies are a data reuse mechanism in which you can say, within a study, I'm going to use these sessions from this data set and these sessions from this other data set, in any flexible way, and this is how I want to represent them, and that will be the face of your research data to the world. We found that this is a very important thing to make researchers feel comfortable about how their
data actually look. Along with pulling sessions into a study, you can also then add other kinds of analyses you do. That was the thing I was talking about in terms of layering: coding, or other kinds of aggregate analysis you do within the session. So, just as an example, in the diagram you have all these data sets, and in particular these data sets do not necessarily need to come from the same researcher. People can aggregate across different data sets from different labs, from different locations, and pull these into a study, represent all of the raw data that went into this study, represent the products of this study, and associate that study with published papers, journal articles, presentations they've done, anything like that.
So just to talk a little bit about our ingest process. The ingest process really starts by talking to a researcher, a contributor, and asking them to identify data that they want to contribute and that they can contribute, if they have the appropriate release forms. Then it's up to them to sort of determine how to group their files into data sets and studies, and also to verify that they have all the correct permission information. So we ask them to start by doing those things, and also by providing all of the top-level information associated with the study: description, publications, things like that. Of course, all of these are things that they can change later. Once we have all this information, right now we're in a very manual data curation mode. We collect information, spreadsheets, things like that, and all of the files that they have, and we try to organize these into flat files representing the list of all the sessions, the list of all the participants, and any other kinds of groupings they have. Right now we're just using simple CSV to import things into the database. And also, of course, we want to collect all of the raw data themselves, the videos, and we try to get as close to the original video, as
high quality as possible. Often this is not possible: people transcode them and remove the originals very early in their process, or do some kind of processing, but we try to get as close to the original as we can. Then we transcode videos into a standard format. We're using H.264, so we've specified a format that is appropriate for our needs. In particular, this is useful for us because it allows streaming via HTML5 video, which lets people get very direct, immediate access to the data. Right now a lot of these processes are fairly manual, but we're working to automate them more and more. That's actually a big part of what we're focusing on right now: starting to move many of these processes from a very manual curation mode to more of a self-curation mode, where users can do these things themselves and review their own data.
Yeah, this is obviously a hard thing. We have to have a system architecture diagram, but basically we have a simple SQL database representing the things I've already talked about, and we store files on an NFS file system. We have a replica site, and we have a web front end on top of all these things. If people have specific questions about that, I'm happy to talk about it too.
The things we're planning for 1.0, which will be early next year, as David said, are of course appropriate study views, and this is a big one, sort of refining this process to make studies look the way researchers like them; implementing search and discovery features; starting to do a lot more of the policy-driven automation, user authentication, things like that; and the self-curation features. I've talked about most of this stuff.
Overall, of course, we know that our goal is really to build a community, and that's really about doing three things at once. We know we need to get more interesting data, and that's one of the things we're focusing on right now. We're going to high-value data sets and doing a very manual, careful curation process on these data
sets, which in many cases have not necessarily even been digitized, and these are data sets that our users potentially will be interested in. That, of course, will attract more users, which will attract more people to actually contribute their data, and hopefully we will be able to grow our community in this way.
So, just to summarize our aims again: we've really talked about Databrary as a video sharing platform and about our policies, but if you have questions about any of that or our other aims, please ask. Thank you all.
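Editorial illustration: the grouping system described in the talk, where sessions are tagged by participant and by visit and can be exported with either grouping as the top-level folder, might be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the project's actual data model or code; the `Session` class, the `export_paths` function, and the subject and visit labels are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Session:
    """One recorded session, tagged with the groups it belongs to (hypothetical)."""
    session_id: str
    participant: str  # hypothetical subject ID, e.g. "P01"
    visit: str        # hypothetical visit label, e.g. "year1"


def export_paths(sessions, order):
    """Return one relative folder path per session, sorted.

    `order` lists the grouping fields from the top-level folder down,
    e.g. ("participant", "visit") or ("visit", "participant").
    """
    paths = []
    for s in sessions:
        parts = [getattr(s, field) for field in order]
        paths.append(str(PurePosixPath(*parts, s.session_id)))
    return sorted(paths)


# A partial longitudinal data set: P02 has no year-2 visit.
sessions = [
    Session("s1", "P01", "year1"),
    Session("s2", "P01", "year2"),
    Session("s3", "P02", "year1"),
]

# The same sessions, laid out participant-first or visit-first.
by_participant = export_paths(sessions, ("participant", "visit"))
by_visit = export_paths(sessions, ("visit", "participant"))
print(by_participant)
print(by_visit)
```

Because the groupings are stored as tags on each session rather than as a fixed hierarchy, the same records can produce either folder layout, and an import could in principle read either layout back into the same tags.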