Good morning. It's a great pleasure to be here. I'm saying good morning, but I'm afraid my inner clock is still saying good night. Let me see actually if this would work. So my goal is to provide a brief introduction to research data issues. And actually, Najla already gave a great talk and there will be some overlaps because there's definitely a common thread in this business. What do we mean by research data? Actually, we are really talking about a very inclusive, broad term. Depending on disciplinary practices, data may mean many things. When we talk about research methods, there are qualitative methods and quantitative methods, and if you look at the outcomes, we very often think about statistics, but text can be an outcome of a research project, as can specimens, code, programs, and a broad range of materials. Scholarly inquiry often starts with a question or with an issue to explore. And what we consider the scholarly communication cycle is a very rich process and it has many outcomes, including formal publications such as monographs, journal articles, or conference papers and, along the way, preprints and postprints. This is, in a way, a very common landscape for us in terms of understanding what the outcomes are. But traditionally, we have been focusing on more formal finished products, although I must say that in the last 10, 20 years, with repositories, there has also been great interest in capturing preprints too. Actually, I must add that within the social sciences, especially for political scientists and economists, sharing the underlying data has been a very common practice. And what we are seeing, especially during the last five years, is that there's a greater interest in going upstream to the research data collection stage and understanding how data are being gathered and the nature of this data. And there are many reasons for this growing interest, and clearly one of them is taxpayer-paid funding and accountability, and so on and so forth.
What we are really seeing is that, depending on stakeholders, depending on communities, there are many reasons to promote access to research data. There's a strong feeling that access to research data would also, in a way, leverage our investment by retooling, by reusing the same research data sets for other purposes. This topic is getting the attention of funders, and in many countries and communities we are seeing funders require data management plans. And in my home institution, at Cornell University Library, we recently, well, over the last two, three years, responded to the interest of the National Science Foundation and the National Endowment for the Humanities in asking scientists to start documenting their research process, especially in terms of the outcomes of their research process. This is really not a requirement in terms of sharing research data, but rather, in a way, a methodology to at least track how these data sets are being created, how they are being stored, and how they will be managed. And reacting to this request, we developed a service point called the Research Data Management Service Group. And actually, I want to highlight one point here. Cornell University Library is really only one of the partners in this data management group. The service areas, the policies, and the procedures really require close collaboration among a range of stakeholders. At Cornell, the collaborators in our group include Cornell's Information Technology unit. We have strong involvement from the Research Vice Provost's Office and, of course, other advanced computing and supercomputing folks. And this is just to name a few stakeholders. So the Library is only a partner. And we offer a range of services, from creating metadata to finding the right metadata standard, all the way to understanding what kind of storage a data set would require.
I must add that this has been a very fruitful process and we have been learning, and we are more and more uncovering what we need to be doing, what we need to be finding and understanding. But it's really just the tip of the iceberg. And let me just illustrate this with an example. Actually, before my example, let me give you some data, which will link better. As we were creating this Research Data Management Service Group, we wanted to get a better sense of the scientists' practices, and we actually decided to run a pilot survey. It's a pilot survey in the sense that we only administered it to scientists who have been involved in NSF projects. It was really not a comprehensive survey. It included 86 or so scientists. But I just wanted to show you a couple of tables before giving you another example. This illustrates, in a way, what I mean by saying research data means different things. And we were also surprised to see that almost 70% of the scientists surveyed said that what they considered their research data isn't in the text format. And when we looked at the file formats, again it was surprising because they were very familiar, common file formats such as TXT documents, spreadsheets, and JPEGs. So let me take you to another example now. This is a good way of looking again at what Nejla mentioned. We asked them. We said, you know, do you plan to share your research data, or are you sharing now? And the predominant answer was maybe, but only a small subset of it. And when we asked why they were hesitating to share their research data, most of the reasons described were related to information policy. They were not sure about the security provisions. They were worried about licensing, commercialization, confidentiality, and so on and so forth.
And actually another underlying reason, which kind of comes through in the free-text comments, is that many scientists are, in a way, somewhat insecure or not certain whether the data set they have gathered is up to standards and follows their communities' practices in terms of how data are captured, how they are recorded, and the type of metadata surrounding their data sets. So let me give you another example before talking a bit about policies, procedures, and infrastructure. Here's Dr. Walcott holding a pigeon. He's one of my favorite scientists. He's at Cornell, and actually he's a premier neurobiologist, and for the last, I would say, 10 years he has been collaborating with several PIs, often with funds from the National Science Foundation, and his research area involves loons. And what he does is he tracks these loons and he looks at their habitats, he looks at their nesting platforms, and his main research questions are related to how loons migrate and to their nesting and reproduction patterns. And actually he and his group have been pioneers, in a way, because they started these methods of tagging birds, so each bird gets a name, they track the bird, and they try to understand the impact of many factors, including the lake and other environmental factors, from pollution to temperature. And as I was preparing for this talk, and I have been aware of this research project for a long time, I said, you know, what if I'm a scientist and I want to see what he has produced? So just doing some research I found out that there's a project site called the Loon Project. It's a really nice website because it not only includes some scientific information but is also a good example of extending science to the public. Some of the information is very accessible to the public. But unfortunately this site actually resides on one of the co-PIs' personal websites.
And then I found hundreds of articles, all from different places, some of them closed, some of them open. Here's an example of an article that appeared in the periodical Behavioral Ecology, and continuing my search I found some data sets put in Cornell's repository. There are hundreds of them. You open them, they're Excel spreadsheets, you look at them, you can't stare at them for a long time, but they are really pretty complex, with different fields and different values. And luckily I also ran into a contextual file, a metadata file. And interestingly it looks like they got really good advice or they had a very strong community, because they did use a metadata standard called Ecological Metadata Language. So there was a file you could open to understand how all these data fields were collected, what they meant, and so on and so forth. But unfortunately it took me probably half an hour to open this file because it had a proprietary extension, and I actually ended up doing some kind of trick, opening it and capturing it as a PDF file. And let me see if this would work, but I also ran into hundreds of sound files, and they are lovely because, as I said, each bird has a name, and depending on the name, this is actually Carol. Of course, continuing my journey, I found video files and so on and so forth. So the lesson I learned here is that, as a librarian, it took me more than an hour to pull all these examples together. They were all residing in different places. Some of them were together, but even when they were together, you had to be really kind of intuitive to understand how they link to each other. And, you know, I find it very useful to have this sort of hands-on exercise because it kind of tells you how challenging this whole field of linking information is.
And in a way, for many of us, this is the vision: to be able to bring together everything from the upstream research stage all the way to the many outcomes of the scholarly communication process in a kind of cohesive, holistic picture, so that we could not only leverage the resources put into this network of research but also advance science. Well, such a vision actually requires an environment where we factor in several issues, and just for today I am going to present to you four quadrants, four policy and process areas, and I put usability at the center because I'm really seeing it as a central issue that overlaps with the other sectors. Let's start with the technical one. And I think the technical one is somewhat familiar to many of us, so it's basically being able to build systems that would enable storage and manipulation of these files and their safekeeping through archiving. But of course it's not that simple. We want this environment to be interoperable. We want metadata standards that would really facilitate discovery, validation, access, and repurposing. We could have fantastic technical infrastructure, but unless we pay attention to sociocultural issues many of these fantastic repositories may stay vacant. So therefore one of the key issues for us is to work closely with the scientists, to listen to them, and to better understand what their concerns are and what their needs are. And I must note here that I don't want us to be discouraged by hearing from them that they don't want to share research data. I think what is important is to understand what would be the enabling or encouraging factors for scientists to start being more open about depositing data. A critical issue is really respecting their community standards and also understanding that there's a diversity of community standards.
And again, another issue is understanding and factoring in their access provisions. For some of them, whether it's professional reputation or concerns about the quality of data, there are many, many reasons. Some of them do want embargo periods, where for five years or for ten years the data would be closed. But again, coming from the U.S., some of the federal agencies are trying to accommodate this in their mandates by requiring research data plans and, in a way, deposit, but also allowing some embargo periods so that this would not be an impediment for the scientists. And of course incentives and rewards, and I think that's a very complex issue that will be evolving, and we will be seeing new methods here. The other quadrant I want to briefly introduce is information policies. Clearly, just like any other information field, research data has a range of implications from an information policy perspective. For the European Union I'm really curious to hear about the copyright issues, but in the U.S. research data are actually not covered under copyright, so that also makes some of the scientists very nervous. You could use different agreements and you could put in place different procedures to be able to protect your data, but still, copyright over research data sets is kind of a blurry line. And I put retention and deaccession as an item here because I have noticed that, at least at Cornell, at this point we are just accepting anything they give to us. But you know, libraries and archives always function with collection development policies, and there are policies in terms of understanding the quality of information but also weeding information. So that's really one of the information policy issues that we may need to look into: how do we decide how long to retain, when to deaccession, and the process. I actually suggest that we look at these research data repositories or services we are developing as a business.
I see it as a business because, if you look at the service framework, we do need resources, whether they are human resources, equipment, or skill sets. We do need administrative models, with management and governance models, to put in place a structure for making decisions, implementing decisions, and assessing decisions. Another critical issue is stakeholders. You know, librarians and archivists, curators, scientists, publishers and societies, governmental agencies, and so on and so forth. We are all approaching it from different perspectives with different needs, and it's really critical for stakeholders to be identified and to be understood if we do want this system to work for us. And of course communication and marketing, as I said, especially if we look at it as a business to run. I put usability at the center of my chart because research data really is a bit different from some of the formats that we are familiar with, such as monographs, articles, or images. So imagine running into a PDF file or a JPEG file: you open it, you look at it. Of course there is some contextual metadata that will help you to understand it, but still I think we have the kind of intuitive background or experience to be able to interpret it. Whereas with research data, I showed you a couple of examples. You know, we heard Carol singing for 30 seconds, or we looked at a data set. For that data to be understood, especially contextualized in terms of when they were gathered, how they were gathered, and what these data points mean, you really do need an information framework around these data sets, including information about how the research was conducted, where and how long it was implemented, README files to explain the data set, and so on and so forth. So I just want to kind of caution us that with research data we may need to pay a bit more attention to the usability issues.
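As a rough illustration of the information framework described above, here is a minimal sketch in Python of a contextual record that travels with a data set and can be written out as a README. The class name, the field names, and the sample values are all hypothetical; they are not drawn from the Loon Project, from Cornell's repository, or from any metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetContext:
    """Contextual information a reuser would need (hypothetical field names)."""
    title: str
    collected_by: str
    collection_start: str  # ISO 8601 date, for example "2009-05-01"
    collection_end: str
    location: str
    method: str
    variables: dict[str, str] = field(default_factory=dict)  # column name -> meaning

    def missing_elements(self) -> list[str]:
        """Return the names of contextual elements that were left empty."""
        text_fields = ("title", "collected_by", "collection_start",
                       "collection_end", "location", "method")
        gaps = [name for name in text_fields if not getattr(self, name).strip()]
        if not self.variables:
            gaps.append("variables")
        return gaps

    def to_readme(self) -> str:
        """Render the record as plain README text."""
        lines = [
            f"Title: {self.title}",
            f"Collected by: {self.collected_by}",
            f"Collection period: {self.collection_start} to {self.collection_end}",
            f"Location: {self.location}",
            f"Method: {self.method}",
            "Variables:",
        ]
        lines += [f"  {name}: {meaning}" for name, meaning in self.variables.items()]
        return "\n".join(lines)


# Illustrative values only; not taken from any real data set.
example = DatasetContext(
    title="Example nesting observations",
    collected_by="Example field team",
    collection_start="2009-05-01",
    collection_end="2009-08-31",
    location="",  # deliberately left empty to show the gap check
    method="Weekly visual survey of tagged birds",
    variables={"bird_id": "Tag identifier assigned to each bird"},
)
print(example.missing_elements())
print(example.to_readme())
```

A check like `missing_elements` is one possible way to point out contextual gaps at deposit time without blocking the deposit itself.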
And just to name a few other usability issues, again, depending on the community standards, ease of deposit will be a very, very important factor. Unfortunately we will often be facing this challenge of ease of deposit versus completeness of deposit. I'm the program director for arXiv at Cornell, and one of the huge challenges we have is that we have a rather simple metadata set for depositing articles to arXiv.org, and we know that it's not sufficient, but whenever we experiment with adding anything new we see a bit of frustration or discomfort coming from scientists. So it's about hitting the right point: getting sufficient data but also not discouraging them from depositing. Again, I mentioned that research data will be a bit different, from a usability perspective, from some of the formats that we are comfortable with. One difference is that, to be able to not only understand but also interpret and repurpose data, you will often need tools, whether it's spreadsheets, statistical analysis applications, or visualization applications. It doesn't mean that as libraries and research institutions we should get into supporting research data analysis ourselves. However, we should at least be aware of service providers, to be able to connect those scientists who are retrieving data from repositories with services so that they can analyze this data, mine it, integrate it, and visualize it. That's really the full circle. And I have other things which are very common: obviously we do want data sets to be persistently located, so that there's an identifier that would work tomorrow and ten years from now, and citation standards, so that there's a common and trustworthy way of referring to these data sets, so, again, being able to track them back.
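To make the point about persistent identifiers and citation standards concrete, here is a small sketch in Python that builds a citation string from a few descriptive elements and refuses to produce one when the identifier is missing. The function name is hypothetical, the element order (creators, year, title, publisher, identifier) is one commonly used pattern rather than a standard endorsed in this talk, and the identifier shown is a made-up placeholder.

```python
def format_data_citation(creators, year, title, publisher, identifier):
    """Build a data set citation; requires at least one creator and an identifier."""
    if not creators:
        raise ValueError("at least one creator is required")
    if not identifier.strip():
        # Without a persistent identifier the citation cannot be tracked back.
        raise ValueError("a persistent identifier is required")
    names = "; ".join(creators)
    return f"{names} ({year}). {title}. {publisher}. {identifier}"


# Placeholder values for illustration; the DOI below is not a real identifier.
citation = format_data_citation(
    creators=["Example, A.", "Sample, B."],
    year=2012,
    title="Example nesting observations",
    publisher="Example University Repository",
    identifier="https://doi.org/10.1234/example",
)
print(citation)
```

Keeping the identifier mandatory reflects the idea that a reference should still resolve tomorrow and ten years from now.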
And then my last bullet, metrics to track and communicate impact, is a very, very tricky one, because I'm seeing that in the bibliometrics community it's a bit divisive: those who are in the business of information management, those who are funding research, and the scientists have very different takes on metrics. A metric might say 20 articles produced, and some scientists get a bit annoyed with that, because you could have one article that has incredible impact and power versus 20 articles that you are just producing for tenure qualification purposes. So I tried to present just some of the issues that we will need to address in terms of policies and procedures, and actually I want to conclude by reminding us that scholarly communication is a socio-technical system. It's very complex and it has evolved over several centuries, and as we are trying to introduce new patterns and new expectations, we really need to understand the dependencies among these variables. I was only able to present some of the top ones today. Understand that it is a system, and if we want this system to work, we need to attend to the different quadrants so that we will be able to advance science. Thank you.