Welcome, everyone. It's time to convene our Spring 2011 CNI member meeting. There are a number of seats over here if anybody in the back is looking for seats. Let me welcome you all to San Diego. It has been a pretty long and nasty winter on some parts of the east coast, at least, and I hope that some of you who have come from there are enjoying an early preview of spring here, and no snow. I know many of you have traveled long distances to be here, and we appreciate your efforts. I want to extend a particular welcome to our international attendees. I know that getting places internationally seems to get a little more complicated every year, and we're very glad that you could join us. While I am making welcomes, I also want to extend a special welcome to some of Chris Borgman's graduate students from UCLA, who have made a special trip down here to be with us while Chris receives her award; I'm delighted that you were able to join us. CNI's founding director was Paul Evan Peters. Some of you knew him. Many of you, I fear, particularly as time goes on, did not have the opportunity of getting to know him. He died suddenly in 1996, and CNI did a couple of things, working with other colleagues, with his family, and with partner organizations, to honor his achievements and his memory. One thing that we did was the Paul Evan Peters Scholarship, which helps to underwrite a graduate student in the information sciences every year. The other thing we did was the Paul Evan Peters Award. This is presented jointly by CNI and its sponsor organizations, the Association of Research Libraries and EDUCAUSE.
The award, and you can find some information on it in this little blue folder which we put in everybody's registration packet, was really intended to recognize Paul's intellectual passions: to honor the sustained achievements of people whose creative and lasting contributions changed the way we do teaching, learning, and scholarship, and the way we operate as a society. Paul, of course, was interested not just in higher education; he was deeply concerned with the broader social implications of technology and how society would be changed as technology evolved. And I think that the award winner we're going to honor and hear from today is a particularly fine choice for recognizing that passion of Paul Evan Peters. We had a nominating committee which took care of the award this year, and while only Joan Lippincott is here with us from that committee, I do want to recognize the hard work and contributions of the committee members: Marjory Blumenthal of Georgetown, Nancy Eaton of Penn State University, now retired, and Bill Hogue of the University of South Carolina. They, I know, wrestled with a number of outstanding candidates. When you think about the list of recipients of this award, Dan Atkins, Paul Ginsparg, Brewster Kahle, Vint Cerf, and Tim Berners-Lee, that's a list that gets more challenging with every cycle. And I think that the committee has met the challenge wonderfully. The Paul Evan Peters Award for 2011 goes to Christine Borgman. She is professor and Presidential Chair in Information Studies at the University of California, Los Angeles, UCLA. She is a longstanding colleague of mine.
We've tracked each other's work, I think, for more years than I'm going to talk about, going back to trying to understand what large-scale online catalogs meant as one of the first public-access information systems at scale, back when MELVYL was being deployed at the University of California, and through organizations like the American Society for Information Science and Technology. Chris has done a lot of things, and I'm not going to try to summarize them all, but I just want to touch on a couple of them. She has been deeply concerned with science policy and with policy issues in technology and society for many, many years, going back to the early days of things like the computers and privacy conferences. So she has real credibility in that world, which not a lot of people have. She's also done a good deal of system building over the years, as part of teams that have worked on a number of scholarly information systems. In recent years, she's been deeply engaged with the Center for Embedded Networked Sensing at UCLA, which conducts a really fascinating portfolio of work ranging from very technical questions around the deployment of sensor networks all the way through the policy issues of how you describe, manage, and share the data streams that come out of those networks. She's had a longstanding interest in the interactions between teaching and learning on one side and technology on the other. A number of you probably remember her wonderful talk here a couple of years ago reporting on the work of the task force on cyberinfrastructure and cyberlearning that she was asked to chair by the National Science Foundation. That report on cyberlearning still stands as probably the most substantial look at what this panoply of infrastructure technologies could actually mean for teaching and learning, as opposed to higher-end research and the training of advanced graduate students.
And I think it has been quite influential, although we can still hope, I think, to see a real funded program at scale following on from that work. I could go on at considerable length about other things she's done, but I just want to mention her work as an author, and specifically the last two books she's written: From Gutenberg to the Global Information Infrastructure and, more recently, Scholarship in the Digital Age. These are both wonderful synthesizing books that look very, very broadly at a series of complex evolutionary developments that sit between the social and the technical, with a dollop of economics and policy and other seasoning in there as well. Between them, I think they make a tremendous contribution in outlining and illuminating the evolution of scholarship and scholarly communication as altered by technology in recent decades. I'm consistently struck, as I look at these two books, by the resonance back and forth between the interests and agenda of the coalition over the last 20 years and the developments which she documents, analyzes, and illuminates in these works. So I'm particularly delighted to see this award go, in part, to recognize a body of synthesizing work that, as I say, sits on the boundary between the technical and the social, and that is so consistent with the interests and passions of Paul Evan Peters on one side and the trajectory of the coalition on the other. Please join me in welcoming and honoring Chris Borgman as she delivers her Paul Evan Peters lecture, "Information, Infrastructure, and the Internet: Reflections on Three Decades in Internet Time." Welcome, Chris. Thank you. Thank you, and thank you, Clifford, for that beautiful introduction. It is indeed a great honor, not only the award itself, but the great honor and pleasure of being able to give this lecture here in front of my many longtime friends, colleagues, current students, and current collaborators as well. May I have the slides, please?
It is indeed a rare chance to get to talk to this audience, and Clifford suggested that I use this time as an opportunity to reflect on the field and to prognosticate about where we might go from here. Clifford has always been one who's given me very good advice, for several decades now, so I took that opportunity. I've been working on this for some months, and in the process it became, I think, the outline for the next book, so I especially welcome your discussion and commentary. I'm going to conclude by laying out four grand challenges that are concerned with networked information and that speak to what I think we need to do going forward in addressing access to networked information, bringing together the technology, the policy, and the people issues. So the parts of the talk are, first, to spend some time on where we came from, then where we are now, and where we might go from here. As for where we came from, let's talk about information infrastructure and the internet and the foundations that were laid by the previous winners of this award. The first, of course, was Tim Berners-Lee in 2000. Note that this is actually the 20th anniversary year of the World Wide Web, and plenty of us could wonder how we lived without it. I think what we gave him credit for was recognizing that what we needed was an infrastructure that was much lighter weight, where it was easier to be an author and easier to be a searcher than was possible on the internet before. His real innovation was recognizing that hypertext could revolutionize the way we thought about access to information and about infrastructure. That infrastructure scaled wildly beyond his or anyone else's imagination, and he's continued to innovate with the Semantic Web and the Web Science programs since then. The next award went to Vint Cerf, whom UCLA proudly claims for his Ph.D.
in computer science. He's probably best known for the TCP/IP work with Bob Kahn, but then also for founding the Internet Society, for his work in founding ICANN, and for his recent work with Google and with NASA and the Internet Society on building an interplanetary internet; they really are working on protocols to deal with things like the time lag for packet switching, which doesn't work very well when you're going from here to Mars and back. The third award went to Brewster Kahle, a delightfully disruptive individual. The last time I saw Brewster, he took me by the elbow and said, let's plan something disruptive together, and he's still doing that. Again, he's somebody who recognized that the internet needed archiving and access, and I think he got there early enough that he got ahead of the policy game; if he tried to do it now, it would probably be almost impossible. He built on an extant architecture, and he built the Internet Archive in ways that have scaled far beyond what anyone expected. He's dealt with the contributed content, moving into areas of personal digital archiving now as well, and has held a series of conferences in that area. Paul Ginsparg came next in the two-year cycle. Note that this is also the 20th anniversary year of arXiv, and again, what Paul recognized was that we could rethink access to scholarly information. We could build on a behavioral infrastructure, a practice infrastructure, around sharing preprints, in such a way that we could get open access publishing, we could get institutional repositories, and really scale up and speed up access. That scaled hugely: arXiv now gets over 6,000 submissions per month and averages something on the order of 80,000 connections per hour. There are several iPad and other applications already just for access to arXiv. Dan Atkins, in 2008 the most recent award winner, is another series of firsts. He was the founding dean of the School of Information at the University of Michigan.
He realized that you could take a very strong but traditional program in library and information science and turn it into something much broader. He came out of engineering; he had been doing kind of big-iron computing and said, I'm tired of just building faster cycles, let's do something much broader. He was also the founding head of the NSF Office of Cyberinfrastructure, and he chaired the blue-ribbon panel which really coined that term, cyberinfrastructure. So he saw those opportunities, and he also saw the opportunities around expanding internet technology to a much broader base to support scholarship. And as Clifford mentioned, he then recognized the need to expand that yet further, to think about cyberlearning. It was Dan and Cora Marrett who organized that task force, on which Clifford and I both served. So with that, you've now got the librarian's daughter from Detroit: that's Betty Borgman, longtime reference librarian at Wayne State University, and me sitting on the shelf. So I certainly have a genetic disposition to managing information; I come into this field honestly. But I'm going to give you a few minutes on how I got here, which I hope may be instructive as we think forward about educating our next generation of information professionals. I got a bachelor's degree in mathematics from Michigan State University, and the main careers open to women with bachelor's degrees in math were teaching high school or becoming programmers, neither of which was particularly attractive to me; I tried student teaching and said, I don't think that's it. So I did a non-obvious thing, encouraged by my mother, which was to go to library school, and we really did call it library school back then. But they didn't know what to do with me either, and lacking a scholarship, I taught algebra and trig in this wonderful building, the Cathedral of Learning, and was also kind of paying the rent by tutoring Vietnam vets in calculus.
And that's when Allen Kent, the head of the program, from whom I had first learned information retrieval from this book, Information Analysis and Retrieval (the earlier editions were 1962 and 1966), recognized that he had this math major floating around and nobody was taking advantage of me. So he hired me to work on some of his big National Science Foundation projects. He already had a NASA regional dissemination center. I worked on the Pittsburgh Information Retrieval System; I did user training and interface design on that, and I also staffed the first implementation of the New York Times Information Bank outside of the New York City offices. He was building up a very interesting bunch of people around him. Before he hired me, he hired this fellow Paul Evan Peters, and Paul was on the same set of projects I was on, working in another part of the team in another building, and regrettably I didn't have the opportunity to get to know him as well as I wish I had. But we come from that same training, that same background, and we went in similar directions: he went to Columbia University in library automation, and I went to Dallas, to the Dallas Public Library. The reason I went there was that Lillian Bradshaw, the head of the public library, had managed to make the library the largest priority of city computing, over police and fire. We wrote an online catalog in assembler language, which we brought online in 1978, and Lillian Bradshaw made sure that that catalog was on the desk of every city employee who had a computer terminal. That had to be the most networked online catalog of the 1970s, so we definitely got a chance to be first. But I was already seeing, in trying to implement those online catalogs, that the technology was not the hard part.
The hard part was much more the people, the management, the institutions, and the policy, which is why I went to Stanford for the PhD in communication, to work with Bill Paisley and Everett Rogers, very well-known sociotechnical researchers: Everett Rogers for the diffusion of innovations and Bill Paisley for SPIRES, the Stanford Physics Information Retrieval System. I wrote a dissertation, again on online catalogs, on the user's mental model of an information retrieval system. Among the reasons for that is that OCLC had picked me up and funded most of my PhD, with me working as a research assistant on what was then a very large, early-1980s project funded by the Council on Library Resources (it was CLR at the time) that went to the Research Libraries Group, to OCLC, to the Library of Congress, to Joe Matthews, and to the Division of Library Automation of the University of California, which is when I first started working with this fellow. So we go back a long way in thinking about networked information; as he said, he was bringing up MELVYL and we were trying to bring up other systems. I've been at UCLA since 1983, having been recruited by Bob Hayes, from whom I also learned library automation; his handbook on data processing for libraries educated several generations of people in those early days of library automation. But there were detours to work in other countries and other places: first to teach at Loughborough University in Britain, where the British Library funded a speaking tour around the country and I really got to know my colleagues there. Just after the political changes, I was in the first wave of Fulbrighters to go into Hungary, and actually among the first to go into Central and Eastern Europe; I was in Budapest and spent a large amount of time in the 1990s working with the Soros Foundation, and that's where I really began to understand infrastructure and information and the internet and how they came together in that very turbulent time of political and social change.
The From Gutenberg to the Global Information Infrastructure book really came out of what I learned comparatively at that time. And Oxford, where I spent my last sabbatical, is where I wrote the first draft of Scholarship in the Digital Age; again, the chance to work with that very rich multidisciplinary community and to have access to the Bodleian Library as well as to the University of California resources made it quite the heavenly place to be for a scholar. So where are we now? I'm going to take us through a tour of what I think are four trends that I've seen in that time, from the 70s through today, and use those to set up where I think we need to go from here. So first, let's talk about infrastructure per se. Cyberinfrastructure is the term for networked information best known to this community. Dan Atkins and Tony Hey are down there on the bottom right; Tony was pretty much Dan's counterpart as head of the eScience program in Britain. Together they saw that you could build on an infrastructure of technology, people, and policy to develop a new kind of scholarship that is much more information- and data-intensive, distributed, collaborative, and multidisciplinary. So that popularized the term, and that's kind of where we are now in thinking about networked information and scholarship. But it's worth spending a couple of minutes on what we mean by infrastructure, because infrastructure is what we are building, and sometimes it's visible and sometimes it's not very visible. It's not a new term, it's not a new idea, but it was a paper in 1996 by Susan Leigh Star and Karen Ruhleder, coming out of the digital libraries initiative which many of us were working in at the time, that came up with these eight dimensions of infrastructure. Those dimensions were then mapped into this very nice two-dimensional slide by Florence Millerand as part of a very important NSF report. These slides will be available.
Don't try to read this small type right now; you can get the slides later. It's a way of looking at the set of arrangements among these dimensions. So let's look at the technical-to-social dimension first. On the technical side, infrastructure is something that builds on an installed base. MARC records are a fairly obvious example for this community: we now have, who knows, hundreds of millions at least of MARC records around the world, and that's part of what made it so easy to then build integrated library systems that you could pour those records into. But the installed base also constrains: it's very hard to leave those records behind and go off to something completely new. Infrastructure is the embodiment of standards; you need interoperability for infrastructures, whether railroads, telephones, libraries, or the internet, to work. On the social end of the spectrum, off on the right, infrastructure is something that you learn in your education and in your workplace. You learn how the University of California works, you learn how your particular institution works, the kinds of things we teach people in becoming information professionals. At the bottom are the notions of being both embedded and visible upon breakdown: you don't realize quite how dependent you are, whether on the online catalog or the local area network, until it suddenly ceases to function, because it is very tightly coupled with other things. At the top, the reach or scope may be global, and transparency, in Star and Ruhleder's terms, is pretty much the opposite of being visible. The Japanese are learning right now, for example, how deeply interconnected their power systems and their transportation systems are. So hold that sense of what we mean by infrastructure in mind; I'll come back to it throughout the rest of the talk.
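To make the MARC installed base concrete, here is a minimal, illustrative sketch that renders a simplified MARC-style bibliographic record in the human-readable "breaker" notation. The record data are hypothetical, and this is not a full ISO 2709 implementation; it only shows the tag/indicator/subfield structure that hundreds of millions of records share.

```python
# Illustrative sketch: a simplified MARC-style bibliographic record in
# "breaker" notation. Hypothetical data; not a full ISO 2709 parser.

def to_breaker(fields):
    """Render (tag, indicators, subfields) tuples as MARC breaker lines."""
    lines = []
    for tag, indicators, subfields in fields:
        body = "".join(f"${code}{value}" for code, value in subfields)
        lines.append(f"={tag}  {indicators}{body}")
    return "\n".join(lines)

record = [
    ("245", "10", [("a", "Scholarship in the digital age /"),
                   ("c", "Christine L. Borgman.")]),
    ("260", "  ", [("a", "Cambridge, Mass. :"), ("b", "MIT Press,"),
                   ("c", "2007.")]),
]

print(to_breaker(record))
```

The fixed tag and subfield grammar is precisely what makes such records easy to pour into any integrated library system, and equally hard to abandon.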
These are the four trends that I want to take us through: a transition from a more closed to a more open world; from a more static kind of information and context to a more dynamic one; from a focus on readers to a focus on authors; and a transition from publications to data. And it's where these trends have brought us that leads to the challenges. So first, this transition from closed to open. In the 1970s we had an internet, but in those early days of retrieval systems it was not at all available to anyone beyond the core research and military networks. In the early days of DIALOG we were using dial-up, and we were using proprietary networks. So it was the research community; most of the services we were concerned with were bibliographic, and even the network itself was closed. I'm going to see if I can date this audience: how many people remember what NREN was? Oh, a lot of you came along with me too. The National Research and Education Network was only open to universities and educational institutions. So we built up this base of standards, these cataloging rules, the MARC formats, things that let us build these integrated commercial library automation systems. But it was a pretty inward-looking set of stakeholders. From the 90s on, that changed. Al Gore's speech on the information superhighway was actually at UCLA, in Royce Hall, one week to the day before the big 1994 earthquake. That was the time they changed the policy and allowed others besides universities, research, and the military to interconnect with the network. And things have been very different in the time since. Networked information was no longer the fairly exclusive commodity of universities and research; telecommunications moved to this more commodity internet, and the kinds of standards we were concerned about became internet protocols, operating systems, and the World Wide Web Consortium.
So libraries, universities, and EDUCAUSE kinds of communities became a smaller part of the overall enterprise, and libraries have had to expose their content to search engines to be discoverable. Where are we now? Now we're in this very funny interim stage, kind of mixed open and closed. We've still got a largely open internet in many areas, but to borrow the phrase from Jonathan Zittrain, we may be moving to the end of the generative internet. His book, The Future of the Internet and How to Stop It, is a highly recommended read. His concern, and mine here too, is that as we move into things like an app world, we have much more gated communities. You've now got three platforms for applications: you've got Apple, you've got Google Android, and now Microsoft is making a big play in its joint agreement with Nokia. And each of those players has complete control over what can be put on its platform. That's moving away from those more innovative days when anybody could program whatever they liked, to work on whatever they wanted. That's Zittrain's concern, and mine here as well. The search engines also are optimized for commercial content. The installed base is going to be much more outward-facing in this mixed environment. So that's one trend. The second trend is the shift from a static world to a more dynamic world of thinking about information. In those days when Clifford and I were really getting our feet wet, when things got published, they stayed published; libraries really could mark it and park it. It just doesn't work that way anymore. Think also about the difference in context. Back when we devoted an entire academic term of a master's degree in library science to learning to use DIALOG and ORBIT and BRS, we trained people to learn very closely how to do those Boolean searches, how to read those blue sheets, how to understand exactly how searches were executed.
So if three different people executed that search at the same time from three different places, not only would they get the same result, they would know how and why they got that result and what they could trust. That's the big shift: when you do a search in today's world, the search engines first look at who you are and where you are. The same search gets different results for different people, and the same person on a different device is going to get different results, because the engine is looking at things like geolocation and accumulated previous searches. This is certainly not the world of information retrieval that we grew up with. You've got many versions of documents; you look at things like the Internet Archive and you see this whole chain of different things that have been published as the websites continue to shift. So this raises questions of trust and of reproducibility. This new world of dynamic information is not one that was built upon the assumptions of scholarship; it is not one that was built around the idea of standing on the shoulders of giants and of accumulative approaches to information. It's a very mixed kind of environment, and so the continuity is very different. Information retrieval is no longer about just sending out a query and bringing back some set of relevant objects, as we originally thought about information retrieval. It's much more like a snapshot in time: what you get right now is what you get, where you are, and with a number of other factors that you may not be able to know. It's capturing a flow of information. Now, following links and relationships and treating objects as interdependent in many ways reflects the flow of scholarship better than the old model of retrieval did, but it's a very different way of thinking about cataloging and about metadata than the library structures we have built. Thirdly, the scholarly journal.
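The deterministic retrieval of the DIALOG era can be sketched in a few lines: the same Boolean query over the same index returns exactly the same set for every searcher, with nothing depending on who or where you are. The toy documents and the single AND operator below are illustrative assumptions, not a reconstruction of any particular system.

```python
# Minimal sketch of deterministic Boolean retrieval in the DIALOG/ORBIT
# style: same query, same index, same result set for every searcher.
# Toy documents for illustration only.

docs = {
    1: "online catalogs and user behavior",
    2: "sensor networks and data curation",
    3: "online access to sensor data",
}

# Build an inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def search_and(*terms):
    """Intersect posting sets: classic Boolean AND."""
    result = set(docs)
    for term in terms:
        result &= index.get(term, set())
    return sorted(result)

print(search_and("online", "data"))   # doc 3 only
print(search_and("sensor", "data"))   # docs 2 and 3
```

Nothing here consults geolocation or search history, which is precisely the contrast with today's personalized engines: the result is fully explainable from the query and the index alone.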
Is there anyone who does not recognize that image on the left side, or who will admit to it anyway? That's the cover of the first English-language journal, in 1665: Philosophical Transactions of the Royal Society. And that's the same journal on the right, continuously published for nearly 350 years. Notice that you've got the title, you've got the volume, you've got the year, you've got the statement of responsibility, you've got the place. The journals have actually not changed all that much in that period of time. But the way we think about how they're being used is changing in some interesting ways. The shift from readers to authors is one that was first brought to my attention by Kimberly Douglas, the university librarian at Caltech, who started as a reader services librarian and said, you know, we're spending a whole lot more time nowadays doing author services. So think of that wonderful library at Ephesus: for much of our history, we have collected information at the time other people were done with it and then made it available for others to use. There was also a much cleaner line between what libraries did and what authors and publishers did. When you were ready to publish something, you took it to an editor, who would then work with you, do these various things, and serve a global clientele. Nowadays, we've got much more of a do-it-yourself kind of world. Any of us who have published in conference proceedings in recent years know we need to work with these really painful templates; we have to produce absolutely camera-ready copy. Now, we did, earlier on, use that horrible blue-lined paper where you had to type within a certain number of centimeters or inches, so some of this is old and some of it is new.
But authors also have to spend a lot more time doing things like negotiating copyright, posting supplemental materials, and depositing things, and they're being expected to maintain their own access to data under some of these new and changing rules. So the outcome is that the activities have been rebalanced in a number of ways, the relationships are changing, and the roles of information professionals are changing. Some would argue also that we're moving to a stage of universal authorship, where people writing on Facebook and writing email are actually writing narrative and writing text on a day-to-day basis. That part of it may be good, but there are other parts that are maybe a bit more questionable. Fourth is this data deluge, and that's where most of my research has been in the last decade. This deluge is coming down on all of us, and we're drowning not only in data but, in many ways, in information. Much of that data is runoff; it's not curated, and maybe it shouldn't be kept at all. We need to identify what's worth keeping, how to keep it, and the tools and services to make it useful. So how has that shifted over time? Again, we've got about 350 years here, and it was kind of at the end of the project that the libraries and the archives showed up and said, okay, we're ready, you know, you're done, the project's over, now we'll take those publications. And we got the social structure, again long negotiated: peer review, citation. Data, of course, evidence, has been around for a very long time, and it was always heterogeneous, but we saw it much more as process, as interim product; it was very much embedded in practice, and it was something that sociologists of science might be concerned about. The scientists were concerned, but it was outside our world. Now it is very much within our world, and the nature of the publications is changing quite a bit as well.
In this very data-intensive, information-intensive environment, speaking to multiple disciplines, we don't just write one journal article at the end of a multi-year project. We have these various snapshots in time, we have the multiple pieces that come out, and they need to be connected in some way. That's, again, something arXiv is particularly good at: showing how we get from one to the next. This was version one, this was version two, this was version three. But in a lot of places that's not the case. We disseminate them in different ways, and the data are now both process and product, where the National Institutes of Health, the National Science Foundation, the Institute of Museum and Library Services, and many other funding agencies are expecting the investigators to keep the data, to manage it, and to make it useful going forward. And we're also getting it disseminated in a lot of different ways that we can't quite make sense of. So where we are now is that the social structure around publications is still fairly stable, and we're trying to emulate it in some ways for data, but we just don't have that history yet. And it's not matching behavior, it's not matching incentives very well; that's something I'll come back to at the end. There's a workshop coming up this summer dealing with issues of how you give credit, and with the question of whether people don't cite data well enough or whether the data are not in a form that is easy to cite. Who should be in charge of metadata: the authors, libraries, repositories? There are several dissertations in progress amongst my students who are dealing with a number of these things. But it's not at all clear yet who's going to take responsibility for what content, for what periods of time, much less who's going to pay for this kind of stewardship. Okay, where might we go from here? This is where I want to spend the last part of the talk and pose some large challenges to you going forward.
This background slide is from Vint Cerf's interplanetary internet project. And the bottom right is from our portion of the Data Conservancy, where we are billing ourselves as the curators to the stars. Which is to tell you the sky is actually not the limit of where we could go from here. Certainly if you listen to Vint Cerf it's not the limit. But I think the next stages from here are very hard, which is why I put them out as challenges. And they're challenges to this community and to all stakeholders in networked information, so that we can be thinking about where we want to be going forward. What do you want information infrastructure and the internet to be 10, 20, 30 years hence? So here are my four grand challenges, which I will take us through. One is boldly to take back information retrieval. Second is to engage the entire information life cycle. Third is to distribute the architecture. And lastly is to match policy to incentives. And I will work us through each of these. I think there are people in this room who are in a position to think deeply about these things and to invest considerable research time in them. First, information retrieval. Libraries and schools of information have largely relinquished control of information retrieval, both to computer science departments and to commercial development, in the last decade or two. Search engines are optimized for unstructured data and for the commercial market. They give you that really clean, simple little box that you can put two or three words in. And what they do, they do very well. But they don't do everything, and they don't do enough. In many respects they're a step backwards from where we were starting to get to in tailoring retrieval to the different kinds of information, the different kinds of data, the different kinds of structure. What you've got in astronomy is very different from what you've got in archaeology, much less what you've got in art history.
These are some of the things that we need going forward, which I have listed here under "tomorrow." First of all, on discoverability. We need generic search. The search engines are important to us, but we need to use them to discover some of those more specialized search engines. Who's going to invest in those specialized engines, whether for astronomy, archaeology, art history, or take your pick of fields, is again one of the questions we need to think about. Second, on organization and retrieval. We need to rethink those and go back to some of the fundamental concepts of the 1970s. Questions of aboutness, which were central and are still theoretically central to subject access, are ones that we've not revisited in far too long. Search engines try to deal with aboutness a little bit in a semantic way, in terms of matching different words together, but they tend to do it within the four corners of the document. What they've done is let go of most of the cataloging, the kinds of description that sit outside those four corners of the document and set the context for them. They've tried to do it without that. What they have done well is linking related objects, at least in a syntactic way. That's where the hypertext and the citations work so well, but they've not done it very well in a semantic way of tying together related objects. Again, scholarship and writing and authorship work in sequential ways. There's meaning that we can pull together. This is where we need things like Object Reuse and Exchange, and the Open Archives Initiative Protocol for Metadata Harvesting as well: we need to be able to assert these kinds of relationships so we can enrich those links. Aboutness is one way to get there, but linking in a much more semantic way is something where much more research is needed and that has fallen by the wayside. Thirdly, these questions of reproducibility are essential for us to go back and think about.
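The idea of asserting typed relationships between related objects, so that links can be enriched beyond bare hyperlinks, can be sketched in a few lines. This is an illustrative toy in the spirit of OAI-ORE resource maps, not the actual ORE vocabulary or any real API; all class names, predicates, and URIs below are invented for the example.

```python
# Sketch: asserting typed ("semantic") links between scholarly objects,
# in the spirit of OAI-ORE resource maps. Names here are illustrative,
# not the real ORE vocabulary.

class ResourceMap:
    """Aggregates related objects and the typed links between them."""

    def __init__(self, aggregation_uri):
        self.aggregation_uri = aggregation_uri
        self.triples = []  # (subject, predicate, object) assertions

    def assert_link(self, subject, predicate, obj):
        self.triples.append((subject, predicate, obj))

    def related_to(self, uri):
        """Return every (predicate, object) asserted about a subject."""
        return [(p, o) for s, p, o in self.triples if s == uri]


# Tie together an article's versions and its underlying dataset,
# the way arXiv-style version chains and data links might be expressed.
rm = ResourceMap("ex:aggregation/article-42")
rm.assert_link("ex:article-42/v2", "isRevisionOf", "ex:article-42/v1")
rm.assert_link("ex:article-42/v2", "isSupportedBy", "ex:dataset-7")
rm.assert_link("ex:dataset-7", "wasCollectedWith", "ex:instrument-3")

print(rm.related_to("ex:article-42/v2"))
```

The point of the sketch is only that the relationship itself carries meaning a search engine's syntactic link does not: a harvester could follow "isSupportedBy" to data and "isRevisionOf" to earlier versions.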
We've moved into a search engine world that has gone away from a sense of trust and a sense of reproducibility of science and of scholarship. It's a very different kind of retrieval that I hope we can rethink. Second is to engage the larger life cycle. As I mentioned several times, librarians and archivists have generally waited until other people were done with something: until it was published, until the scientist was done with the data, until the records had gone to the end stage of the government agency, before they take them over. But that model doesn't work in this environment. Information professionals need to be engaged throughout the entire process and to partner much more closely with domain experts. If we're going to manage data, we need to understand the contexts around them. My graduate students, several of whom are here, really are out working in the field with people. I have wonderful stories of one falling into quicksand following water quality projects around, another getting altitude sickness in Peru laying a seismic network, and so on. They really understand what these data are and where they come from. We need to have more of that kind of engagement in our education and in the way that we partner with professionals. We need domain experts. We need to partner with research teams, and we need to stop separating data from practice if there's any hope of really getting reuse or reproducibility, much less achieving that standard from the Open Archival Information System, the OAIS standard, which says "independently understandable" is a basic requirement. That's a very, very high bar to achieve, and it's not one that you can possibly achieve if you only show up with a bag at the end of the day. Thirdly is to distribute the architecture. Data are scaling much faster than storage capacity or the pipes of the internet, and that will happen for the foreseeable future.
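That "independently understandable" bar can be sketched as a simple check: a record only passes if it carries its own context, not just its numbers. This is a crude, hypothetical stand-in for the OAIS concept, not anything from the actual standard; every field name below is invented for illustration.

```python
# Sketch: a data record that bundles measurements with the contextual
# ("tacit") metadata -- deployment, instrument, settings -- without which
# the numbers cannot be reused. All field names are hypothetical.

REQUIRED_CONTEXT = {"instrument", "deployment_site", "calibration", "collected_on"}

def is_independently_understandable(record):
    """Crude stand-in for the OAIS 'independently understandable' test:
    the record must carry its own context, not just its numbers."""
    return REQUIRED_CONTEXT.issubset(record.get("context", {}).keys())

bare = {"values": [7.1, 7.3, 6.9]}  # just the numbers in the spreadsheet
rich = {"values": [7.1, 7.3, 6.9],
        "context": {"instrument": "pH probe #3",
                    "deployment_site": "stream gauge 12",
                    "calibration": "2-point, pH 4/7 buffers",
                    "collected_on": "2011-03-15"}}

print(is_independently_understandable(bare))   # False
print(is_independently_understandable(rich))   # True
```

The bare record is exactly the "bag at the end of the day": the values survive, but nobody downstream can tell what they measure. Capturing the context fields is what requires being embedded in the practice while the data are collected.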
As of the February 11th issue of Science on data, we now have more data than we have storage capacity. If there ever was a time when saving it all was an option, that's gone. What that also means, though, is that the old information retrieval notion of surrogates is back, because you can move surrogates around the network in ways that you can't move big amounts of data around the network. Things like Object Reuse and Exchange are built on returning to this notion of surrogates for discovery. Metadata are really important, and there's quite a bit that we can do with them. Similarly, we don't have the capacity in the pipes. The astronomers that we're working with, when they want to move terabytes of data from the east coast to the west coast of the United States, put terabyte disks in 100-pound boxes and send them by FedEx. That is the fastest way to move large amounts of data from the east coast to the west coast. And because of that, they're not trying to move large amounts of data to where their computation is. They're trying to move computation to the data. Now, libraries are barely beginning to deal with the notion of taking data at the end of a project. They're not really in a position to become major computing centers to process those data. Which means that we need to bring together the data and the assets and the stewardship in some new ways. We need to think through the economic conditions to share the access and the assets. We've got a huge duplication of effort, and at the same time, we've got a number of economic issues of incentives and sharing; everyone wants access to these data. But it is the tragedy of the commons problem. If the data are the grazing land in the middle of the town, who is going to pay for getting enough grazing land to accommodate everyone's sheep, or everyone's computation, or everyone's reuse of these data? It's a classic economic problem, and I think we don't have enough economists in the room.
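The FedEx point can be made concrete with back-of-envelope arithmetic. The figures below (100 terabytes, a 1 Gbps link running at 50 percent effective throughput) are illustrative assumptions, not measured numbers from the astronomy project.

```python
# Back-of-envelope arithmetic behind "FedEx beats the network."
# The numbers here are illustrative assumptions, not measured figures.

def network_transfer_hours(terabytes, gbps, efficiency=0.5):
    """Hours to move `terabytes` over a `gbps` link at a given effective
    utilization (shared links rarely run at their full rated speed)."""
    bits = terabytes * 8e12               # terabytes -> bits (1 TB = 10^12 bytes)
    seconds = bits / (gbps * 1e9 * efficiency)
    return seconds / 3600

# 100 TB over a 1 Gbps link at 50% effective throughput:
hours = network_transfer_hours(100, 1.0)
print(round(hours))   # 444 hours -- over two weeks on the wire
# Overnight shipping of the same disks: roughly a day, regardless of size.
```

Under these assumptions the network takes weeks while the box of disks takes a day, which is exactly why moving computation to the data, rather than data to the computation, becomes the sensible architecture.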
We certainly don't have enough economics training in most of our information professional programs, but that's another thing we need to think about: the economics, the technology, the policy, and the behavior of how we're going to design an architecture that brings these together. The fourth and final challenge that I offer to you is the need to match the policy to the incentives. This was the topic of the talk that I gave last September in Beijing for the joint meeting of North American libraries and China's national libraries on sharing digital resources: we have not dealt well with matching the policy to the incentives. We're really pushing for getting people to save their data, to manage their data, without a clear understanding of why they should do that. Reuse and reproducibility appear to be the most important reasons, and if that indeed is the case, then those should be what drives the policy to think about what's worth keeping. Our mission in the Data Conservancy, the big DataNet project that I'm part of and that Sayeed Choudhury, who's here, is the PI for, is to treat data curation as a means to advance scientific progress rather than viewing it as an end in itself. And that requires a very different kind of engagement with the community, and we've divided up responsibilities for observational data in a number of ways. UCLA is dealing with just the astronomy, Mary Marlino here is dealing with some of the earth sciences, the Illinois group is dealing with some of the biosciences, and the Marine Biological Laboratory at Woods Hole is dealing with yet more different kinds of data. But it certainly colors the way that you would make these policy decisions. In working with different kinds of scientists in different communities over the last decade, it's very clear: they don't want to manage their data, they want to use their data.
It's expensive, it's not well rewarded, and it's highly inconsistent to manage those data, and unless we align those incentives a bit better than we have, we're going to get data management plans that meet the letter of the law but not the spirit of the law. The selection matters and the stewardship matters. What matters most to our astronomers, for example, is access to big stores of astronomy data where you have astronomers nearby who understand where those data came from, what the contextual issues were, what the instrument readings were, what the local conditions were on those nights in those skies, so that they can make sense of them. So this is another area where we should be concerned about the tragedy of the commons if we're not going to bring these data together and invest in them in some very useful ways. So that is the summary of the four, and what I want to leave you with is this: we need to think collectively about how to take back information retrieval. The search engines are a necessary but by no means a sufficient condition for the future of access to networked information, certainly not for scholarship. We need to engage the entire information life cycle if we're going to understand what that information, what those data are, and make them useful to anyone in the future. We need to think collectively about the architecture. We cannot save everything. We cannot store everything. We can't even move it from place to place. So where are we going to make the investments? Who's going to make them, and how are we going to share them? And then lastly, we all need to be thinking about policy and incentives and what the interests are of the people who produce and hold those data and those publications. So this is a research agenda. It's a policy and action agenda, but it's also an agenda for education.
It's the kinds of issues that need to be in our master's programs and our doctoral programs, and increasingly in the undergraduate programs of the information field, ones that will produce another kind of newly educated adult who may not necessarily find a career in the information world per se, but who will find these a very useful set of skills. And I think if we can address these going forward, we will have built upon the vision of Paul Evan Peters and on the foundation and the infrastructure of the previous winners of the award that honors him. So it's up to all of us to take it forward, and may we be good ancestors. With thanks to my many co-authors; there were so many I just turned them into a word cloud. One does not do this kind of research alone, and the list of current students is quite long as well, and I hope that I can offer you some very interesting, smart, productive, and well educated master's and doctoral graduates in the years going forward. Thank you. I have some time for questions. There are microphones in the middle of the room. Here we go. And please identify yourselves. I'm David Rosenthal from the LOCKSS program at Stanford, and I guess I want to be a little bit of a devil's advocate here. You're arguing for the inadequacy of the generic means of access to information, and I agree that they're unsatisfactory, but they're hard to compete with because they have a business model that works and everything else doesn't. And the amounts that we can spend on providing specialized information access tools are insignificant in the R&D budgets of Google and so on. So even in relatively specialized areas, they can actually afford to invest more than we can, and they can afford to employ better programmers than we can to spend that R&D on. So they're just extraordinarily difficult to compete with, and it seems that without some equivalently effective business model, this is all very nice but doomed to failure. Thank you, David.
And you certainly have argued on the economic fronts here very articulately as well. And that's why I say the search engines are necessary but not sufficient. We're not going to have a large industry, and the fact that there's not a large industry to produce fabulous engines for astronomy or archaeology I think is a fact of life. So how is it that we get coordinated action? Do we take on better open source projects? Do we change the way that we look at research funding? Do we build in more tools and services over and above curation? The fact that a place like UCLA, for as much as we seem to be suffering in the budget cuts, can still support 75 email systems, and we have managed to cut 26 course management systems down to, I think, 12 or 15 by going with Moodle, still says there's a lot of room for us to bring things together and decide where the priorities are across the community. And I think we're seeing that in research, when we sit down and talk to some of these groups and ask them which information problems are solved, which ones are not, and which they think are the big problems going forward. I think if we come out of DataNet being able to identify where the high value is, if we can come out and say if we invest here, here, and here we will get some big innovative payoffs, those are the kinds of things we need to make some decisions about going forward. So I think I'm more optimistic than you are, but I think we need to work together. Thank you. Your slide of the co-authors and funders made me wonder whether you could comment on the emerging skill sets, or otherwise, that are going to get us to where you've been talking about. The emerging skill sets of information professionals and graduates? They certainly need more information retrieval than I think they're getting in most programs right now. They need to understand information organization.
In fact, Bob Glushko at Berkeley and I are working with our students from both Berkeley and UCLA on a new foundational textbook on information organization and retrieval that is trying to bring together perspectives from libraries, archives, business, and computing, because we need to put people in many different kinds of institutions. So I think we need people who understand organization and retrieval from more than just a bibliographic perspective. That would be one set of skills. They need to understand more policy, more information policy; they need more economics; and they certainly need more technical skills than most of them are getting right now. I would like to see more people who have domain expertise, whether it's a strong bachelor's degree or another graduate degree in an area. If we're going to have these kinds of partnerships with other groups, we need to leverage those. So whether it's specialties in bioinformatics or specialties in law: we get a large number of people who dropped out of being JDs and partners at places and want to come into the field, and we certainly welcome them. That intellectual property negotiation is another great set of skills that we teach; in fact, we've added a course in intellectual property to the curriculum with a very fine adjunct we have. Clem Guthro, Colby College in Maine. I was intrigued and perhaps a bit unsettled by your comments on the fact that data is scaling faster than we can save it. Do you see that getting any better, or is it only going to get worse? Especially in light of the NSF mandates that we have to in some way save this data and manage it. It seems like we're sort of at odds in our ability to do that. I want to make sure I understand your question. So you're concerned about the scaling, or how we're going to manage it at scale? Both. The scaling is a reality.
And scale is something that's not going to go away; it's going to be a continuous chase, with bigger disks and faster, fatter pipes. It's like the 405 freeway between here and LA: the more lanes you add, the more traffic it's going to attract. That's not going to change. But we also need to make a distinction between data management and data storage. NSF has asked for data management plans, not data storage plans and not data curation plans. Throwing it away is a management strategy. And I think the tendency to equate data management with data curation and data storage is a concern. It's one where we need to bring back that expertise in selection and appraisal. Archivists appraise a body of things to say which ones to keep. Librarians tend to select things out of the world of what's published: I want this one, this one, and this one. In the data environment, it's a snapshot in time; there's no real equivalent. And it's not just deciding what data to keep, it's what goes with it. We're seeing, for example, that what the scientist considers to be the data might just be those numbers in the spreadsheet. But those numbers are meaningless unless you've got a record of the deployment, of the instrument, of the settings, of the dilution level, of the pH level, all kinds of other things that sort of everybody knows. That's a kind of tacit background knowledge that doesn't get written down. So part of what's data to one person is different from what's data to another, some of which could be reused and some not. So it's understanding that whole collection, deciding which of it is worth keeping, and working with the community to decide what can be thrown away, because some of it is just going to have to get thrown away. Hi, I'm William Gunn. I'm here with Mendeley Research, and I'm not a librarian of any sort. I'm actually a researcher.
I've generated tons of poorly curated data in my time, so I hope my perspective will be a little bit interesting. The talk was really a fascinating long-term overview, and I had two particular moments where I thought, oh yeah, this is great, she really does get it. And then I had another moment where I was like, oh wait, maybe she doesn't. I want to hear that one. So the point that I was really happy to see was when you started talking about data and data curation and what we're going to do with all this data: in the data science community, with the Hadoop tools, they talk about moving computation to the data all the time. The part that was a little unsettling was voiced by a gentleman earlier. It's the taking back, and let's be the experts in search again. Certainly a lot of the learning and the tools and stuff that the search experts are using now were created by these other folks earlier, people who were information professionals to start with. I'm not singling anyone out for that, but it definitely seems like, with all of this poorly structured, poorly curated data out there, that's really the big need that only someone who has experience with managing large amounts of information of different kinds can actually address. So I don't know if there's a question in there somewhere. What I think I'm hearing as the kernel of that is: you need experience managing large amounts of data, and you need to understand the domain of the data. Right. I completely agree, and maybe I didn't say that explicitly enough. It's been very striking in the areas that we're working with (and I serve on CODATA, and I'm on the National Academies Board on Research Data and Information, which deals with these things quite intently as well) that you need to know a lot about astronomy to manage astronomy data. It's very hard for somebody with an art history degree to manage astronomy data, and vice versa. There's a huge amount of domain knowledge.
There's a lot of hand-waving. There are huge amounts of tacit knowledge that don't get written down, and if you're going to get anywhere near independently understandable later, you need to understand the domain well enough to determine what's worth writing down and what's going to be needed to add that context around it. That is why I'm arguing for thinking through that whole life cycle, and it's why we're certainly seeing that the stewardship needs those people next to it. The filtering is done in different ways in different fields. In high energy physics, as I'm sure you know, you put up a filter in that stream, and the filter says this is what I'm looking for, and if it's not what you're looking for, it falls on the floor. It never really gets saved in the first place. That's certainly how CERN functions. And then we've talked to people at SETI, the search for extraterrestrial intelligence; they've got data with bit rot because they don't want to do that filtering, since they don't know what they're looking for. So you've got very different requirements in these different fields, and it's understanding those requirements and working in a domain way. So I think we're saying the same thing, but I could do several hours' worth of talks. In fact, I teach two graduate courses just on data, data practices, and data curation, and there are several universities that are moving in this direction, and that's something else that needs to be in the curriculum. So, Dean, and then I think we should wrap up pretty soon. Yeah, Dean Krafft, Cornell University. I'm just wondering if you could comment on the difference between the challenges of big data and lots of pieces of little data. Oh, that's a nice one, Dean. Clifford has actually opined on this at some point (the University of California loves the word "opine"): the challenges of data have less to do with size than with complexity.
So physics data, which is really big numbers, huge, huge piles of stuff, may actually be a fairly small number of variables, fairly cleanly measured and fairly well understood, whereas when you get into the biological sciences, the question of what to measure is still so contested that taking care of a small amount of biological data, environmental field data about a number of different kinds of organisms, could require much more human labor and much more expertise than many times that amount, in terms of disk storage, of data in some of the physical sciences. So size matters, except when it doesn't. How's that? Okay, thanks. Well, that was quite a wonderful speech. It took us over quite a span of years and left us with a number of futures to think about. Please join me in thanking Chris Borgman.