My name is Kris Carpenter Negulescu. I'm head of the Web Group at the Internet Archive. I'm here with a number of folks whom I wanted to introduce, all of whom will be speaking to you for about 10 minutes each today. And then we really want this session to be interactive, so we're hoping that you'll interrupt us while we're presenting and that we'll also have an opportunity for some interactive dialogue at the end of the session. Amongst the panelists are Martin Kalfatovic, who is with the Smithsonian Institution Libraries; Dean Krafft, who is with Cornell University; and MacKenzie Smith at MIT. Now, we wanted to preface this session a little bit. The description obviously was designed to inspire some enthusiasm and some discussion, but we wanted to caution you that this is really the beginning of the conversation. We are not coming to you with packaged solutions and answers at this stage of the game, but we're hoping that we can spark some ideas and some directions and inspire some collaborations amongst the institutions within the CNI network. So we're gonna start off with Dean Krafft.

I'm gonna talk about VIVO, which is a faculty profiling system, and about the relationship of the VIVO system with linked open data. VIVO was originally developed at Cornell University in about 2004. It was originally implemented as a relational database system, although building on Semantic Web ideas, and then it was re-implemented using actual Semantic Web tools and technologies in 2007. It's a faculty and researcher profiling system and now covers all faculty, researchers, and disciplines at Cornell University. It was implemented at the University of Florida in 2007, and the underlying system, which is really an ontology editor and presentation system, is being used by other folks around the world. Recently, in September 2009, there was a large National Institutes of Health grant for us to develop and implement VIVO at seven institutions and create, potentially, a national network for research scientists. VIVO data is stored internally as RDF triples. It uses the shared VIVO core ontology, which I'll talk about a little more in a minute, to describe people, organizations, activities, publications, events, interests, grants, news releases, and all sorts of other information: really a lot of context about people and researchers, how they interrelate, and what they do. The VIVO core ontology extends existing, somewhat widely used linked open data ontologies: FOAF (Friend of a Friend) and the Bibliographic Ontology. The system will also support local ontology extensions, so if you have local information that you wanna add, you can do that; and if you wanna build out additional information about an area, you can do that and re-release it. We're also intending to develop mappings into other standard ontologies as those standards develop. So here's just an example. You can see the researcher, Susan Riha, there, with a bunch of relationships to news releases, to papers that she's authored, to areas she works in. This is the kind of information that we maintain about a researcher or faculty member in the system. And here is a small subset of the VIVO ontology. You can see FOAF, the part in green, and you can see we get a lot more detailed.
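To make the triple representation concrete, here is a minimal sketch in Python with rdflib of how a profile like that extends FOAF with finer-grained terms. The individual URIs and the vivo: class and property names here are illustrative stand-ins, not the actual VIVO core terms:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

# Illustrative namespaces; the real VIVO core terms differ in detail.
VIVO = Namespace("http://vivoweb.org/ontology/core#")
EX = Namespace("http://vivo.example.edu/individual/")

g = Graph()
g.bind("foaf", FOAF)
g.bind("vivo", VIVO)

researcher = EX["susan_riha"]   # made-up URI for the example
paper = EX["paper123"]

# FOAF supplies the generic person vocabulary...
g.add((researcher, RDF.type, FOAF.Person))
g.add((researcher, FOAF.name, Literal("Susan Riha")))

# ...while the core ontology layers on the research-specific context
# (hypothetical class and property names, for illustration only).
g.add((paper, RDF.type, VIVO["AcademicArticle"]))
g.add((researcher, VIVO["authorOf"], paper))

print(g.serialize(format="turtle"))
```

The point of the layering is that FOAF carries the generic person description other linked data consumers already understand, while the core ontology adds the finer-grained research relationships on top.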
One of the main differences between our ontology and the widely used public ones is that we have a much higher level of detail and granularity in the descriptions we're providing for people and in the relationships that we're establishing. Okay, so how do you get our linked open data from VIVO? Right now you can get the RDF, all the triples, associated with a single individual by just using an RDF browser or making the right HTTP call to the individual's URI. And I will say that just became available in the most recent release of VIVO; it's implemented at Cornell and it's being implemented at the other sites. The next release, which we're hoping to have out by the end of January, will have an RDF-enabled index to allow crawling of all the instances, all the individual researchers, faculty, organizations, or other pieces of information that we maintain, and access to all the RDF. One open question is: should we make available some sort of site map for our RDF information? Should we just make available a nightly snapshot that lets people download everything we have to say about Cornell? We do not make available a public SPARQL endpoint. SPARQL, if you're not familiar with it, is like SQL for the Semantic Web, and it's really easy to write queries against a SPARQL endpoint that can cause lots of problems, so endpoints are not that reliable and really too vulnerable to make broadly available. Here's an RDF call, again a little piece of it, for what you get back if you go to my VIVO Cornell page and ask for the RDF. Down at the bottom, I don't know if you can see it, you can see the presentations I've given; it talks about my status, relationships expressing my status within the institution, and grants that I'm the PI on. Okay, VIVO enables authoritative data about researchers to join the linked data cloud. As institutions implement VIVO and make it available, that information is public and accessible and can become part of the cloud, and you can see our little VIVO circles in that yellow box there on the slide; we made it into the latest release of the cloud diagram. Okay, if we are authoritative and we're part of the linked open data cloud, what is it that we link to? Well, a kind of limited set at the moment. We have geographic information about where a research project takes place or what a news release refers to, and those geographic URIs link to the geographic information in DBpedia. Right now, we're a bit reluctant to link to uncontrolled information. I mean, we could link to entries in Wikipedia about our faculty and simply say these are in fact the same. We're a little reluctant to do that, given that we're providing authoritative data about people. But of course, anybody's welcome to take our information and assert those relationships in some independent system. We will be asserting sameAs with known author identifier systems like ORCID. We're also very willing to link to publications in authoritative repositories and to link grants to authoritative sources such as NIH RePORTER; they will need to provide permanent URIs, which some of these systems don't do at the moment, to say nothing of actual RDF. And we're willing to link to things like the MeSH subject headings or other controlled vocabularies; they do at least have fixed URIs now, but not RDF.
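Before moving on: as a rough sketch of the access pattern described above, you can dereference a profile URI into a local graph and run SPARQL against it locally, which is exactly the kind of use that doesn't depend on the institution exposing a public endpoint. The URI below is made up, and the response is assumed to be RDF/XML:

```python
from rdflib import Graph

g = Graph()
# Dereference an individual's URI; a linked-data-aware server returns RDF.
# The URI is illustrative, not a real VIVO individual.
g.parse("http://vivo.example.edu/individual/n1234", format="xml")

# Query the harvested triples locally rather than hitting a shared public
# SPARQL endpoint, sidestepping the reliability concerns described above.
query = """
    SELECT ?property ?value
    WHERE { <http://vivo.example.edu/individual/n1234> ?property ?value }
"""
for row in g.query(query):
    print(row.property, row.value)
```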
So what are some of the challenges we face as we make VIVO information available as part of the cloud? One that's come up with several institutions is privacy. Fortunately, Cornell is an opt-out institution for making your directory and other information available. There are institutions that are opt-in, where basically you have to get the faculty member to give permission before you can publish any information about them. And there's a concern that if you aggregate a lot of information about people or an institution, just by doing that you can reach some sort of privacy limit. Another challenge is combining our stuff with other linked open data ontologies. I mentioned that we provide highly granular sets of relationships and information. Our approach is basically: pick the things that are really obvious to extend and use, and don't sweat it. Make our own information available; if other people want to use it, they can do the mapping from our detailed and published ontologies to those other systems. There's also an interesting presentation challenge. Faculty often want to give you, say, a selected list of publications, or to suppress certain grants or other things that might be part of their official record. So in the presentation of information they may have selected stuff, but right now we're making everything available as part of the linked open data. You'll see all their publications, all their grants, all that information, and it's a little challenging to decide how to present that in this RDF linked open data format. Dirty data is another challenge area. We've just spent six months disambiguating and cleaning up data from Digital Measures Activity Insight, which is used by about half of the colleges at Cornell. The problem there is that they do offer options for structured information, but they also offer free-text options for almost every field, and we have to figure out how to turn that into structured information. In the publications, authors are identified just by name, so we've got to try to match those names up with existing co-authors at Cornell or at other institutions. The bottom line is that it can be a lot of work to take unstructured information coming from a lot of sources and turn it into structured information that you really can share and easily make available. Provenance is another concern for linked open data. We draw on multiple sources: faculty can provide their own information, we draw on sources of record from the university, and we draw on external sources like PubMed. Within the VIVO system, we actually maintain separate internal graphs so we can find out the provenance of any statement about an individual. But there's a challenge as to how we make that available. We can't do it just in the RDF; we can potentially do it through an API that we would write for our system. Finally, temporality, if that's a word. It's easy to state a fact in RDF. It's a lot harder to state when it was true. People are constantly changing relationships, grants, and other things, and expressing that can be a bit of a challenge.
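On the provenance point: a minimal sketch of that per-source graph idea, using rdflib's Dataset. The graph names and statements here are invented for illustration:

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import FOAF

EX = Namespace("http://vivo.example.edu/individual/")

ds = Dataset()

# Keep statements from each source of record in their own named graph,
# so every triple's provenance can be recovered later (names invented).
hr = ds.graph(URIRef("urn:example:source:hr-system"))
pubmed = ds.graph(URIRef("urn:example:source:pubmed-feed"))

hr.add((EX["jdoe"], FOAF.name, Literal("Jane Doe")))
pubmed.add((EX["jdoe"], FOAF.made, EX["paper42"]))

# Ask which graph, i.e. which source, asserted each statement.
for ctx in ds.contexts():
    for s, p, o in ctx.triples((EX["jdoe"], None, None)):
        print(ctx.identifier, p, o)
```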
Opportunities: there are lots of them. This really makes the information much more usable, both for people outside and for people within Cornell. We're reusing it within the institution: individual department websites are now drawing on the VIVO information to populate their pages. We have portals for impact statements, where you can find out, for a New York county or for a country in the world, all the research, publications, and researchers targeted on that particular place. Another is graduate education in the life sciences. In the life sciences we have faculty scattered over all sorts of different colleges, fields, and departments, and a graduate student or faculty member coming in can have a lot of trouble finding an individual. This re-sorts all that data and makes it available in a way that's easy to associate with a research interest. Extensibility: I mentioned that you can make local additions to the VIVO ontology. The University of Melbourne has created a data registry system, with a built-in ontology for data sets, on top of VIVO, and we're gonna take that ontology and reincorporate it into the VIVO core. Integration with other systems: there are lots of exciting tools out there now. We've used the OpenCalais developer libraries to process Cornell news releases and tag all the individuals, departments, and other information with their VIVO URIs. Ideally, and we're working with our communications department on this, when you publish something, people would be able to get the context of the individual, the department, the project, whatever. And finally, VIVO is a source: it delivers faculty and researcher profiles and context for other systems. If you're building a system of research resources or a data curation system or whatever, you can draw on this to associate things with a particular faculty member or researcher, and again get all that context, so outside systems can draw on this VIVO information. And that's my presentation. Oh, actually, we were gonna say: if you have specific questions on what I've just talked about, I'm happy to take them now, or we can save them for the discussion at the end. Any specific VIVO questions? Yeah, I don't know if you wanna go for the mic; we are recording this.

Maybe I wanna rethink my question. I'm Harriette Hemmasi, Brown University. You were so honest and forthright in your discussion of the challenges, and I wonder, overall, how you actually feel about the system. I have a particular reason for asking, because Brown is interested in becoming a VIVO partner. So I wonder if you could speak to that.

So certainly within a single institution, which I think is where Cornell started: I mean, it wasn't until we got the NIH grant, very recently, and started looking at the idea of building a national network of information to develop collaborations, that we considered publishing this as linked open data. And I think that's where many of the challenges arise. Within a single institution, we found this system to work very well, because we can control the ontology, we can control the way the information is presented, and we can draw from the databases of record of the institution. There's certainly work involved in all of that, particularly taking PeopleSoft, your grants database, whatever, and pulling that into VIVO. But once it's there, you can make a lot of interesting use of it; I talked about some of the repurposing. So in that sense, I think VIVO, looked at within the institution, is pretty solid. It's when you turn outward and make the information more broadly available that you start to get into more of these questions. I mentioned provenance: within our system we have no problem establishing the provenance, but as we expose and publish the information to be used by others, that becomes more of an issue. Anything else? All right, who's up next?

Good afternoon. The title of my presentation is almost more of a challenge internally to the Smithsonian than a call to action for the audience members.
Many of you know that the Smithsonian has a number of facilities here on the Mall. It's historically known as the octopus on the Mall, as we have acquired museums and research centers here in the Washington area. But buried deep within all of these Smithsonian facilities and research outposts is a vast amount of information that we need to get out to a wider community, and it's currently buried in a lot of silos. One of the things we're looking at is using linked data concepts to bring this material out into a more open arena where it can play more easily with others. So here's a nice aerial shot of the Washington, DC area, and a lot of those areas all up and down along the Mall are filled with Smithsonian facilities. I made up a quick little graphic here to show some of the linkages and connections. You can see all of the different facilities along the Mall, the museums visited by yourselves and many others. But we also have a number of other research facilities: formal, large, unit-type research facilities such as the Astrophysical Observatory in Cambridge, Massachusetts, the Tropical Research Institute in the Republic of Panama, and the Environmental Research Center out in Edgewater, Maryland. Additionally, we have formal Smithsonian research facilities in places like Arizona, at another observatory there; in Mpala, Kenya; and in various places in Florida, at several coral research stations. Beyond that, there are thousands of Smithsonian researchers traveling around the world, doing research at affiliated partners or on co-appointments at various other universities. So how do we bring all of this information together in places where we can get it out there for more useful purposes? I like the images of the cosmographic star maps. I think of all of this research as star points that we need to bring together into a coherent structure in some way, and the ways those structures are brought together are gonna depend on your point of view and what you're looking for. So depending on what you're looking at, you'll see this data in all sorts of different ways. The first example I wanna use, one of our first stabs at creating data in a more open fashion, is a project spearheaded by the Libraries: the Biodiversity Heritage Library project. This is a consortium of 12 different libraries in North America and the UK, as well as a number of global partners now in China, Australia, Europe, and a place I always forget. Did I say Australia already? Australia. The purpose of this is to bring together the digitization of all taxonomic research, descriptions of species. And one of the things we've found is that how you link into species can be very dependent on your specific subdiscipline within biology. So, for instance, botanists link in at a description level and zoologists link in at an article level. How can we create linkages within this data that meet all of the different needs of the communities, while staying within the bounds of our resources? Right now we have about 32 million pages of text within the Biodiversity Heritage Library. The main linkages that we are able to create are through the species names.
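As a toy illustration of that kind of name-finding (the production name-recognition services BHL relies on are far more sophisticated, with dictionaries of known genera and handling for abbreviations, authorities, and OCR errors), a simple pattern can flag candidate Linnaean binomials in page text:

```python
import re

# Candidate binomial: capitalized genus plus lowercase specific epithet.
# A toy heuristic only; it will produce false positives on ordinary
# capitalized-word pairs, which is why real systems check dictionaries.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

page_text = "Specimens of Quercus alba were collected near the river."
for genus, epithet in BINOMIAL.findall(page_text):
    name = f"{genus} {epithet}"
    # Each hit becomes a potential linkage point into the discovered
    # bibliography, e.g. a lookup against a names index or EOL.
    print("candidate species name:", name)
```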
So, using algorithms that locate the binomial Linnaean names, we can create what we call discovered bibliographies, which then become linkage points for things like collections information systems and other large projects like the Encyclopedia of Life, which I'll mention again in a minute. Again, these are ways in which we're creating the link points into species information for different types of projects. And here's just an example of where you can actually link to specific descriptions within a page at the species level. The Encyclopedia of Life, which is one of the partner projects within the Biodiversity Heritage Library, is again a key way that you can create these links. In the lower left, on one side of the screen, you'll see some species names that are found by the algorithm, and those create links automatically, on the fly, back into the Encyclopedia of Life. So all of this discovered bibliography at the species level gets automatically linked in with EOL. EOL itself is a vast aggregator of data, and they have a number of beta projects right now for creating linked data; these are part of the 20% time of the development staff at the Encyclopedia of Life. The goal there is again to create those various link points within the Encyclopedia of Life data so that it can go back out into other things. Wikipedia is one example: data goes from EOL into Wikipedia, things happen to it in Wikipedia, and then it comes back into the Encyclopedia of Life, so there's a cross-fertilization between those two projects. What I like to think we have right now is Smithsonian in a box. Not yet available on Amazon, but I'm sure at some point it will be. We have about 137 million objects in our collections and a vast amount of data that goes along with them. But it is really pretty much wrapped up in a box right now, and the way you have to get to that box is to go to the various Smithsonian silos, get into the data, use it, navigate the often very complicated rights associated with that data in its various forms (in many cases we don't know the rights to some of that data ourselves), and then figure out how exactly you're gonna get that data out. The first thing we've tried to do about this is a project called collections.si.edu, or collection search. What we've done there is pull data records for about 6.2 million specimens and artifacts from our collections into a large database; it's searchable and indexable, and you can create widgets into it. But again, it's still a standalone box: there are still no open APIs into it, it's still not an open data source, and it's still not a linked data source. But it becomes the first step in exposing a lot of that data in ways that aren't locked up in some of the proprietary collections information systems we use, and we have four different collections information systems at the Smithsonian that don't necessarily talk to each other. One of our hopes for exposing things in a much more open fashion is the Smithsonian Commons project, which probably many of you have heard Michael Edson discuss in forums all around the world. The goal of Smithsonian Commons is actually to clarify a lot of those rights issues around our data.
One of the interesting things I discovered in the past year is that we actually have a hard time applying Creative Commons licenses to a lot of our material: since much of our data is created by federal employees, we can't actually claim rights, because it's all automatically public domain, so our lawyers have a hard time giving us permission to put Creative Commons licensing on a lot of our data. But of course the researchers wanna see the CC license on something, and we can't do that. So again, we're trying to iron out some of those things and find ways that we can do that. Smithsonian Commons is one of the places where we can actually put that data out there in clear ways. And in the discussions with Mike Edson, one of the things I really wanna do is make sure that, in addition to the pretty pictures and other things, we have other types of data sources available, specifically the scientific data, so that it's more openly available and clear-cut as to what people can do and how they can access the data in various ways. In terms of our linkages, what I'm thinking right now is that we're still over here on the side with the primitive bat-powered rocket ship that would take people to the moon in 1836, when this lithograph was created. We're still a lot of bat people trying to pull this craft up to the moon. And what I'd like to get us to in the next year or so is over to Jules Verne's From the Earth to the Moon, where we have a large projectile that's doing a little bit of a better job of getting right there. So my questions for all of you are: what types of things would you want from Smithsonian data? How would you like to access Smithsonian data? How would you like to link, interact, and work with Smithsonian data? And how do you want to build your own bear? Thanks. If there are no questions now, we can move on to Chris.

So you've heard a few use cases from a range of institutions that are dealing with the pragmatic issues around data in general, and specifically in the context of trying to open it up and make it available to link. I wanted to talk to you today about a slightly different use case, one we arrived at from a slightly different place: specifically, the Civil War Data 150 project. The sesquicentennial anniversary of the Civil War begins next April, and there's going to be a five-year recognition of the anniversary. It's a project we got involved with alongside a wide range of institutions, primarily at the state and local level: historical societies, archives, and libraries operating in communities that have not only very interested scholarly communities but also your average armchair historian, who is incredibly passionate about a particular topic. Why did the Internet Archive get involved with this? Well, one reason is that we stepped up and said we'd like to figure out how to do something innovative: to put together a program with a little bit of longevity, not a really short window of time, long enough that we could engage and collaborate with all of these various institutional partners, but also produce something lasting, so that there would be ongoing resources and, hopefully, some best practices and learning that the community could share.
Part of our motivation was that we don't do a very good job as an organization of linking our own data. You can mine our TV archives or our web archives or our image data sets or our film collections, but if you're trying to locate material from a thematic or topical perspective, it's very difficult to find the broad spectrum of resources available. So part of the motivation was to test out some of these theories and practices within our own institution and figure out: what were the barriers? Why were we not able to get going and provide more open linked data sets to the community? The Civil War project has two goals. One is certainly to figure out how to facilitate collaboration across very diverse organizations: to pull together resources in a manner that does not require the resources themselves to move around. The idea is that you take advantage of the distributed nature of the data and use that as a benefit rather than a negative. Instead of the end user having to go to every single individual repository to discover that these things exist, you can create services that link them together in meaningful ways. But at the same time, once that discovery occurs, you want to drive the user into the repository that has the resource they really want to get access to. So we're trying to strike a balance between letting people learn about the availability and presence of a resource, and ensuring that the institution that houses, preserves, and maintains that data keeps a relationship with the community that's accessing those materials. We came up with two primary projects that we wanted to experiment with. One was whether we could create a really low barrier to participation, in terms of how institutions aggregate data, that would allow us to assemble applications; for example, we have proposed the idea of tweeting on every day of the anniversary and linking to relevant resources and information that would be of interest to individuals relating to that day at that point in history. Another application that we wanted to use to illustrate the value of having these kinds of open data sets came from the fact that a lot of the photographic resources have very limited information about them. For example, they may just have a very generic name or title associated with the negative that was contributed to a particular collection. Sometimes there's very rich and robust metadata associated with a resource, but it runs the full gamut, and often there's very little information about the people, the place, or the exact period of time. So we put together two very fundamental goals to try to increase the information relating to these resources. The first is that we'd like to associate a resource with a place, and to do that in a compelling way that allows us to automate display via a map. You could then take different slices of your particular interest: it might be an individual soldier's migration over the course of the war through different regiments, or specific battles that they might have experienced. The idea is that we wanna be able to associate, very specifically, places with objects and also people with objects. The concept of marrying genealogy with these really rich photographic resources was a primary driver behind the project.
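A minimal sketch of what those place-to-object and person-to-object associations might look like as triples, in Python with rdflib. The project namespace, property names, identifiers, and coordinates are all invented for illustration; the W3C WGS84 geo vocabulary is real:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

CW = Namespace("http://example.org/cw150/")                  # invented
GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")  # W3C vocab

g = Graph()
photo = CW["photo/0001"]
soldier = CW["person/john_smith"]
place = CW["place/antietam"]

g.add((photo, RDF.type, FOAF.Image))
g.add((photo, FOAF.depicts, soldier))   # person-to-object linkage
g.add((photo, CW["takenAt"], place))    # hypothetical property
g.add((place, GEO.lat, Literal("39.47")))
g.add((place, GEO.long, Literal("-77.74")))

# With coordinates on every place, a map view can be generated
# automatically for any slice of the data (a soldier's movements,
# a single battle, and so on).
print(g.serialize(format="turtle"))
```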
To give you a little bit of an idea of how we plan to move forward with this: right now we're asking institutions to contribute data in whatever format they can. In general, we're finding that the lowest common denominator is usually CSV files, so we're likely to get the majority of our data in CSV. We will get some data in XML, based on the content management and publishing systems already in place at many of these state and local institutions. And if we're lucky, we'll get a little bit of data already in an RDF format. The ultimate goal is not to put the onus at this stage on the institution to produce a specific data format, but to handle some of that initial massaging on our end, and then come up with a set of agreements and best practices with the individual institutions regarding how they would like to receive information back, in terms of enhanced metadata on particular resources, and how we actually receive updates from the institution. Because just as we're using this anniversary as an opportunity to push this program forward, many of these libraries, archives, and historical societies are doing the same. They've gone out to their funding communities and said: this is a major event; we'd like to digitize this additional percentage of our collections, or we'd like to make this additional amount of resources available through these services. So we know that over a period of time we're going to be receiving constant updates, and we will also be supplying them back. Part of the goal, in terms of where the enhancement comes from, is being able to leverage crowdsourcing to supply some of the additional information. And we would ultimately like to leverage some of the commercial services that have already deployed open source tools and resources, which make it easy for individuals who are contributing data to be very specific, for example, in their selection of geographic data, so that it's much easier to automate the mapping at the end of the day. Also, from an institutional perspective, we want a mechanism for gating who is contributing and the quality of their contributions. So, without creating a negative relationship with the public, we want to be able to measure the quality of specific inputs and assertions made by individual contributors, and weight those so that we're getting at least, say, three to five votes, if you will, for a particular assertion: that this photograph was taken on this battlefield on this day, and these individuals are in that photo. The idea is that we know, in more automated ways, whether individual contributors are giving us good information or bad information, and over time we can weed out the individual contributors who may not be meeting our quality standards. There was a lot of institutional concern about how we make sure that the data we're receiving back is of significant quality. Some of this has come from experience working with Flickr and knowing that, while a lot of the comments submitted for specific resources are of high quality, a lot of them are, for lack of a better description, crap: things that don't add any value to the resource, or might even be false information, which is obviously not a goal at the end of the day.
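A toy sketch of that vote-weighting idea: each assertion accumulates reputation-weighted votes from contributors and is accepted only once it clears a threshold. The names, weights, and threshold here are arbitrary:

```python
from collections import defaultdict

# Contributor reputations, adjusted over time as their past
# contributions are validated or rejected (values arbitrary).
reputation = {"alice": 1.0, "bob": 0.9, "mallory": 0.2}

votes = defaultdict(float)      # assertion id -> weighted vote total
supporters = defaultdict(set)   # assertion id -> who has voted

def vote(assertion_id, contributor):
    """Record one contributor's agreement with an assertion."""
    if contributor not in supporters[assertion_id]:
        supporters[assertion_id].add(contributor)
        votes[assertion_id] += reputation.get(contributor, 0.5)

def accepted(assertion_id, threshold=3.0):
    """Accept once the equivalent of three to five trusted votes agree."""
    return votes[assertion_id] >= threshold

vote("photo0001-taken-at-antietam", "alice")
vote("photo0001-taken-at-antietam", "bob")
print(accepted("photo0001-taken-at-antietam"))  # False: needs more votes
```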
What we're learning so far, in terms of getting this project off the ground, is that the single biggest factor limiting what we can do in providing open data and linking it is lack of awareness. There's a lack of awareness within funding communities at the state and local level, and I would argue in some cases at the federal level as well; there hasn't been enough education about what exactly is being delivered in a project of this type and what benefit comes out of the end of the process. We've identified a number of organizations that unfortunately have just fallen victim to inertia. I mean, we're one of those organizations. When you're busy, busy, busy every day with the day-to-day work of your institution, the perception is that this type of project just adds another layer of work: where's the benefit, and how do you mitigate the long-term management issues and concerns about ongoing support? There are also organizations stuck with old technologies that serve their basic needs when you're dealing with access via a web browser; but when you're trying to produce linked data that would allow another application to point into your records, and you don't have a way to give them a persistent URI to point to, those are all challenges that institutions face to varying degrees. Another big area of challenge has certainly been shared ontologies. For this particular project, we're proposing to use Dyer's Compendium as a foundation for the vocabularies that we would use to link resources. We don't know yet whether that will work. We have probably only seven data sets that we've been trying to map to the Compendium, and at this point it looks promising, but we may determine down the road, when we add ten more, that this is not a good organizing mechanism for this particular program. I know Dean mentioned this, but there's a real problem with the diversity of data and the dirty nature of the data. The amount of effort that goes into disambiguating, especially people and places, cannot be overestimated. At this point we don't know what that will mean for this project or how much time we'll end up investing in that particular exercise; but without some shared infrastructure for these types of ontologies, it falls on the working members of each individual project team, whether you're deploying a project within an institution or across institutions, to come up with these solutions and approaches. And Martin mentioned this specifically: the rights issues have also been a factor. There is a real question in the United States around whether you can license metadata, in what form, what the institutional role in that is, and, if you have third-party metadata as part of a collection, how you treat that. So there are a lot of complex layers that have to be peeled away in order to maximize the reuse of the data over time.
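On the persistent URI challenge just mentioned: one common pattern is PURL-style indirection. You mint identifiers under a domain you control and keep a redirect table pointing to wherever the record currently lives, so that when a repository migrates, only the table changes. A toy sketch with invented URLs:

```python
from typing import Optional

# Stable identifier -> current location of the record. When the
# repository moves to new software or a new host, only this table
# changes; every external link to the stable identifier keeps working.
redirects = {
    "http://id.example.org/cw150/photo/0001":
        "http://smallarchive.example.org/records/civilwar/photo-0001",
}

def resolve(persistent_uri: str) -> Optional[str]:
    """Return the current URL for a persistent identifier; an HTTP
    resolver would send this back as a 302 or 303 redirect."""
    return redirects.get(persistent_uri)

print(resolve("http://id.example.org/cw150/photo/0001"))
```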
So, to end on a more upbeat note, the good news is that when institutions do get engaged and involved, it quickly becomes clear that it's not as hard as it sounds. We're presenting some of the big challenges that you run into along the way, but at the end of the day, if you pick a collection of data, start small, and demonstrate the value of how that data can be integrated, whether it's just within your own institution, connecting your own collections and resources for the benefit of your faculty or your student population, or a broader initiative where you're trying to map across a much wider set of domains, the reality is that once you start to demonstrate the value proposition through very real, working applications that end users can interact with, the enthusiasm is there and the momentum comes along with it. We've certainly seen that inside the Archive, just in the enthusiasm of our own technical staff, who used to spend a lot of time arguing about RDF versus RDFa; the technical debates were nonstop, until you started producing a few of the interfaces on top of it, and then everyone said, oh, that's so cool. You start to build momentum around a different set of exercises, where you take the perspective of the individual who's using the application, and I think that's one of the things we've learned in this project: if we can get institutions to focus on the value proposition and the end results, benefits accrue. Another big lesson we've learned in working across very diverse institutions is the importance of flexibility and best efforts: you'll never have perfect data sets, and if individual institutions wait until they feel their data is pristine enough to hand off as a nice, clean data set, you'll never actually get to the point of exchanging data. The same goes for some of the shared services. At a certain point, you have to leap and the net appears: you trust that the partner you're working with is gonna deliver their piece of the program, and you move forward with that assumption. I know that doesn't always work in a perfect world. We are all accountable, in varying degrees and ways, for specific deliverables, so creating those dependencies on each other can sometimes be challenging; but I think in this context there are tremendous benefits to moving in that direction, versus having everyone reinvent the wheel over and over again. And I think the biggest lesson coming out of this project, at least for us, is the benefit of partnering. Until you're actually looking across a really diverse set of resources and data sets in a specific theme, in this case the Civil War, trying to understand that you're gonna have sheet music, photographic resources, texts, service records, and other types of memorabilia integrated into a whole, you can't really come up with a robust way of representing that space; and partnership certainly helps by allowing individual institutions to bring their unique areas of expertise to the program and the process.

As we draw this session to a close, I just wanted to summarize some of the points that have come out of these three case studies. There are other case studies that we could throw at you, but I think you've gotten the major themes at this point. So let me see if I can summarize, and hopefully engender some questions or comments from you about the state of this; please don't be shy about that, and think about what you'd like to know.
From my perspective, and I think you've seen this in all three of these case studies, the big value proposition for linked data in this community is the ability to aggregate data at scale, to aggregate and publish data and make our data much more interoperable than it is now, at a lower cost and with a lot less effort. So that's the vision. And what we need in order to achieve that, as you've heard from all three projects, is a tool chain, an infrastructure. That includes ways of creating and editing data in this format, in RDF, according to certain ontologies or not, depending on your point of view. Then tools to store and manage that data, like triple stores and the associated tools you need to search and manage it. And then ways to get the data to the people who wanna see it, the end users: tools to browse, view, visualize, and navigate this data. About ten years ago, we started a project at MIT called SIMILE and realized that none of that infrastructure existed, or only very rudimentary infrastructure, so we were pushing the whole tool chain forward at the same time, and I'm happy to say that things are now a lot better than they were then. I'm gonna be talking tomorrow at length, with Eric Miller, about Exhibit, which is a tool to let you visualize and navigate linked open data, so come to that if you wanna hear more about it. But the good news is that all three of these pieces of the infrastructure are much better. I'd say the weakest link right now is the ability to create and edit the data in the first place. Why? Because, just like with XML, there aren't really good generic tools that let you edit data in arbitrary structures. So it's not as if this was never a problem before and suddenly it's a problem with RDF; but we still don't have great solutions, so we end up spending a lot of time and effort converting data from legacy formats and spreadsheets and things like that into RDF. It's making progress, I would say. You heard from all of these case studies that ontologies are an issue, because finding them is kind of a problem. This was true for XML schemas as well, right? And other schemas before that. You don't want to reinvent the wheel; that's part of the whole value proposition of using RDF and linked data. But you have to be able to figure out that there's already an ontology that matches your problem, and right now we don't have very good ways of finding those or sharing them. And when you do find one, how do you know if it's any good? If you're not yourself an ontologist or an expert in that kind of thing, you might find 50 different ontologies to describe a person; how do you know which one of those you should build your entire system on? So there's a lot we might need to do in that area of sharing information about best practices in ontologies. A registry of them, or some such thing; I hate to use the R word, but we might need one. And then there are gaps in existing ontologies. Dean gave you an example of one where they found some ontologies they could use, but they were not sufficient, so he had to add to those ontologies. And then how do you share that information back out, to be able to map to other data that has similarly low-level granularity? So this cross-domain mapping is work, as it always was; it doesn't magically go away in the world of linked data. But at least we have that common substrate, the common standard, underneath it.
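Going back to that weakest link, creating the data in the first place: since so much source data starts life as spreadsheets, the conversion step often looks something like this minimal sketch (the column names, namespace, and sample row are invented):

```python
import csv
import io

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/people/")  # invented namespace

# Stand-in for a legacy spreadsheet export; real data is rarely this clean.
csv_data = io.StringIO(
    "id,name,homepage\n"
    "1,Ada Lovelace,http://example.org/ada\n"
)

g = Graph()
for row in csv.DictReader(csv_data):
    person = EX[row["id"]]
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(row["name"])))
    g.add((person, FOAF.homepage, URIRef(row["homepage"])))

print(g.serialize(format="turtle"))
```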
And then finally we have what I call the three P's of linked data, which you heard again in these three talks. One is provenance. When you're talking about linked data, you're talking about assertions on the web, a triple, and that can quickly become divorced from its source. Say the vision is that I'm a researcher and I want to aggregate records from 100 data sets, so now I have 1,000 records or whatever it is, and I need to know where each of those came from, if, for example, I need to understand how to cite its source. So this issue of provenance, and how we deal with it in a world of very disaggregated data, is a challenge that we haven't quite solved yet. Citation is a particular issue here. If people share data, they often want credit for having done that. So even if you can keep track of where the data came from, you still need a mechanism to tell the end user where it came from, like a URI for the creator. But what do you do in that situation where you have records from 100 data sets, so you have at least 100 people you need to cite as the source of the new data set? We run into this problem, called attribution stacking, very quickly. So there are some tricky technical and social challenges there, but people are working on it.
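One simple way to keep an aggregated record tied to its origin is a per-record source pointer, for example a dcterms:source triple; a sketch with invented URIs:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/aggregated/")

g = Graph()
record = EX["record42"]
origin = URIRef("http://smallarchive.example.org/dataset/7")  # invented

# Keep a per-record pointer back to the contributing dataset so the
# citation survives aggregation. With a hundred sources this list grows
# fast, which is the attribution-stacking problem described above.
g.add((record, DCTERMS.source, origin))

sources = sorted(set(g.objects(None, DCTERMS.source)))
print("cite these sources:", sources)
```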
Then there's the persistence problem. You've heard a lot about how we can tackle persistence of the data itself, preservation, but we often forget about the persistence of the metadata, and in particular the URIs that we're creating, because the whole linked data web depends on persistent URIs to the things we're referencing. And those URIs can themselves be vulnerable and break, right? If you move the collection to a new institution and it's not called .mit.edu anymore, the URI changes and all the references to it break. I'm doing some work now with Creative Commons, and they're very concerned about this issue of long-term persistence of URIs, because, I don't know if you're aware of this, but ICANN itself can take away your domain name if it feels like it. You have no long-term guarantees to that particular domain name. So even people like the Handle System or the DOI system, who are claiming long-term persistence, can't really guarantee it, and they're aware of that. This is a bigger issue than us, but we again need to make some progress on it. And then finally, the policy issue. Martin talked about the Smithsonian's collections and how hard it can sometimes be to give data away. This is the O in linked open data. One big issue is: what rights do you have over the data in the first place? Is it copyrightable at all? Are there data rights you can apply to the data when you want to share it? And how do you make that clear, so that the person who wants to consume your data knows what they can do with it, and whether they need your permission or need to give you credit for having used it? This is a big challenge that we need to get a little more sophisticated about. For example, if you're collaborating internationally: here in the US, factual data is not copyrightable, but it is copyrightable in other countries, like the UK and Australia, where they have crown copyright. So, technically speaking, you're trying to combine things that are under very different legal policies from different jurisdictions; a lot of data gets integrated anyway, and people just sort of look away and don't think about it very hard. But if the data has any commercial potential at all, you can run into horrible problems later on. Or even if it's just a matter of competition between researchers, you really want to do this right so that you don't run into big problems later on. So there are a range of licenses available, as Martin said, but figuring out which ones are best to use can be very challenging, because, as he said, government data is public domain, so they can't use a license, though they could, technically, use a waiver. And if you use the CC BY license, which says, sure, use my data, but you have to give me credit, then you run into this attribution stacking problem, where you have to give a lot of people credit, and how do we do that? And finally, the share-alike licenses that are very popular are really not interoperable, because you can't combine two data sets that are under different share-alike licenses: you'd have to apply both licenses, and you can't. So there are lots of these policy issues that get in the way. They're not unsolvable, but we really have to spend some time and effort thinking about them in order to get the vision of the linked open data web to where it needs to be. So those are some of the issues that I think are slowing progress down, but as you've heard, there's a lot of good progress being made, and I've heard of a lot of other projects out there too. So hopefully this has inspired some questions in our last five or ten minutes here, or are you just stunned by the enormity of all of this? Really? Eric, you have to.

Sorry: Eric Miller, Zepheira. I guess, I mean, the way the panel was put together (my battery died, so I don't have my exact notes) was actually quite interesting. We had three interesting examples. One was basically: how do I effectively stitch the data together within my institution so we can accelerate collaboration? Another: how do I make my data available so other people can use it in ways I didn't quite expect or intend, and thus create value on top of it? That's what I heard from the Smithsonian example. And a third: how do I create value across institutions, across organizations, around a particular topic or focal point? These are traditionally very different challenges, all looking at a common substrate to reduce costs. A common theme across all of them, at a certain level, was ontologies, policy, rights, provenance, things of that nature. Again, the nice thing is that we're trying to solve these across the board, so that we all reduce costs. But if I may offer a way of prioritizing some of that, some of the quick lessons learned: I think it goes back to showing the end users something fast, the real quick wins. So ontology and vocabulary reuse: I realize this might sound kind of odd coming from somebody who has worked on vocabulary reuse for so long, but don't worry about it; keep that at the lowest priority. I'd argue the same even for persistence, again from somebody who has worked on persistence for so long: by getting some identifiers out there, you start to realize which identifiers are actually worth investing in and which aren't, by how they're getting used. So there are a lot of interesting lessons learned, and I'm curious, from each of your particular perspectives, since you're approaching this in different ways:
What were the quick, unexpected wins that you saw early on that might help engage folks trying to get their minds around how to go from where they are toward these kinds of directions? Did you quickly find, once you integrated all of your data within Cornell, for example, that the connectivity between faculty increased in ways that made the faculty feel like investing in ways they hadn't before? That's just one example, but I'm curious if you could put more of a human bent on it: once you dealt with some of these technical issues, what were the unanticipated side effects? Because I'm a little concerned that, without sharing that, people are still grappling with: gosh, how do I jump over all these hurdles to get to some interesting deliverables?

I don't know how unexpected it was, but our Vice President for University Communications is wildly excited about VIVO, because when he's doing a news release, when something happens, he can immediately find all the relevant information at the institution about the person, the project, whatever; pull it together; and get an answer, a news release, whatever, out. And yeah, I'd say that was unexpected, because it was not the original intent of the project. That wasn't the goal, but it's a benefit that I think has accrued to the institution and is recognized at the highest levels of Cornell.

I think one of the interesting things we discovered when we started putting out taxonomic literature in ways that were linkable is how interested the Wikipedia community was in mind-altering plants. Now that you can create on-the-fly bibliographies for various plants and fungi that have interesting medicinal and non-medicinal purposes, we suddenly saw large amounts of linking back from Wikipedia into marijuana, peyote, et cetera. And since these are created on the fly, it becomes an iterative process, because then more and more content gets generated back and forth. So that was one of the unintended uses we found: this generally rather dry scientific literature suddenly taking on a whole new blossoming of interest.

In the case of the Civil War, I think the most surprising thing came from some of the smallest institutions, which originally approached the project just looking at information they'd already digitized, and later came back to the collaborative and said: you know, we've got these amazing, rich resources that we haven't been able to afford to make available in digital form; could we provide and supply metadata that would let the world know they exist? And yes, you might have to come to Richmond, Virginia, or some small town in Ohio, but the reality is that now at least there's more information about them and ways they can be integrated. We, of course, didn't come to the project with that in mind, but we're now excited about the prospect of expanding the wealth of information that might be integrated, just in slightly different ways than we anticipated.

And we're out of time. Thank you.