So I'm Dean Krafft and this is Tom Cramer, and we're going to be talking about our Linked Data for Libraries project today. I'm going to take the first half and Tom's going to take the second. There are going to be a lot of pieces and parts that I'll give quick introductions to, and then I'll talk about how we're trying to fit everything together.

So let's start with the project itself. This is a project funded by Mellon: Linked Data for Libraries. It's a collaboration of Cornell, Harvard, and Stanford that started this past January, and we're working together to develop an ontology and a set of data sources that provide relationships, metadata, and broad context for scholarly information resources. Just think of anything your library or museum or archive would catalog. The project leverages a lot of previous work, particularly the previous work in the VIVO project and the Hydra partnership, and we'll talk about that as we go along.

The vision for the project is to create a linked open data standard to exchange all that libraries know about their resources: a modest goal. A bit more specifically, we're creating a model that works within individual institutions, and through a coordinated network of linked open data, to capture the intellectual value that librarians and other domain experts add to information resources when they describe, annotate, organize, select, and use those resources, together with the social value evident from patterns of usage. We'll get very specific about use cases, and about what we're trying to do, a bit later in the talk.

So what does this mean? Right now we have our wonderful library catalog that lets people search and find resources, but it turns out there's an awful lot of information that we have in siloed systems that doesn't feed into that discovery environment at all, that informs search in no way.
So we have a wonderful registry of all sorts of digital collections that is currently unconnected with our Blacklight catalog search. Another example: we use a system called LibGuides, where reference librarians do a lot of work to provide specific resources for research areas. Here's an example where a research librarian has done a reference guide for, let's see, feminist, gender, and sexuality studies. They've called out a whole set of important resources from the catalog, and potentially from other sources, that would be of value to somebody looking in this area. But again, if you go to our Blacklight search, if you search our catalog, there's nothing that boosts the relevance of these materials. There's nothing that indicates that something you find there is part of this research guide. They're totally disconnected; these things are in silos.

Another example: we used to have a wonderful physical engineering library that you could walk into, and sitting in front of you would be a shelf of engineering handbooks. Well, the engineering library is now a virtual library, but we had no way to represent that shelf of engineering handbooks. So we created a little system called the curated list of library resources that lets us pull out specific resources and make them available, either by call number ranges or by individual selections. So now, using this system, we can model that shelf, but again, it's still unconnected from the main library search. A second example: our Clark Physical Sciences Library calls out classic texts in physics, astronomy, and chemistry and makes those available for students in those areas. So we have all these siloed sources of curated information that librarians have worked on, and they don't really inform the complete discovery system in our libraries; we wanted to help solve that problem.

Last example: we have an entomology collection. Our entomology library is now also virtual.
It's basically a subset of a larger library, but we want to have a curated, focused collection in entomology to make that material available and discoverable outside of the broad set of materials in the College of Agriculture and Life Sciences Library: a reference collection. One more example of information that we would like to be able to include to inform and enhance discovery: if you search our catalog for Philo of Larissa, you'll find a resource written by Charles Brittain, who happens to be a Cornell faculty member. Wouldn't it be lovely for us to be able to draw on that information to give the researcher more understanding about who this author is and what he does? How could we pull this all together? So that's the motivation for part of the project.

Let's do one more of the building blocks and talk a little about what linked data is. I think there's less and less need to give this set of slides, but in case there's anybody in the audience who's not at all familiar with linked data: linked data is structured information, not just documents and text, in a common, simple format. It's open: available, visible, minable. You can basically go to a URL and get the structured data from it, and anyone can post, consume, and reuse it. And it's linked, directly by reference (a URL or URI) and indirectly via common references and inference. The basic underlying building block for all linked data is the RDF triple, which has a subject, a predicate, and an object. So, for example, Italy has a border with Austria. You can make literal statements as well: Italy has UN geocode 380. These very simple statements then let you build up relationships and networks of information.

So why did we choose to use linked data for this project? It's a flexible and extensible framework that we can use to describe, organize, and relate scholars, scholarship, and the whole scholarly context.
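The two example statements about Italy can be sketched as plain subject-predicate-object tuples. This is a minimal illustration, not any particular RDF library; the property URIs under `example.org` are made-up stand-ins for real vocabulary terms.

```python
# A minimal sketch of RDF triples as plain (subject, predicate, object) tuples.
# The example.org property URIs are invented for illustration.

triples = [
    # "Italy has a border with Austria": the object is another resource (a URI)
    ("http://dbpedia.org/resource/Italy",
     "http://example.org/bordersWith",
     "http://dbpedia.org/resource/Austria"),
    # "Italy has UN geocode 380": the object is a literal value
    ("http://dbpedia.org/resource/Italy",
     "http://example.org/unGeocode",
     "380"),
]

def match(data, s=None, p=None, o=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything this tiny graph says about Italy:
italy_facts = match(triples, s="http://dbpedia.org/resource/Italy")
```

Pattern matching with wildcards like this is the essence of how triple stores answer queries: accumulating many simple statements and then asking for all triples that fit a template.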
There is a wide range of tools, systems, ontologies, and vocabularies already available in the linked data environment, and it's a growing ecosystem of developer standards and sources of relevant linked open data. Here's a picture of the current 2014 instance of the linked data cloud. If you go to the lod-cloud.net site, you can see how it's grown over the years. It's getting pretty big: there's a lot of information out there that we can potentially leverage, take advantage of, and link to.

Another building block is a system called VIVO. VIVO is an open-source, semantic-web-based researcher and research discovery tool. It essentially manages and makes available researcher profiles and all the context around researchers. It's also data: institution-wide, publicly visible information about research and researchers. It's a standard, the VIVO ontology, that interconnects researchers, communities, and campuses using linked open data. And it's an open community of developers with strong national and international participation.

VIVO itself is a semantic web application. It provides data that's readable by machines, not just by people (although readable by people as well). It provides self-describing data via shared ontologies with defined types and relationships. So, like the little RDF triples I showed you, there's a set of defined relationships for all those kinds of links between objects. It provides search and query augmented by those relationships, and it does simple kinds of reasoning to categorize and find associations. For example, if you have a faculty member who is teaching a course, that's teaching faculty. So here's the VIVO environment, showing some of the kinds of relationships. VIVO connects scientists and scholars with and through their research and scholarship.
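The "faculty member who teaches a course is teaching faculty" example is a classification rule derived from relationships. A toy version of that kind of inference, with invented class and property names (real systems use OWL reasoning over ontology terms), might look like:

```python
# A toy inference rule in the spirit of VIVO-style reasoning: derive a new
# category from existing relationships. The names FacultyMember, teaches, and
# TeachingFaculty are illustrative stand-ins, not real ontology terms.

facts = {
    ("dr_smith", "type", "FacultyMember"),
    ("dr_smith", "teaches", "CS101"),
    ("dr_jones", "type", "FacultyMember"),   # faculty, but teaches nothing here
}

def infer_teaching_faculty(facts):
    """Anyone who is a FacultyMember and teaches something is TeachingFaculty."""
    faculty = {s for (s, p, o) in facts if p == "type" and o == "FacultyMember"}
    teachers = {s for (s, p, o) in facts if p == "teaches"}
    return {(person, "type", "TeachingFaculty") for person in faculty & teachers}

inferred = infer_teaching_faculty(facts)
```

The derived triples can simply be added back into the store, so later queries for "teaching faculty" find people who were never explicitly labeled that way.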
It may be that they share co-authors or PIs; maybe they have positions in the same organizational structure within the institution; maybe they're working on the same project; maybe they have a shared grant, or they were cited in the same newspaper article. There are all sorts of ways that these scholars can interrelate.

So let's get back to the Linked Data for Libraries project and how it builds on VIVO. It brings the relationship- and identifier-based architecture of VIVO to mainstream library use cases and applications. The Linked Data for Libraries ontology draws on the existing VIVO ontology and its ontology design patterns. We're using the software that underlies the VIVO application, called Vitro; it turns out you can pull out the VIVO ontology, plug in any ontology you want, and have a complete semantic web system as a tool for Linked Data for Libraries. Our demonstration search will be built on an adaptation of a demonstration system we built across seven institutions for VIVO at vivosearch.org. And we'll be linking existing VIVO data, as in the example I showed you before at Cornell, with VIVO data from Cornell, from the Harvard Faculty Finder, and from Stanford's Community Academic Profiles data.

Another component of the project is the BIBFRAME effort that's been going on for a while. The Library of Congress developed the BIBFRAME ontology as the eventual replacement for MARC, the current cataloging standard for library resources. Both the Library of Congress and Zepheira have developed converters that produce BIBFRAME RDF, those little triples, from existing MARCXML. You might ask why we're using BIBFRAME as our bibliographic standard; VIVO itself uses BIBO, and there are other bibliographic ontologies out there. But we are academic libraries. Our data is in MARC, and we need to be able to make use of all of the information that's in MARC as part of the Linked Data for Libraries effort.
So we want to mainstream the use of this data within the libraries. In case you're not familiar with BIBFRAME, it provides structured information with the notion of both works and instances, where an instance would have a particular publisher and published location, and the work itself would be about a subject, by a particular creator. It's somewhat of a simplification of the FRBR model, but it still captures a lot of the kinds of context that you want about your resources, and it really does capture pretty fully what MARC can express.

The issue is that translating MARC records into RDF will not, in and of itself, make useful linked data. We need identifiers. We need local identifiers for statements made by our own institutions, where we're using local authority information. We need global identifiers for the things we want to share: for people, organizations, places, and other things. Even among our three libraries, we need to use shared global identifiers so that we can make connections across the three members of the project. And we're seeking to use standard external identifiers as well: OCLC work URIs, VIAF IDs, ORCIDs, and lots of other standards. A goal of linked data in general, and our project in particular, is to go from the standard metadata string expression for things, people, whatever, to actual URIs that really represent the thing, for people, organizations, places, subjects, and all the rest.

We are working now within our project with OCLC work identifiers. OCLC WorldCat is a union catalog of bibliographic records, and OCLC is working to create common work URIs across their bibliographic resources. We've actually mapped our own OCLC IDs to the work URIs and discovered that there's a lot of overlap from our own materials into work URIs at OCLC. And there's actually a pretty significant degree of overlap in works among our institutions, from what we've looked at so far.
82% of Harvard's bib records can be matched to work identifiers. Between Stanford and Cornell, of the resources we each have that have OCLC work IDs, almost half can be matched between the institutions. This then lets us combine information, annotations, and usage information (we'll talk about this a bit more as we go into the use cases) among the three institutions, which is one of the main goals of the project. We're creating this linked data, and we're showing how it can be used to combine information across all three of our institutions. And of course, if it can work for three, from three you can go to many, many, many.

Okay, a little bit about the Linked Data for Libraries ontology. I've already mentioned BIBFRAME, which is what we're using for library bibliographic information. We're using an ontology called FaBiO for some additional bibliographic types and relationships. For people and organizations, we're using VIVO-ISF (although a subset of it), which includes the Friend of a Friend (FOAF) ontology, a big standard in the general linked data world. We're using the Open Annotation standard for annotations and PAV for provenance. For virtual collections and structured relationships, some of the kinds of curated information I was showing you earlier, we're using OAI-ORE. SKOS is a standard ontology for concepts, and we're trying to leverage many different global identifier relationships.

This next slide is way too tiny in its details to see, but I wanted to give you an idea of where we are in the project now in terms of the ontology. This was the ontology working group's November 18th iteration of how we were trying to deal with the issue of combining identifiers and relating them to works within the project. I understand the November 25th version is a little bit different from this, so this is very much an ongoing effort. It's also going to be changing to reflect changes in the external environment.
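The overlap analysis described at the start of this section is, at its core, set intersection over shared identifiers. A sketch with invented work IDs:

```python
# A sketch of the work-ID overlap analysis: map each institution's records to
# OCLC work URIs, then intersect the sets. These work IDs are invented.

cornell_works = {"oclc:work/1", "oclc:work/2", "oclc:work/3", "oclc:work/4"}
stanford_works = {"oclc:work/2", "oclc:work/3", "oclc:work/5", "oclc:work/6"}

# Works held by both institutions, i.e. where annotations and usage data
# from one catalog could enrich the other.
shared = cornell_works & stanford_works
overlap_pct = 100.0 * len(shared) / len(cornell_works)
```

With this toy data, half of the Cornell-side work IDs also appear at Stanford; the real analysis is the same operation run over millions of records.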
BIBFRAME is potentially looking at an update to how BIBFRAME Person is interpreted. But again, I wanted to give you a sense of some of the kinds of issues that we're dealing with here. As an example, we want to be able to relate BIBFRAME's notion of a person, which is currently really a library authority notion, with the real-world person notion of a system like FOAF. An example that may or may not be fully accurate: you can think of Samuel Clemens as the real foaf:Person and Mark Twain as a BIBFRAME persona for that individual, connected with a lot of works. Another example: we then want to tie out to these external identifiers, and we need to tie appropriately. ISNI has the notion of a persona, so Mark Twain may well get his own ISNI, while Samuel Clemens, the foaf:Person, may have a relation to some other identifier, an ORCID, whatever.

Again, we may have information about people that is not currently captured by an external global identifier, but that we want to be able to maintain and carry forward. An example of this shows up a lot in VIVO, where you have co-author information and very little information about the co-author. You may only have a first initial and a last name, but you need to persist that in the system and make it available so that potentially other folks can reconcile that information and produce a global URI eventually. Final example: we need to relate BIBFRAME's notion of a work with the OCLC notion of a work and have that expressed directly.

So what are the ontology challenges? We need to think, in our ontology work, about identifying people and their relationships to other entities. There are already identifiers out there for people and works, which we need to try to connect: ORCID, ISNI, VIVO, all sorts of them. And there are hard choices around the edges to be made.
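The Clemens/Twain pattern, one real-world person behind one or more bibliographic personas, can be sketched as a tiny graph. The class and property names here (`personaOf`, `creator`) are simplified stand-ins, not the actual LD4L or BIBFRAME terms:

```python
# A rough sketch of the persona pattern: works attach to the library-authority
# persona, which in turn links to the real-world person. Property names and
# URIs here are illustrative stand-ins, not real ontology terms.

graph = [
    ("person/clemens",  "a",         "foaf:Person"),     # the real-world individual
    ("persona/twain",   "a",         "bf:Person"),       # the authority persona
    ("persona/twain",   "personaOf", "person/clemens"),  # illustrative link
    ("work/tom_sawyer", "creator",   "persona/twain"),   # works hang off the persona
]

def works_by_real_person(graph, person):
    """Follow persona links to collect works ultimately by a real person."""
    personas = {s for (s, p, o) in graph if p == "personaOf" and o == person}
    return {s for (s, p, o) in graph if p == "creator" and o in personas}

works = works_by_real_person(graph, "person/clemens")
```

The point of the indirection is that identifiers can attach at the right level: an ISNI to the persona, an ORCID to the person, without conflating the two.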
For example, in the case of Samuel Clemens or others, single people with multiple identities, we need to make sure that we don't get so obsessed with the details of how we represent the incredibly complex cases that we don't move forward on the general problem. One aspect of the general problem is the notion of entity reconciliation. If I have two representations, say a VIVO URI for somebody and a VIAF ID for them, how do I reconcile those? How do I decide that, in fact, these are talking about the same person? In general, that's a really important problem for us. Within our own systems (I talked about the local silos), we need to be able to link information across our library systems. In many cases we'll have common IDs within a single institution, but in some cases we don't, and we need to do work to do that reconciliation. It's absolutely essential to link across the three partners to support discovery, annotation, and virtual collections. We need to do this for works, for people, places, subjects, and many other different kinds of relationships. And finally, there's linking to the web of linked open data, linking to that big cloud of stuff out there: services, new relationships, and networks that we can use to enhance discovery, description, and understanding of our resources.

So I think the library role in all this is to expose our own unique entities and figure out how to connect them out to the rest of the world. The more that we can link, and the more that we can reconcile, the more that we're going to be able to discover. Let me talk about a slightly subtler version of that. If the only thing I can reconcile between two researchers is the fact that they're part of the same organization, I can only say these two researchers are very loosely coupled, and they may not have much in common.
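Entity reconciliation in practice ranges from trivial (a shared external ID) to genuinely hard (fuzzy name matching with evidence weighing). A deliberately naive sketch of the decision, with invented record contents, shows the shape of the problem:

```python
# A deliberately naive sketch of entity reconciliation: treat two records as
# the same person if they share an external identifier, or failing that, if
# their names match after normalization. All record data here is invented.

def name_tokens(name):
    """Lowercase, drop commas, sort tokens: 'Clemens, Samuel' == 'Samuel Clemens'."""
    return sorted(name.lower().replace(",", " ").split())

def same_person(rec_a, rec_b):
    if rec_a["ids"] & rec_b["ids"]:  # a shared ORCID/ISNI/etc. settles it
        return True
    return name_tokens(rec_a["name"]) == name_tokens(rec_b["name"])

vivo_rec = {"uri": "vivo/person42",  "name": "Clemens, Samuel", "ids": set()}
viaf_rec = {"uri": "viaf/person-x",  "name": "Samuel Clemens",  "ids": set()}
```

A real pipeline would record a confirmed match as an owl:sameAs assertion and would need far more evidence than a name match, which is exactly why shared global identifiers are so valuable.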
But if I have a strong set of connections between them, then I can make the statement that, well, if somebody finds a resource by one of these researchers, they might well be very interested in resources by the other one. These are closely coupled, and it's not just people: other kinds of close, strong coupling can potentially enhance relevance ranking, research, and discovery.

So how will the Linked Data for Libraries project make these connections? By using ontologies commonly found in linked data (I've talked about some of those). By connecting Cornell's VIVO with Stanford's Community Academic Profiles and Harvard's Profiles information. By using persistent, stable local identifiers, which we can then associate with global identifiers, including ORCID, VIAF, and ISNI. By supporting annotations with provenance, and by linking to external sources of networked relationships: things like DBpedia, IMDb, the entire web of OCLC information, all of those. And really what we're seeking here is to use linked data as a standard that can serve as a lingua franca across different organizations, across different disciplines, across international boundaries. It really is a standard that's widely shared and widely usable. So there's your quick introduction to the ontology piece of this, and Tom is going to talk about the engineering side.

Thankfully, Dean transitioned off the Tower of Babel before we got to the engineering; I was wondering if he was asking me to cover the more perilously fraught part of the project. Dean gave an excellent overview of some of the things that we're trying to do within the project in terms of unlocking silos of library information. After our initial foundational meetings at the project launch this past January and February, we really got our arms around what this actually means. We have a lovely 17-page proposal with great text and long lists of potential ontologies to link to.
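The loose-versus-tight coupling idea lends itself to a simple scoring sketch: count the distinct relationships two researchers share, and boost relevance in proportion. The relationship data and the weighting scheme below are invented for illustration:

```python
# A sketch of coupling-based relevance: the more distinct relationships two
# researchers share, the more one's resources boost the other's in ranking.
# The connection data and the 0.1 weight are invented for illustration.

connections = {
    ("alice", "bob"):   {"same_department"},                  # loosely coupled
    ("alice", "carol"): {"same_department", "co_author",
                         "shared_grant", "same_project"},     # tightly coupled
}

def coupling_score(a, b):
    """Number of distinct relationship types linking two researchers."""
    return len(connections.get((a, b), set()) | connections.get((b, a), set()))

def boost(base_relevance, a, b, weight=0.1):
    """Scale a result's relevance by the coupling strength of its author pair."""
    return base_relevance * (1 + weight * coupling_score(a, b))
```

So a search that surfaces one of Carol's papers would rank Alice's related material well above Bob's, because four independent connections carry more signal than one.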
And I think this diagram, more than any other, boils down for me the essence of what we're trying to achieve. Within each research institution, like Cornell, like Harvard, and like Stanford, we've got rich pools of data for describing our bibliographic resources, our collections. We've got a fantastic set of person data that might come through profiles systems such as VIVO, Harvard Profiles, or Stanford's CAP, as well as great links out to library-world authority files and the growing set of researcher authority files like ORCID or ISNI. We also have a large pool of data around what might be broadly classified as usage data or curation data: which things or objects are being used, being gathered into reference lists, into digital collections, into core syllabi. And yet each of these things largely exists in its own world, with very few connections across.

So Linked Data for Libraries (though "libraries" may be too limiting a term; we take a very expansive view of what a library is) is really looking at exploiting these other pools of data at institutions, and across research organizations and structures, to try to figure out how to answer key questions: not just about discovery, though that's a critical piece of it, but also more fundamental questions, leading people from one resource to another for multiple reasons.

Now, the multiple reasons. I don't know how many of you have been in a room with ontologists, or self-described ontologists. It's fascinating, and it can be time-consuming, and it turns out that no one is right and no one is wrong, because oftentimes it's not quite clear what the objectives are. One of the early principles we established in the course of this project is that we could have wonderfully ornate and elaborate models describing our individual data sets and our views of the world, but what are we actually trying to do?
What will the benefit be to some predefined set of users? An anchoring principle for this project is that the ontology development and the engineering would both be guided by a set of real-world use cases. Now, I don't know how many of you have tried to come up with real-world use cases for linked data, but typically there's a process where 80% of the effort talks about conceptually what you could do, then 16% of it is on the ontology and concrete models, and then there's a little bubble that says "killer app invented here." What we wanted to do was move past that. I've yet to be part of a project where the killer app has emerged, though I think VIVO actually has many of the aspects of one. One of the things that we wanted to do was characterize upfront what we were trying to accomplish, using what data, for what purpose.

This was an interesting process. We resorted to an agile form of requirements gathering called stories, or use case development: trying to boil things down from abstract notions ("if I connect this cloud here to that cloud there with some kind of predicate, goodness will result") to the very concrete, familiar structure you'll find if you look up stories or use cases: as a kind of user, I want to take some form of action, in order that I can realize this benefit. Then, with each one of those, we tried to characterize what some potential demonstrations would be from our various institutional contexts; what data sources might be required (I might need some kind of bibliographic data mixed with some kind of person data mixed with some kind of usage or circulation data); therefore, what the ontology requirements might be (do I actually care whether Mark Twain and Samuel Clemens are the same or different, or about the relationship between them?); and then, finally, the engineering work to support that. I was pleased at how clean this slide looked when I produced it last
night, because actually this was just an absolute riot: working across three different institutions, with three different engineering and requirements-gathering cultures, and different understandings of what use cases were. This slide is what we ended up with; what we started with was incredibly messy, or I guess we might say organic and robust, with use cases like "well, if it's a Tuesday night on AstroTurf with a left-handed pitcher facing the Giants in the 1960s, what are the percentage chances..." We had some very far-out use cases, and if you've been part of linked data discussions, you've probably heard those too. But what we did succeed in doing was capturing 42 concrete use cases, of various forms and various levels of aspiration, drawing on the different pools of data described earlier. What was remarkably refreshing is that by the end of about two or three months of analysis and combinatorial exercises, we identified that there were really six clusters of use cases that we were talking about, and we managed to reduce those to 12 pretty concrete, pretty believable statements of potential benefit for leveraging linked data in a library context. Those clusters really dealt with the combinations across the pools of data that until now have been largely unjoined, or largely unlinked, within our individual institutional contexts or across the institutions.

So the first use case cluster was the combination of bibliographic and curation data: given a set of information resources, and given knowledge about how those things have been used or tagged or classified, could I actually drive relevance in discovery operations or processes based on that information?
The second was the combination of bibliographic and person data. Dean gave an excellent example in the earlier slides: with that search, you can find an item in the Cornell catalog written by a Cornell researcher. If you actually had the URI linking to that researcher's identity, you could discover where that person's lab is, what the current research is, who that person's current or past collaborators are. (If you could do future collaborators, there'd be a lot of money in that, and we should write that one down.) So: could I pivot between this knowledge and understanding of people and their relationships, as represented in systems like VIVO, and the traditionally rich but siloed information about library bibliographic resources?

Third was really about linking into the wider world of linked data by leveraging what we call authorities within our context, but based on the notion of not just strings but things: for places, for subjects, for entities like people and organizations, can I drive better discovery processes and better analytics?

Fourth was a set of clusters getting more into some of the more advanced features and capabilities of linked data: through inferencing, multiple joins, or following multiple links, can I in fact follow a search on the Civil War and end up with a rich collection of costume designs used in 20th-century theater for 19th-century costumery? That's one of the more ambitious examples we captured in the 42 raw use cases.

The fifth was really about leveraging circulation data.
Based on how often a resource might be checked out, consulted, or used (some of this data is available to us in various forms at our institutional libraries), can I drive processes in Amazon-like ways: "people who found this useful also found this useful"? Or might the fact that something circulated multiple times to faculty at one institution drive recommendations at another institution? I have to say that this has provoked some interesting philosophical discussion among the partners, and we're still sorting out whether it's meaningful and whether it's a good idea. It's been mind-expanding.

And then the sixth one is really: can we do an aggregation of the data? What if we thought about bibliographic, person, and curation data not just at an individual institution, or three individual institutions, but combined those across our three, or across 30, or 300 institutions? What might you be able to discover about the academic environment writ large?

I'm going on at some length about these use cases, for one because they're one of the things we did, so we're celebrating success, but two because I feel like this is a very concrete deliverable that possibly, probably, has applicability at institutions and for linked data efforts beyond the Linked Data for Libraries project. One of the things we found in the course of this, for example, was that the BIBFRAME initiative, which Dean described, also had a set of use cases taken from a slightly different angle, and as we looked at those, they were immensely helpful to us in terms of clarifying what our thinking was and what our objectives were.
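The "people who found this useful also found this useful" pattern is classic co-occurrence counting over checkout histories. A sketch with invented circulation data:

```python
# A sketch of circulation-based recommendation: items frequently checked out
# by the same patrons get recommended together. All checkout data is invented.

from collections import Counter

checkouts = {
    "patron1": {"bookA", "bookB", "bookC"},
    "patron2": {"bookA", "bookB"},
    "patron3": {"bookA", "bookD"},
}

def also_borrowed(item, checkouts, top_n=2):
    """Rank other items by how often they co-occur with `item` in histories."""
    counts = Counter()
    for items in checkouts.values():
        if item in items:
            for other in items - {item}:
                counts[other] += 1
    return [i for i, _ in counts.most_common(top_n)]
```

In a real library setting the same counting would be done over anonymized, aggregated circulation data, which is part of why the privacy and policy discussion the speakers mention matters so much.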
At the bottom of this page you will find the LD4L use cases; I think if you do a Google search for "LD4L use cases," it's the first thing that comes up. But just to give examples, here are two summaries of use cases from the first cluster, which was about curation data and bibliographic resources. The first one is building a virtual collection. One of the situations we have at Stanford is that a lot of people are adding things to exhibits or research guides. Our librarians are very interested in having that knowledge, which has relevance in a context outside the bibliographic record, be searchable and findable and add to the discovery process. Our metadata department doesn't want them touching the catalog record with that information, so we have this mismatch. This is a great example where we can do basically a mashup, or a linking, of these two disparate pools of data and actually meet a number of use cases, broadly described as "the librarians want to start tagging the catalog." And when your librarians want to tag the catalog, I think that's telling you something: you've got some fundamental opportunities. So: as a faculty member or librarian, I want to create a virtual collection or an exhibit spanning multiple collections, so that I can share it with a class, a set of researchers, a set of students in a discipline, et cetera, et cetera.
The second is related to that, but it's tagging scholarly information resources to support reuse. As a librarian, I would like to be able to tag a resource into a curated list, so that I can then feed those lists out (think of RSS or syndication) to subject guides, course reserves, reference collections, personal profile pages. By taking a well-curated list like the library catalog and adding these tags, I might suddenly expose all sorts of potential opportunities for information syndication and information referencing, even across institutions. Knowing, as in one of Dean's examples, that a resource held in common in Stanford's and Cornell's collections was in a Cornell librarian's research guide for that topic could be useful information that we would want to expose within Stanford.

Another thing that has anchored the work on LD4L so far is that not only do we want a rich ontology and a set of use cases, we actually want a set of pooled data, and applications built off of that data, to demonstrate value. So one of the things we've done is phase the work in, broadly speaking, three big waves. The first is to focus really on annotations, the use case cluster one that I just read through. The second is also well-established territory, where there's been good work, though largely not well leveraged, at our individual institutions: leveraging authorities as part of the day-to-day work of the library in support of the scholarly process. So, discovering works via people and their relationships; discovering works via locations and their relationships. There are rich and authoritative vocabularies and URIs describing places, and the nice thing about Earth is we all live in the same place and typically are writing about the same place.
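The tag-into-a-curated-list use case above keeps librarian tags outside the catalog record while still making them syndicatable. A minimal sketch, with invented resource URIs and tag names:

```python
# A sketch of tag-based curation kept outside the catalog record: librarians
# tag resource URIs into named lists, and each list can then be syndicated
# (to subject guides, course reserves, etc.) without touching the MARC record.
# Resource URIs and tag names are invented.

from collections import defaultdict

tags = defaultdict(set)  # tag -> set of resource URIs

def tag_resource(uri, tag):
    tags[tag].add(uri)

def curated_list(tag):
    """The feed a subject guide or reserves list would consume for this tag."""
    return sorted(tags.get(tag, set()))

tag_resource("catalog/rec1", "feminist-studies-guide")
tag_resource("catalog/rec2", "feminist-studies-guide")
tag_resource("catalog/rec2", "course-reserves-fall")
```

Because the tags point at resource URIs rather than living inside the catalog records, the same resource can sit in many lists, and discovery layers can boost anything that appears in a curated list.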
Lots of different local implementations and local experiments have been done to try to exploit that data: to find other works about this place, or written by authors from this place. And then you can also extend that into time. There are two more: one is to discover works via concepts, which gets into slightly slipperier ontological ground, and the one that's not listed here is not to discover works but to describe works, actually using URIs and entities. So instead of going up to the very beginning of the process and doing cataloging and description with strings, whether by librarians or by end users, say in an institutional repository, encourage them to start doing description using linked data, whether or not they're aware that they're using linked data.

The third phase is really getting into some of the more advanced work, which is leveraging linked open data in general. Leveraging the deep graph: that's the notion of inferencing and complex queries and joins. Leveraging usage data: does the fact that something was checked out 15 times actually make it more or less relevant? How do you express that data? How do you integrate it into your discovery process or your recommendation process? And then the sixth one is this pool of commingled data. At this point in the project, 11 months into the 24-month cycle, we're in these relative phases: solid engineering work on the first one; planning and assessment mode for the second, a lot of it with the collection analysis that Dean described earlier; and we're really in the research and conceptualization phase for the last cluster.
One of the things that Dean mentioned is that we are trying to make sure this work is well grounded in the current engineering and information infrastructures of our three institutions, and also applicable beyond the lifespan of the grant: that it's not simply a research grant, but something leading, we hope, to real information services. Dean actually put this slide in, not me, I have to say, and then he put it in my part of the slide deck, but one thing I can say is that Cornell is very excited about Hydra. As they were thinking about adopting the project, they wanted a solid foundation from which to start, something that would give them a head start, and one of the things they were looking at was using the Hydra application framework on top of a triple store on the back end. Cornell's information architecture basically has a triple store where they're amalgamating lots of different sources of things that other institutions like Stanford have largely put into a digital repository; they're doing a different take on the integration. If you're not familiar with Hydra, it is an application framework that sits on top of Fedora, which is content middleware, a very common digital repository system. There are collaboratively built solution bundles used by many parts of the community for systems like digital collection presentation and description, and institutional repository services. It has, at this point, I think 26 or 27 signed partners, and the project motto is "if you want to go fast, go alone; if you want to go far, go together," speaking, I think, to one of the reasons Cornell was interested in seeing their work picked up, and sustained, by a wider community over time. As it happens, that's true, as Stanford, for example, and other Hydra partners are quite interested in this.
Just in terms of anchoring the technical componentry: basically there's an application stack which uses Solr and Blacklight for the read and presentation view, and a Ruby on Rails gem called hydra-head which assembles the application logic and what's presented to the users, and all of that is written into Fedora on the back end. I think the interesting thing here is the possibility to actually replumb Hydra: instead of using a component called ActiveFedora, which is what lets you put a Ruby on Rails stack on top of Fedora, making it look like a relational database management system, you put the Hydra componentry on top of a triple store, so you can explode all of your data into an RDF store plus a bunch of associated blobs. (I don't know why I'm talking about this; you've got me going.) In effect, with a componentized architecture, can Hydra become the native triple store and the native linked data store for libraries at large? Now, what happened between the time the grant proposal, and many parts of this slide, were written and now is that the Fedora 4 project went into production release last week. It is a native RDF store, and in fact Cornell has cleverly adopted an existing component written by Oregon State, and now taken up by the Digital Public Library of America, called ActiveTriples, to do exactly what was so presciently described in the grant. So we've seen a great example of convergence and reuse, where you can actually store linked data for libraries, and others, inside Fedora, or inside a Hydra stack, as a core component of the architecture. Well predicted, Dean. So the good news, in a broader sense, is that not only is all the ontological work and the use case development going on, but a lot of the componentry to store and then resurface this linked data, these RDF assertions, is actually being built into a common library IT stack.
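The "replumbed" architecture described above, metadata exploded into triples with the file payloads stored alongside as blobs, can be sketched conceptually. The store names and the `deposit` helper are hypothetical; this is not the ActiveTriples or Fedora API, just the shape of the idea.

```python
# Conceptual sketch: an RDF store holds the descriptive triples, a blob
# store holds the binary payloads, and a shared URI links the two halves
# of each deposited object.

triple_store = set()     # stands in for a native RDF store (e.g. Fedora 4)
blob_store = {}          # stands in for the binary/datastream side

def deposit(uri: str, metadata: dict, payload: bytes) -> None:
    """Explode the metadata into triples and park the bytes alongside."""
    for predicate, value in metadata.items():
        triple_store.add((uri, predicate, value))
    blob_store[uri] = payload

deposit("http://example.org/item/1",
        {"dc:title": "Engineering Handbook", "dc:format": "application/pdf"},
        b"%PDF-")
print(len(triple_store), len(blob_store))  # 2 triples, 1 blob
```

Because the metadata is already RDF, the same store can answer graph queries directly instead of requiring an after-the-fact conversion.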
A related part of this (sorry, I was touching Dean's computer and it went to a different screen) is that for the annotation use case clusters, where people are building curated collections or enriching knowledge of records, but not necessarily enriching the catalog record itself, we are using an information pattern called Open Annotation, for linked data, to be able to describe these things in a generic way and store them. At Stanford, part of our engineering work has been on a component called Triannon, which speaks to the second bullet there: being able to take annotations expressed as Open Annotation RDF, convert them into linked data, store them in Fedora, and retrieve and visualize them. There will be more discussion of that tomorrow at the Fedora 4 early adopters session, if anyone is interested. We're also interested in using Blacklight, the UI component of the Hydra stack, not only for applying annotations (a tagging interface of the kind many of you have seen and already have in your catalog systems), but also for retrieving that information from the underlying triple stores. So if a resource was used in a research guide at three different institutions, it would be able to surface that: it's not only having the underlying data and data models, but some kind of user interface that actually floats that back to researchers in a useful way. And then finally, for use case 3.4, which was omitted from the previous slide, we want depositors into digital library or repository systems to be able to enter not just random strings but URIs, good strong entities, as they describe the place of publication, the topic, or a contributor to a work.
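The Open Annotation pattern mentioned above has a simple shape: an annotation node links a body (here, a tag) to a target (the thing being tagged), with a motivation saying why. The sketch below uses the real Open Annotation namespace and properties, but the record and tag URIs are hypothetical, and plain tuples stand in for an actual triple store.

```python
# A tagging annotation in the Open Annotation pattern, modeled as a set
# of (subject, predicate, object) triples.

OA = "http://www.w3.org/ns/oa#"
anno = "http://example.org/anno/1"            # hypothetical annotation URI
triples = {
    (anno, "rdf:type",          OA + "Annotation"),
    (anno, OA + "motivatedBy",  OA + "tagging"),
    (anno, OA + "hasTarget",    "http://example.org/catalog/record/4087"),
    (anno, OA + "hasBody",      "http://example.org/tags/engineering-handbooks"),
}

# Everything tagged this way can then be pulled back out for syndication
# (subject guides, course reserves, curated lists):
tagged = [o for (s, p, o) in triples if p == OA + "hasTarget"]
print(tagged)
```

Because the annotation lives outside the catalog record, librarians can enrich discovery without touching, or needing write access to, the record itself.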
Another component in this environment is LibraryCloud, which comes out of Harvard's work over the last several years, where they have assembled bibliographic pools from all across the Harvard library system and produced a data aggregation and a set of APIs, as well as some visualizations, on a view of the Harvard library system's data. One of the services that sits on top of this is a stack visualization powered by an algorithm called ShelfRank, developed over the years by the Harvard Library Innovation Lab. (I heard Tracy giggling; my apologies, my confusion is clear.) The idea is to figure out, based on circulation status and circulation data, what's been used more and what therefore might be more relevant, and this is driving a lot of the early modeling that we're doing on the usage data. LibraryCloud is also the data store that we're using to work out the conversion into linked data; it's the source data for the LD4L instance. As we think about assembling the data from those three big pools, this is how we've divided up the world at Stanford. The first area, where we're doing engineering work right now, is annotations, as the least encumbered, the least controversial, and, with the Open Annotation model, perhaps one of the best understood. Working from right to left, that's our nice rose-colored column.
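The usage-data question raised earlier, how many checkouts should translate into how much relevance, implies some kind of normalization. The sketch below is emphatically not the actual ShelfRank algorithm, whose details belong to Harvard; it is a hypothetical illustration of turning raw circulation counts into a bounded boost that a discovery system could mix into its ranking.

```python
import math

def usage_boost(checkouts: int, max_checkouts: int) -> float:
    """Hypothetical log-scaled usage score in [0, 1].

    Log scaling damps runaway popularity: the difference between 0 and 15
    checkouts matters more than the difference between 100 and 115.
    """
    if max_checkouts <= 0:
        return 0.0
    return math.log1p(checkouts) / math.log1p(max_checkouts)

# An item checked out 15 times, in a collection whose most-used item has
# 120 checkouts, gets a mid-range boost:
print(round(usage_boost(15, 120), 3))
```

Whatever the actual formula, the hard part the speaker alludes to is not the arithmetic but deciding where such a score belongs in the discovery or recommendation pipeline.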
In the middle column, we think there's a lot of profitable work to be done, which is really about understanding people, place, agent, and topic modeling: relating those to the strong URIs that have already been established in the community via many of the ontologies and sources Dean described earlier, and then enriching our records to the degree that they haven't already been enriched. The third area, where I think we're seeing some co-evolution with BibFrame and the general library community, is trying to figure out the best way to express our MARC records, and our other pools of traditional bibliographic resources, as linked data; that's still a work in progress. After all of that is converted, conceptually it goes into one big triple store, or a bunch of interlinked triple stores, at a single institution, which is then surfaced through a set of APIs for linking. And if you imagine this diagram in triplicate, you might think of Cornell, Harvard, and Stanford all aggregating their data into one even bigger pool towards the end of the project. As for the working assumptions, and I think some of the things that make this process interesting: this project, if not unique, is perhaps rare in the scale at which it's trying to accomplish this. Individually, any single one of these institutions doing this would be notable; doing it in common across three institutions, with the linking and common engineering and common tooling, gives us a massive collection of potential bibliographic resources, but also a pool of people and curation data.
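The pooling step just described, three institutional graphs commingled into one, is conceptually a union of triples queried through a shared API. The institution names are real; the triples below are hypothetical stand-ins.

```python
# Each institution's converted records form a graph of triples; the LD4L
# pool is (conceptually) their union.

cornell  = {("ex:work1", "dc:creator", "viaf:50566653")}
harvard  = {("ex:work2", "dc:creator", "viaf:50566653")}
stanford = {("ex:work3", "dc:subject", "ex:topic/linked-data")}

pool = cornell | harvard | stanford   # the commingled cross-institution graph

# A query the pool can answer that no single silo could: every work in any
# of the three collections by the same VIAF-identified author.
by_author = sorted(s for (s, p, o) in pool
                   if p == "dc:creator" and o == "viaf:50566653")
print(by_author)
```

The query only works because all three graphs describe the creator with the same shared URI, which is exactly why the reconciliation work discussed next matters.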
Trying to understand the pipeline and the workflows is going to be important, and we've built in as a working assumption that this is not a one-time conversion that we'll get right the first time, because we've already done it multiple times, and not well. So we're trying to understand the pipeline: a repeatable process for generating, relating, augmenting, versioning, and backing out changes to these triple stores. And the third thing is how we can build useful services that sit on top, because again, the purpose is not just to have more data; the purpose is to expose data and useful services to end users. Dean alluded to some of the challenges a little earlier. One thing I think we're constantly finding is that the perfect is the enemy of the good: spending too much time documenting, or tackling the 0.01 percent of the use cases. The Mark Twain / Samuel Clemens discussion was an hour and a half of some of the smartest people I know on a telephone call in unlit, windowless rooms. I wasn't on it. But it was a time-consuming process, and I think the answer at the end was that we actually don't know, or that there isn't necessarily a single good answer. The second challenge is this notion of minting versus finding identifiers: there are already masses of linked data and URIs out there. Do we really want to add to that pool with new identifiers? And if we do, how do we do the reconciliation, the linking across? It's a fundamental challenge, and one that I think the linked data community in general hasn't quite mastered. It's your Samuel Clemens / Mark Twain problem again: are you talking about the same thing, or about two things with the same name but slightly different? So, sameAs or seeAlso. And then, the scale just requires a fair amount of computational time and computational power.
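The minting-versus-finding question can be sketched as a lookup step run before a new identifier is ever minted: does a sameAs link already connect the local name to a shared identifier? The local URIs and the lookup helper below are hypothetical; the VIAF URI is the real one for Mark Twain.

```python
# Known equivalences, e.g. harvested from VIAF or LC authority links,
# stored as (local URI, shared URI) pairs.
same_as = {
    ("http://example.org/cornell/person/twain-m",
     "http://viaf.org/viaf/50566653"),
    ("http://example.org/stanford/person/clemens-s",
     "http://viaf.org/viaf/50566653"),
}

def canonical(uri: str) -> str:
    """Resolve a local URI to its shared identifier, if one is linked."""
    for local, shared in same_as:
        if uri == local:
            return shared
    return uri  # no link found: a candidate for minting, or manual review

cornell = canonical("http://example.org/cornell/person/twain-m")
stanford = canonical("http://example.org/stanford/person/clemens-s")
print(cornell == stanford)  # both resolve to the same VIAF entity: True
```

The hour-and-a-half phone call, in these terms, was about when this resolution is safe to assert automatically (owl:sameAs) and when the weaker seeAlso is all the evidence supports.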
Our DevOps and sysadmin groups kind of blanched when they saw the request that came in from the linked data engineer when he specced out the machine that he wanted. His expectations were reset, but it is a fundamental challenge if you want to churn through lots of MARC records, and do it multiple times because you didn't do it right, or well, the first time. Leveraging other technologies: we don't want this to stand in isolation. I think none of the three institutions is interested in producing research-ware, and certainly none of us, Stanford at least, is interested in sustaining research-ware. And finally, even though it is linked data for libraries and we're looking largely outside the bibliographic box, we spend an awful lot of time talking about bibliographic data, perhaps because that's our richest source and because that's where so much work has already been done; it's a constant challenge to extract ourselves from that and think about the wider world. At the end, if we're successful, we're looking at a state where linked data becomes the standard, not an after-the-fact conversion process. Ideally, library descriptions of our resources will refer to identified entities: to things, not strings. They will be discoverable in concert with other types of metadata from all different sources. They will be aware of, and leverage, other types of institutional and trans-institutional data, from the institution's identity management systems, for example, to create coherent and richer local authority files: what is our notion of this entity? We'll have interoperability across libraries, so we won't have to worry whether Cornell's Mark Twain is the same as our Mark Twain, and we'll be able to interoperate across the wider world of linked data on the web. The project timeline is roughly divided into four quarters of six months each.
Yes, that's confusing, because each quarter is six months, which is not the academic quarter. The first quarter of the project was really spent trying to identify the data sources and the existing vocabularies to use, rather than creating one whole cloth, beginning the initial ontology design, and working on the use cases. We're now just coming to the end of phase two (there are four phases to the project; that's a better term): completing the initial ontology, with Cornell doing good work deploying and extending the existing ActiveTriples componentry from the Hydra project, and piloting data ingests into a Vitro-based instance at Cornell. We're hoping to have a halfway checkpoint at a major workshop that will be held at Stanford in February, with the notion that we should invite, and we have invited, 10 to 12 other interested institutions in the linked data space to go into a deeper dive about the planning, the use cases, the ontology, the engineering, the assumptions, and the tooling that we expect to emerge from the project. We'll go through a detailed overview of the ontology and the project as a whole, obtain feedback on the overall construct, and make sure that we're not doing things that, if not limited to a single institution's parochial view, are limited to our triumvirate partnership. One of the things we want to do there is make sure that we're really leveraging the wider community of linked data, both in terms of the existing entities that are out there and the services already being built on top, such as by DPLA, by VIVO, and by SHARE; we're also very interested in parallel developments by BibFrame, OCLC, and Zepheira in this space. The third phase of the project is really to move towards live services, with pilot instances across all three partners, beginning to cross-populate data, and moving to the second and third clusters of the use cases.
And the capstone of the project will be this time next year, when you may see us again: fully functional LD4L instances, massive relating of entities across our different institutions, and successfully demonstrated use cases across multiple clusters. One of the interesting things about the project is that it has really brought together not only three different institutions, but multiple different groups across institutions that don't always talk to each other. At Stanford, for instance, our technical services and metadata department is keenly interested in this, because I think they see it as part of the future of bibliographic description and a lot of their workflows. Of course, the library technology unit, which is mine, is deeply invested in this. We also have researchers across the pool of three institutions who are looking at the information science and the theory. It's been a rich partnership in that sense. We are seeking to expand the LD4L community and the effort. Everything, or almost everything, is on a publicly accessible wiki, where the use cases are documented; we encourage people to read, comment, and contact us. And we will actively continue to exploit the relationship with the VIVO and Hydra communities in particular. Lovely screenshot. Meyer Library behind us has now got cyclone fencing around it; you can no longer go there. It's a big concrete monstrosity. That's why we took the picture there. At the end of the project, we're hoping for a couple of major developments. One is that we'll really have a much better sense of how these different ontologies and pools of data should interrelate to accomplish things that actually advance the mission of the libraries. Two is that we should have tooling that supports the conversion, the relation, the editing, and the visualization of these.
So, not only an ontology, but actual engineering that makes good use of these and helps produce or manifest the data. And three, that the tooling and the engineering will be deeply embedded in, and integral to, the already large communities of Hydra, Blacklight, ActiveTriples, and VIVO. So in summary, actually, I'll turn this over to Dean, because he's got what he tells me is a funny story about the ivory-billed woodpecker. Well, I don't know if you're familiar with the ivory-billed woodpecker, but it may or may not be extinct at the moment, and there's a lot of information on it at Cornell's Laboratory of Ornithology. So really, the message of this slide is that we need to evolve and collaborate as libraries if we are going to stay relevant to the academic mission. I thought it might be so we could help find the ivory-billed woodpecker, but actually, you're saying we are the ivory-billed woodpecker, perhaps. We want to make sure we are not the ivory-billed woodpecker. You've got it. I'm the slow one in the group, or the literal one.