I'm Will Sexton, and this is Sean Aery, from Duke University Libraries. We're going to talk a lot this afternoon about digital collections at Duke. Digital collections is a program that produces and maintains content for a couple of different places on the library's website. The main one is at the top there: library.duke.edu/digitalcollections. I'm also going to mention the archival finding aids to some degree; those are the finding aids for the Rubenstein Library, our special collections library at Duke, and they're at the URL you see below.

I'm going to try to give some context and an overview of digital collections at Duke, leading up to framing the main problem we're here to discuss today. If I linger too long in providing context and get sidetracked, Sean, please just give me a kick and I'll keep it going.

Digital collections is mentioned obliquely in the library's strategic plan, down at the bottom there, highlighted in blue. Under goal two, providing digital content, tools, and services, we're mentioned as accelerating the digitization of unique library materials and increasing access to digital scholarly content in all forms. I'm just going to keep going, and maybe I can get the slides back up at some point.

This is a collections grid that comes from a blog post by Lorcan Dempsey, in which he talks about the resources in the library. He organizes them into four quadrants according to the level of stewardship the library provides and the level of uniqueness of the materials. In digital collections we tend to work in that lower left-hand quadrant, with materials of high uniqueness that require high stewardship from library staff. They tend to be special collections, but they're not limited to special collections.

I'm going to talk about how we approach this in terms of three challenges.
The first is the organizational challenge: how we select materials for digitization, set objectives, and support the processes. I'll go through that one very quickly. Then content: how do we deal with all the variety, the very heterogeneous collections that we're tasked with handling? And finally discovery: how do we help researchers find and use our collections?

Sean and I are members of a team called the Digital Collections Implementation Team; this is it here. The membership of this team has expanded and contracted over the years. We've done some reorganization lately that has stripped it down to four members, but I expect it to grow in the near future. We draw a metadata librarian from Rubenstein Library Technical Services and the production lead from the Digital Production Center, which is where most of the digitization actually takes place. And then there's Sean and me, the lead designer and lead programmer for the implementation team; we work in the Digital Projects Department at Duke.

We also work very closely with a number of other parts of the library that are in our orbit. Over here on the far right is the Rubenstein Library digital collections collection development group, a newly formed group that came out of the reorganization; they're tasked with assessing proposals for digitizing materials from their library. We also, of course, work very closely with Conservation, a very important partnership; the Conservation Department actually adjoins the Digital Production Center at Duke, and the two always work closely together. We work with Cataloging and Metadata Services to provide description of the materials. And over here, the head of the Digital Projects and Production Department is our boss, Deborah Kurtz, who's here. Hi, Deborah.
Her boss is the IT director at the Duke University Libraries, and I put the two of them there because if we receive proposals for collections that are not from the Rubenstein Library, they go through them as a kind of ad hoc collection development committee.

I'm going to give a real quick overview of the content we work with in digital collections, a quick historical tour of digital collections at Duke. The animations were awesome on this slide; let's see if I can actually get them to work, and if not, I'll just muddle through. Nope. Okay.

One of the first projects we worked on was a papyrus collection, all scanned at 150 DPI, criminally. But we just got the news last week that one of the professors who was involved in this project at the beginning has, together with the library, received a grant to place him in the library as a member of library staff, along with two developers he has collaborated with on a number of projects over the years, to take up the mantle of working with the papyri and other early manuscripts we have in the library. That original project was completed all the way back in 1995.

Behind here, and the animation was awesome, now it's as much grace under pressure as I can muster, is a piece of sheet music with a very attractive cover. I was going to show you that we did a sheet music project, and this actually preceded the Dublin Core Metadata Initiative. The cataloger who worked on this project devised a very elaborate and detailed metadata scheme for the sheet music.
There are something like 30 metadata properties in that collection, and when I took over the position of metadata architect at Duke in 2002, I started organizing those properties into the fields that looked like Dublin Core and the ones that were refinements of Dublin Core, and that's basically how we do all of the metadata for digital collections at Duke now.

This one is a great ad for a tape recorder, from 1950, inviting people to use a tape recorder to record their Thanksgiving dinner so they can preserve it and listen to it in the future. When I saw that I thought: when Duke Digital Collections digitizes that recording, you probably don't need release forms from Grandma and Grandpa, but make sure little Timmy and Susie sign theirs. This was the first of many advertising collections we've done; we have very strong holdings in advertisements. It introduced the product and company metadata fields, which are common across all of these advertising collections, and you'll see a couple more examples as I go through.

In 1999 we did the William Gedney photographs collection. Gedney was the first of a number of very important, interesting photographers whose collections we've digitized over the years. And right about then, 1999: all of these projects had been grant funded, with program coordinators bringing in outside monies to fund them. Then that all stopped, and for a number of years it was a fallow period for digital collections, until around 2005. Excuse me. That's okay, thanks.
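To make that mapping idea concrete, here is a hedged sketch of the kind of local-fields-to-Dublin-Core regrouping described above. The source field names and the sample record are invented for illustration; this is not the actual sheet music scheme.

```python
# Each hypothetical local field maps to a Dublin Core element, optionally
# with a refinement (e.g., "created" refines "date" in Dublin Core terms).
FIELD_MAP = {
    "composer":       ("creator", None),
    "lyricist":       ("contributor", None),
    "copyright_date": ("date", "created"),
    "first_line":     ("description", None),
}

def to_dublin_core(record: dict) -> dict:
    """Regroup a local metadata record under Dublin Core element names."""
    dc = {}
    for field, value in record.items():
        if field not in FIELD_MAP or not value:
            continue  # skip unmapped or empty fields
        element, refinement = FIELD_MAP[field]
        key = f"{element}.{refinement}" if refinement else element
        dc.setdefault(key, []).append(value)
    return dc

sheet = {"composer": "Harry Von Tilzer", "copyright_date": "1905"}
print(to_dublin_core(sheet))
# {'creator': ['Harry Von Tilzer'], 'date.created': ['1905']}
```

The point of a table like FIELD_MAP is that a richly local scheme stays intact while every property still rolls up to a Dublin Core element for cross-collection search.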
So around 2005 we created the Digital Production Center, and that unit started doing some digitization again. I was involved in the efforts to get that program started, and Sean joined up about a year later. We started doing some manuscripts; this is a draft from Leaves of Grass. It was kind of interesting that, aside from the papyri, we'd never really done manuscripts before. We filled the room up with nice equipment, hired some staff, and started doing digitization in earnest. (You can see what was going to happen with the animations: it was going to fill out as a timeline. I spent hours doing that.)

At the same time we did the World War II ration coupons. They're interesting because of their dimensions, long and narrow, but also because of how you model something like this. We take the full sheet, which is meant to fold up and tear away, and then we do the individual coupons. So how do you model that? What's page one? That was an interesting project.

We did a project of dry collodion negatives, which, if you're familiar with the material, are basically negatives made out of dynamite. You can't handle them in the library; they have to be stored in the freezer, and we couldn't digitize them ourselves. This is a case where we had to send the materials to a vendor to digitize, which presented some interesting problems for us; I'll come back to that later.

And this is the Sidney Gamble collection. Gamble was an heir to the Procter and Gamble fortune who took four trips to China in the 1920s and '30s and made some astounding photographs. These are very popular, and as you can imagine, we get a lot of traffic from China for those.
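The ration coupon problem above is a classic compound object question. As a hedged sketch (not our actual data model), one way to answer "what's page one?" is to make the scanned full sheet the parent item and each tear-away coupon a child component with an explicit sequence:

```python
# Illustrative model for a compound object: a full sheet plus its
# tear-away coupons. All class and field names here are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    label: str        # e.g., "Coupon 1"
    image_file: str   # scan of the individual torn-away piece
    order: int        # explicit sequence, since physical order is ambiguous

@dataclass
class CompoundItem:
    title: str
    full_view: str                       # scan of the intact sheet
    components: List[Component] = field(default_factory=list)

    def sequence(self) -> List[str]:
        """Display order: the whole sheet first, then coupons by rank."""
        return [self.full_view] + [
            c.image_file for c in sorted(self.components, key=lambda c: c.order)
        ]

sheet = CompoundItem("War ration sheet", "sheet-full.tif")
sheet.components += [Component("Coupon 2", "c2.tif", 2), Component("Coupon 1", "c1.tif", 1)]
print(sheet.sequence())   # ['sheet-full.tif', 'c1.tif', 'c2.tif']
```

The design choice is that "page one" is decided editorially via the order field rather than inferred from the physical object.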
Next up, one of the big projects we did was television advertisements. We worked with Apple on this one. It's harder to see at this projection, but right here you can see that we published this collection in iTunes, as a partnership with Apple. That was interesting: we had to model a video album type that we could serialize as RSS feeds, and I think we had 140 of them, each of which had to be hand-published by a staff member in iTunes. Later the Internet Archive asked us if they could ingest the collection, and they didn't need anything from us; we just gave them the data and they ingested it themselves.

So I've talked about advertisements, and I've talked about photographs, and now we have photographs of advertisements. This is a collection of outdoor advertisements photographed by the company that installed them; they would take these photographs and use them as proof that they had completed the work they were contracted to do. This is the Atlantic City Boardwalk leases in the 1930s; you can see there are two ads in this one. What happened with this collection was that the archivist created the metadata before it was digitized, when we had no intention of digitizing it. They built a single table in an Access database that had rows for every single ad in these photographs. When they came to a photograph like this, they'd put in the date and the place, then the product and company information for one ad, then copy that row, paste it, and add the specific information for the next ad. That created a modeling problem for us, because up to then we had treated every single image as an item, and now suddenly we had to treat an image as something that contained advertisements. So we had to create a whole new kind of model for this one, and because that database table was never meant for a digitization project, we had to do a lot of untangling.

Next up, and I believe this was last in my review: oral histories, which we published in iTunes as well, and which needed a different album model. For the advertisements we had one file per advertisement, but the oral history interviews have multiple files, including PDF or Microsoft Word transcripts, so this was a little more complicated. We're actually doing more of these now.

I mention all of this because this process of modeling things is really important to us and to how we present these materials to researchers and users. We consider this modeling process to be part of the chain of provenance for the materials; it's the digital publishing part of the chain of provenance. I'm pleased to cite a definition from our now-retired former colleague Steve Hensen, who was one of the movers behind those early grant-driven collections I talked about. From the SAA's glossary of archival terms: the significance of archival materials is heavily dependent on the context of their creation, and the arrangement and description of these materials should be directly related to their original purpose and function. We feel it's really important, when we're preparing these materials for digitization and publication on the web, that we're able to reflect as much of the arrangement and description of the materials from the archives as we possibly can. And when we start to talk about discovery in just a few minutes: we want that work to be reflected in the discovery platforms that users use, as much as we can possibly make it so.

This is kind of an interesting example. We had no idea what was going on in this photograph, which is from the Gamble collection. There's this guy lying on the ground, and there's another guy, blindfolded, who seems to be hitting him with something, and it had this caption. The Gamble photographs came with labels; every single one was in a sleeve with a label on it, and when we sent them to the vendor, we asked the vendor to digitize the labels, because we had no idea. You can't look at photographs of China in the 1920s and, unless it's the Forbidden City or some recognizable landmark, have any idea how to describe the materials. This was all we had, so when we published this collection we decided we needed to publish these labels, and we put in this kind of disclaimer, "label text derived from Gamble's handwritten notes," because this is how we got it from the donor who gave it to us. Sean and I gave a version of this presentation a couple of weeks ago at the School of Information and Library Science at the University of North Carolina, and within five minutes of me showing this slide, one of the students had found it and explained to us that there was a game called Are You There, Moriarty?, which involved lying on the ground blindfolded. She called it Marco Polo, but with hitting. I thought that was great, and I thought, if the world were only fair, we could have hired her right there, just offered her a job.

So that was a modeling challenge. But now, how do we model this? This is an anatomical flap book that was used in the late 19th and early 20th centuries to teach anatomy. It's kind of like a pop-up book, as you can see, and it's very elaborate. This is a project we're about to take on, and it may be the most challenging modeling exercise yet. And then, just real quickly, a couple of other things that we're working on: some newspapers that we're going to publish, and newspapers have their own publication issues and concerns;
we're publishing the Duke Chronicle, and then some early manuscripts, which is also a very complex project. So there, I've done it again.

This brings me to the main issue of our presentation today: how do we help users discover these materials? It's something we've wrestled with a lot over the years, and as I said a couple of slides ago, we put a lot of work into the curation of these materials, and we really want that work to assist users in finding them as much as we possibly can.

I'm going to tell you about a project we work on at Duke, the Tripod2 platform. It's a do-it-yourself digital collections platform; I'll explain a little more about the architecture in just a moment. Sean and I are essentially the lead designer and the lead programmer on this project, and as you're all aware from the academic library environment, where demands on developers are great and developers are scarce, being the leads means we're the only developers. We get contributions from other folks now and then, but we've basically been the caretakers of this platform for a while. It supports a number of different content types: digital collections, archival finding aids, and a couple of others I won't go into.

This is what our software stack looks like, all of it; I'm not going to go into everything on this slide. We have XML as the data layer; Solr and Python in the middleware, where Solr is an indexing and search engine platform and we use the Django web development framework to build out all of our different content models; and then the UI layer is HTML, JavaScript, and CSS. Now I'm going to pull that red-boxed section out to highlight some things.
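To make the middleware layer concrete: roughly, a stack like the one described turns item metadata managed in Django into flat documents for a Solr index. This is an illustrative sketch only; the field names and suffix conventions here are invented, not Tripod2's actual schema.

```python
# Hypothetical flattening of a nested item record into a Solr-style
# document that uses dynamic-field suffixes for typing.
def to_solr_doc(item: dict) -> dict:
    """Flatten nested item metadata into flat Solr-style fields."""
    doc = {
        "id": item["id"],
        "title_t": item["title"],                # *_t: tokenized text field
        "collection_facet": item["collection"],  # *_facet: string facet field
    }
    for name, values in item.get("metadata", {}).items():
        # Normalize every descriptive field to a multi-valued text field.
        doc[f"{name}_t"] = values if isinstance(values, list) else [values]
    return doc

item = {
    "id": "gedney-001",
    "title": "Boy leaning on car",
    "collection": "William Gedney Photographs and Writings",
    "metadata": {"creator": "Gedney, William Gale", "date": ["1955"]},
}
print(to_solr_doc(item)["creator_t"])   # ['Gedney, William Gale']
```

The flattening step is where collection context can be preserved (the facet field) so that search results still reflect the archival arrangement.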
This is where we've divided things into discovery and access. Access is when you get an item on a page or a collection portal page; discovery is where you're getting search results: you type in search terms and you get results. And this discovery block is really the most complex part of what we do; the access part is relatively easy for us. Sean is going to talk a lot about usage data, but one thing about this discovery functionality is that it's used in only 12% of the visits to the digital collections website. Yet every time we want to do anything with this application, Tripod2, we spend most of our time working in this area, rewriting the same application piece over and over again. We really got tired of it and wanted to get away from it; and in the end, it's just not that important to our users.

So we thought about plan B, and we tried to use the Endeca system that we use for our library catalog. It's a commercial faceted search engine; a lot of e-commerce sites use Endeca. The consortium of which we're a member in the Triangle area of North Carolina, the Triangle Research Libraries Network (Duke, UNC, North Carolina State, and North Carolina Central Universities), has an instance of Endeca supported by TRLN, and it's also the platform for our library catalog. This may seem kind of drastic (that animation was supposed to drop out), but what we thought we'd do was basically a discovery transplant: take out that custom-built piece we had made and replace it with Endeca. So we worked on it, and it turned out pretty well. This is kind of what it looks like now. With a little bit of work with the TRLN staff, we were able to integrate our thumbnails; it's got facets over here; you can even see the collection context there in the search results. But it's kind of limited, and it really didn't give us
everything that we wanted. In fact, the end result worked out more like this: it really just takes a little nibble out of that functionality, and it doesn't bring that much traffic to the website.

So then we thought: we've been experimenting a little with Google; we've used it as a website search for a while. And that's really what we're here to talk about today. We had this idea: what if this will work? It's really an experiment, and there are three pieces to it: schema.org, sitemaps, and site search. Schema.org is a technique for embedding structured data in HTML pages; it gives you the use of things and not strings, it's really a linked data platform, and it potentially, theoretically, enables the use of rich snippets, which Sean will show you more about. Rich snippets are when you search for a Best Buy in your neighborhood and, in the Google results, you see the phone number and the address of the store: Google is using rich structured data from Best Buy's web pages to display that information for you in the search results. Sitemaps is another framework, one that allows you to do targeted indexing with Google: you're basically telling Google what you want it to index and when you want it indexed. And site search is a way to pull the Google tools into your own local environment.

I'm going to leave you with one last slide, and it's really three numbers that I use to think about this problem. 80% is the estimated share of the time Sean and I spend developing this platform that goes to working on the discovery functionality. 12% is the share of visits to the digital collections site that involve the use of that functionality. And 70,000 is the approximate number of items in digital collections. If we think back to that collections grid: on those 70,000 items we put a high level of
stewardship on those materials, but there are a lot of other resources in the library's collections that are either unique to our library or over which we've provided a similarly high level of stewardship. Can we pull those materials into an approach to discovery as well? So in the end, the problem is: how can we extend the impact of our work? And I mean that in two ways. First, how can we reach more people while spending less time working on the platform? And second, how can we pull more of the library's resources into a search framework, or enable other content providers in the library besides digital collections to build on a framework that would allow their materials to be searched and discovered by users together with the digital collections? With that, I'm going to turn it over to Sean, who's going to talk about solving this problem in more detail.

I'm going to try one more time to get the slideshow to project. Could be you're a lot nicer to Google than I am. It's loading. Oh, maybe that's it: just wait until it works. That was probably it. Patience. Okay.

So I'm going to talk about discovery. I'm going to get to schema.org in a minute, but I want to set the table first. Lorcan Dempsey from OCLC, one of the great thinkers of library discovery, really helps us frame the problem we're trying to solve and gives us a call to action. The problem is: these materials are great, there's some great stuff in there, but they don't have their own gravitational pull so compelling that everyone will come try to find them where they are. So it becomes important to actually care about SEO, to actually syndicate your metadata to the other hubs that are in users' more natural workflows. We can't just go it alone and expect people to find the library website and find our materials. Dempsey says libraries have to take a more active approach here: do something. We can't just put our content up and expect that Google will have the correct
representation of it, or expect that any of these other clients will interpret the information the way we want it interpreted. Herbert Van de Sompel mentioned this in his presentation at the opening plenary; this is another Dempsey concept, inside-out and outside-in collections. All library resources are not created equal; they don't need the same strategy behind discovery. Books and articles, the bought and licensed materials, are outside-in resources: things that are out in the world that the library is trying to provide access to for its internal audience. But this corner that Will talked about, where we're working, is inside-out resources. These are materials that are distinctive, internal to us; only Duke has them, and we want to optimize their discovery for everyone in the world outside the walls of the library. So different resources involve different strategies, different ways to think about discovery, in this particular domain that we're in.

In the past year, even with a passive approach, only about a quarter of the visits to our digital collections site came via anything we do on our own sites, whether that's the library catalog (through which all of the digital objects are accessible), the library website, or librarians' research guides. No matter what we do, the traffic largely comes from other places: referrals from hubs like Wikipedia and Facebook, and then search traffic, which is a very significant number. Over 30% more traffic is coming to this material through search engines than is coming through all our own efforts on our own sites, and obviously you don't need three guesses to figure out which search engine dominates those statistics. Google matters the most. So Google is important strategically, important for us to think about and to actively create a strategy around. We have to optimize the
representation of our materials in Google. When we talk about that, it's SEO, but SEO is different than it was five years ago. Some people hear SEO and they cringe and say, that's all about nefarious schemes to set up link farms and keyword stuffing. But SEO doesn't have to be a bad word, and particularly within the last few years, SEO has evolved to the point where linked data and structured data within your pages is the hot area. We're really interested in that. We're not interested in trying to game Google, or trick Google into believing things about our collections that are not true; we are interested in most accurately representing, semantically, what objects we have in our collections, in a way that Google understands.

So the first challenge is telling Google which of our pages, and which images, to actually add to the index. Will mentioned sitemaps: sitemaps.org was an effort that Google collaborated on with the other search engines, an XML standard for indicating exactly which pages you want indexed, and when, and which images within those pages you want indexed. And Google has really good Webmaster Tools that let you submit those sitemaps and see how much of your material is indexed: a whole suite of tools, including some structured data testing tools. We're getting a lot out of using Webmaster Tools; I can't believe we hadn't used it until just this past year. Highly recommend it.

So if we've indicated which pages we want indexed, the next challenge is: how does Google know what's in those pages? What kinds of objects are represented? A human being looking at this page can very easily know that this is a photograph that has been digitized; that it was taken by William Gedney in about 1955; that it's from a collection called the William Gedney photographs and writings collection. All of those things, from the way this site looks, a human being can understand.
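To make the sitemap piece concrete, here is a sketch of what a sitemap with Google's image extension can look like. The URLs and values here are hypothetical, not Duke's actual paths.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: one <url> entry per page, with the image(s) on that
     page that should also be indexed. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.edu/digitalcollections/gedney/item-001/</loc>
    <lastmod>2013-05-01</lastmod>
    <image:image>
      <image:loc>https://example.edu/media/gedney/item-001.jpg</image:loc>
      <image:title>Boy leaning on car, ca. 1955</image:title>
    </image:image>
  </url>
</urlset>
```

A file like this, submitted through Webmaster Tools, is what lets you see exactly how many of your pages and images Google has actually indexed.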
A machine can't, without a little bit of help. This is the old-school way of markup. There's some structure here: there's a heading, there's an image, there's an unordered list with a couple of list items. There's structure, but there are no semantics. The machine isn't going to know that "Creator: Gedney, William Gale" means that's the guy who made this thing.

So enter structured data in HTML. This is a hot area, an area the search engines are all aboard with. It's Berners-Lee's semantic web concept; we're not 100% at the vision yet, but things are accelerating, things are getting a little easier, and we're seeing a lot of progress, getting closer and closer to that vision. RDF is not a new concept; it's from the late '90s, and it's a very good foundation for the semantic web idea, but it's also kind of complicated. If you're coming at it as a web designer, like me, it's verbose; it's complicated to pull off. Over time there's been more convergence between RDF and HTML representation, and it's just gotten easier: there are flavors of representing structured data that are way easier to wrap your head around than full-blown RDF, especially RDFa Lite, a newer flavor that is so simple to do that I can't imagine not doing it anymore for any future project.

For any idea to take off, structured data included, it has to be easy enough that people are actually able and compelled to do it, and there also has to be some killer app or motivation for doing it. It's one thing to have structured data in your pages, but if you see no returns or results from doing it, why bother? There are killer apps here too: two companies that really care about linked data right now, both important, both providing some motivation. Google, about a year ago, created the Knowledge Graph, and the slogan is awesome, "things, not strings." It's such a
great way to summarize, in one small slogan, what linked data is and what it can do. Here we have a search for James B. Duke, and instantly, on the right-hand side, we're getting information about his vitals, who he was, who's related to him: all of that is linked data that Google ingested from somewhere on the web, whether DBpedia or Freebase. Google cares about and uses this linked data in a way that enhances its search results. It's in Google's best interest to do it, and it's in everyone else's best interest to provide information that Google can then use.

Beyond the Knowledge Graph presentation, you'll see this concept of rich snippets. And Rich Snippets is not just the name of the guy at Google who makes your results look pretty; it's a really compelling presentation of results based on what's actually represented in the page. An example is this vanilla French toast recipe, where right in the snippet we can see that this is a recipe that's going to take you 10 minutes to make and that has 332 calories, and we can see every ingredient. That's really helpful information to see right in the result if you're looking for recipes. And beyond the presentation, Google uses that same structured data to provide search tools for filtering your results: in this case, if I'm looking for a French toast recipe and I love cognac and I don't like bananas, I can check those yes and no boxes and refine my results to just the things I would like. That's really helpful.

Facebook has created a vocabulary that you can implement using RDFa: the Open Graph. It's a way to make your web pages play nicely in the social sphere, to turn web pages into social objects. We're using Open Graph tags in our site now, and it makes things really easy: when I want to link to an object from digital collections, Facebook, using the Open Graph tags, knows exactly what image and what description to use. So both Facebook and Google are on the structured data train, and we have mechanisms now for putting structured data into our web pages.
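As a sketch of what the Open Graph tags just described can look like on an item page (the values and URLs here are illustrative, not our production markup):

```html
<!-- Open Graph tags live in the page head; Facebook reads them to pick
     the title, image, and description when the page is shared. -->
<head prefix="og: http://ogp.me/ns#">
  <meta property="og:type" content="article" />
  <meta property="og:title" content="Boy leaning on car, ca. 1955" />
  <meta property="og:url" content="https://example.edu/digitalcollections/gedney/item-001/" />
  <meta property="og:image" content="https://example.edu/media/gedney/item-001.jpg" />
  <meta property="og:description" content="Photograph by William Gedney, from the William Gedney Photographs and Writings collection." />
</head>
```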
The question then becomes: what vocabulary do we actually use to describe the things we're talking about on those pages? Schema.org is still relatively new. A couple of years ago, the major search engines got together and created a vocabulary that people who have websites can use to describe the things represented on them. The vocabulary is very search-engine-centric, with things that are kind of universally applicable to a lot of different websites. Everything can have an image, a name, a description, a URL, and then you have subclasses of things: CreativeWork is a specific kind of Thing you can describe using the schema.org vocabulary, and there are more specific kinds of CreativeWork that have a few other properties you can use from the vocabulary to describe what they are. It's not a vast vocabulary that's going to cover all of people's needs for describing things on the web; it's very limited. But it's there, we can already use it to mark up the materials that we have, and I'll share a little more of that.

So, since we're talking about rich snippets: what kinds of things from the schema.org vocabulary actually give you rich snippets on the big Google search results screen? There's actually a lot: events, music, organizations, people, products, recipes, reviews, software, videos, breadcrumbs from sites. That sounds like an impressive list, and a lot of people can realize benefits from marking up those kinds of things. With the kinds of digital objects we're trying to represent, we don't have a lot of alignment here, not a lot of things that will instantly get us rich snippets by doing the markup. But as Will mentioned, while one of our aims is improving our representation in big Google, we're also working toward a localized Google experience of our own.
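As a condensed, hypothetical sketch of this kind of schema.org markup in RDFa Lite (the attributes and schema.org terms are real vocabulary; the values, paths, and layout are invented for illustration):

```html
<!-- Illustrative RDFa Lite for a digitized photograph item page. -->
<body vocab="http://schema.org/" prefix="dc: http://purl.org/dc/elements/1.1/"
      typeof="ItemPage">
  <div typeof="Photograph" resource="#item">
    <h1 property="name">Boy leaning on car</h1>
    <img property="image" src="/media/gedney/item-001.jpg" alt="Boy leaning on car" />
    <dl>
      <dt>Creator</dt>
      <dd property="creator">Gedney, William Gale</dd>
      <dt>Date</dt>
      <dd property="dateCreated">1955</dd>
      <dt>Identifier</dt>
      <dd property="dc:identifier">gedney-001</dd>
    </dl>
    <!-- Relate the item to the collection it comes from -->
    <a property="isPartOf" href="/digitalcollections/gedney/">
      William Gedney Photographs and Writings
    </a>
  </div>
</body>
```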
That's where we can build our own rich snippets, regardless of whether Google turns them into rich snippets for everyone on big Google. So, using the schema.org vocabulary on the item page that I showed: at the very root level, this web page is an ItemPage, which is a type of WebPage, which is a type of CreativeWork. The ItemPage has a CreativeWork represented in it; the CreativeWork is the photograph; the photograph has a MediaObject, which is the JPEG, the digitized version; and the CreativeWork has all this metadata at the bottom, where we can again use schema.org properties to describe the fields.

Here's an interesting use of schema.org: relating this particular page, or this photograph, to the collection it's from. With some RDFa we can say property="schema:isPartOf" to make a predicate that says this page is part of this other collection page. Then, down in the metadata, is where some of the magic happens, with more RDFa properties. For example, for William Gedney's name in the metadata field, we can just add property="schema:creator", and that instantly maps his name to the creator field for the CreativeWork in the schema.org vocabulary. We can mix things in here, too: you can see that for an identifier we use dc:identifier, representing some of the metadata in Dublin Core. You can even mix vocabularies on a single element. We haven't done this yet, but for William Gedney's name we could very easily put property="schema:creator dc:creator". So we're not just using schema.org as a replacement for the Dublin Core that we already have; the two can work in tandem, and RDFa makes it really easy to mix vocabularies that way.

So: we've got things represented, we're using the vocabularies we want to use, and we've got the images and pages indexed that we need indexed. The challenge we turn to next is putting Google back into our site, putting Google in our stack.
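A minimal sketch of what that RDFa can look like; the paths, identifiers, and exact choice of properties here are illustrative, not our production markup:

```html
<body prefix="schema: http://schema.org/ dc: http://purl.org/dc/terms/"
      typeof="schema:ItemPage">
  <!-- The photograph is the CreativeWork this ItemPage represents -->
  <div typeof="schema:Photograph">

    <!-- Relate the item to its parent collection page -->
    <a property="schema:isPartOf" href="/digitalcollections/gedney/">
      William Gedney Photographs
    </a>

    <!-- The digitized JPEG, the item's media representation -->
    <img property="schema:image" src="/media/example-item.jpg"
         alt="Digitized photograph">

    <!-- Metadata fields: schema.org and Dublin Core, even mixed on one element -->
    <span property="schema:creator dc:creator">William Gedney</span>
    <span property="dc:identifier">example-identifier-01</span>
  </div>
</body>
```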
And if Google is part of our stack, that's actually pretty good for us, because it makes it really hard for us to neglect our presence on Google, and we really, really need to care about and be aware of what our users are experiencing as they use Google to find our materials. Some people call this dogfooding: we want to use the products that our users are expected to use, and that way we gain some empathy for their experience. Before we started seriously looking at this, there was a stretch, for too long, when we had accidentally been blocking Google's image crawler, and we had nothing indexed in Google Images. That's exactly the kind of thing you notice if you're actually relying on Google as part of your stack.

So, Google Custom Search. If there are concerns that Google is the anti-library, or the enemy, well, it's not a huge stretch that we'd be talking about adding Google to the stack: we already use Google for the library website search. Google Custom Search, which is free, powers it, and it's great; we get relevance ranking, and digital collections are indexed in the library website search and come back in its results. So we're already partway there; now we're taking it a little farther. We could just use free Google Custom Search, with no structured data, and let that be the digital collections interface. It would get us part of the way there, and it's pretty good. But what it would take away is the ability to search on particular properties. Advanced searches, facets, and finding by properties are important functions that we need to support, and out of the box, Google Custom Search is not quite good enough. Google Site Search is a highly customizable version of Google Custom Search. It costs a little money, but it's not prohibitively expensive, and it offers several APIs for interacting with the data, which is cool, and also frustrating at times, because we have so many choices.
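To give a flavor of developing against one of those APIs, here's a sketch in Python of calling the XML API and building our own result snippets from the structured data it returns. The engine ID and the sample response below are invented for the example; the parameter names and the general GSP/R/PageMap response shape follow the Site Search documentation of the time, but a real integration would need checking against your own account:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def xml_api_url(cx, query, start=0, num=10):
    """Build a Google Site Search XML API request URL.

    cx is the custom search engine ID from the account (fake in this
    example); output=xml_no_dtd asks for XML results without a DTD.
    """
    params = {
        "cx": cx,
        "client": "google-csbe",
        "output": "xml_no_dtd",
        "q": query,
        "start": start,
        "num": num,
    }
    return "https://www.google.com/cse?" + urlencode(params)

# A trimmed, hand-made response in the general shape the XML API returns:
# <R> is one result, <U>/<T> its URL and title, and <PageMap> carries the
# structured data Google indexed from our schema.org/RDFa markup.
SAMPLE_RESPONSE = """\
<GSP>
  <RES>
    <R>
      <U>http://example.edu/digitalcollections/example-item/</U>
      <T>Example soda advertisement</T>
      <PageMap>
        <DataObject type="creativework">
          <Attribute name="creator" value="Example Advertising Agency"/>
          <Attribute name="datecreated" value="1955"/>
        </DataObject>
      </PageMap>
    </R>
  </RES>
</GSP>
"""

def custom_snippets(xml_text):
    """Build our own 'rich snippet' strings from each result's PageMap."""
    snippets = []
    for r in ET.fromstring(xml_text).iter("R"):
        fields = {a.get("name"): a.get("value") for a in r.iter("Attribute")}
        snippets.append("%s (%s, %s) - %s" % (
            r.findtext("T"),
            fields.get("creator", "unknown creator"),
            fields.get("datecreated", "undated"),
            r.findtext("U"),
        ))
    return snippets
```

The point of the sketch is that once the structured data round-trips through Google's index, the properties we expressed in RDFa come back as named fields we can lay out however we like.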
Here's an example of the work we've been tinkering with. This is using one of the JavaScript APIs, where we can do a search on "Pepsi" and then make our own rich snippets: you can see we're pulling back particular properties that we expressed in RDFa using schema.org properties, and spitting them back out into the interface. Here's an example using the XML API, which shows that we can query Google and get back an XML representation of what they indexed from our site, from the structured data we embedded, and we can develop against that XML; we've experimented with this as well.

And then there are other goodies, too. With Google Image Search you also get APIs to the image search tools. A lot of our materials are image-heavy and have really compelling images, and Google gives you an API to their filtering tools for images, so you can have things like search filtered by faces. This is just the faces of digital collections, based on what the Google image index has determined to have a face in it. No librarian is going through saying "there's a face, I've got to add a face metadata field"; it's all determined automatically by Google's algorithms. Same thing with color: search by blue. These are tools we get by using the Google APIs that we don't have to develop ourselves, with metadata we don't have to create.

So far, I'll share some of the lessons from working with this data and with this idea of adding Google to the stack, relying on it as a more central part of what we're doing for discovery of these materials. It's a little harder than we thought. Even two months ago we would have guessed that we'd have this live on the site by now, and you'd all be able to click around on it as you watch this presentation; we're not quite there. It's harder than we thought because of all the choices we have to make. One of those choices is how to express the structured data: there's microdata, there's RDFa, and there's RDFa Lite.
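For a sense of what those three choices look like, here's the same creator field written in each syntax (illustrative markup, not our production pages):

```html
<!-- Microdata -->
<div itemscope itemtype="http://schema.org/Photograph">
  <span itemprop="creator">William Gedney</span>
</div>

<!-- RDFa, with an explicit prefix declaration -->
<div prefix="schema: http://schema.org/" typeof="schema:Photograph">
  <span property="schema:creator">William Gedney</span>
</div>

<!-- RDFa Lite, with a default vocabulary -->
<div vocab="http://schema.org/" typeof="Photograph">
  <span property="creator">William Gedney</span>
</div>
```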
We've experimented with each of these approaches, and Google picks each one up in a different way, presenting back to us slightly different structured XML to work with; we're still not 100% sure which of the three is optimal for our needs. Then there are the mappings: which fields of the schema.org vocabulary actually make sense as an accurate crosswalk to what we have? Those are decisions we need to work through. There's also the question of how to mix the vocabularies correctly and have all of that represented right in the presentation layer. As Will mentioned, at the data layer everything is in METS, and we use qualified Dublin Core; we can't just spit that out at the presentation layer, but we do want to mix in the vocabularies we want to use: schema.org alongside those library-specific vocabularies. And then the three API flavors have also been a challenge, because they each give you different capabilities for interacting with the data and developing against it. So there are a lot of choices to work with, and they're not very well documented, which is another challenge for us.

The indexing isn't on demand, so as we do this trial and error, a lot of the time we'll build a test page, submit it, and then have to wait three days to see whether it actually did what we wanted. And the images are getting indexed at a really slow rate; the pages get indexed pretty quickly, but the image lag is frustrating for our testing of the image search tools. As for the big Google rich snippets, maybe we expected this, but they're mostly elusive: we don't have a lot of examples of things that look different in big Google searches now that we have schema.org markup on our pages. I will share one success story, though. This is that AdViews collection that Will talked about, where we have these albums of videos of commercials.
Pre-schema.org, we had the really standard snippet, with the title and a little snippet of text. After we added schema.org tags, I think we marked the albums up as VideoGallery objects, Google's presentation says how many items there are, "20+ items", and then actually starts listing the commercials that are included. So this is an encouraging example that some of this work will lead to better snippets in the big Google experience, and Google promises to do this for more kinds of objects marked up with the schema.org vocabulary in the future; we'll see if we can get more than this.

So, what's next? At some point we've got to build this so users can actually start using it on the live site. Hopefully within the next couple of months we'll feel it's cooked enough to share, and hopefully you all might take a look at that point and give us some feedback. We have some partners from other projects who are also looking at this area. NC State Libraries, as usual, is on the cutting edge: Jason Ronallo, who leads digital library initiatives there, is one of the big leaders in this field right now looking at schema.org and digital collections, and I think we're meeting with him soon to compare notes and see where there are collaboration opportunities, everything from establishing some best practices to adding new kinds of properties and new kinds of objects to the schema.org vocabulary. For example, when schema.org was first released, the newspaper industry was working on a standard called rNews, a vocabulary for marking up online newspaper articles. Right after schema.org was released, the newspaper industry got all of their vocabulary added to the official schema.org vocabulary. You know, how many standards do you want? It was nice that they could just use what was there and add to a standard that others could use.
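Roughly speaking, the AdViews album markup mentioned earlier was along these lines; this is a simplified, illustrative sketch with invented titles, not the exact production markup:

```html
<!-- An album page typed as a schema.org VideoGallery,
     with one VideoObject per commercial in the album -->
<body vocab="http://schema.org/" typeof="VideoGallery">
  <h1 property="name">Example AdViews album</h1>
  <div property="video" typeof="VideoObject">
    <span property="name">Example cereal commercial, 1962</span>
  </div>
  <div property="video" typeof="VideoObject">
    <span property="name">Example soft drink commercial, 1963</span>
  </div>
</body>
```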
So, we need to measure: is there any impact? We know that rich snippets are a possible outcome. Is structured data actually used as a signal of relevance? Does it help with click-through when people get rich snippets? We're starting to measure these things: we have baselines established, and every quarter we measure and take screenshots of the search results for key landing pages. Then, locally, we assess whether this actually gives our users a better discovery experience, and we have a lot of ways we do that at Duke. And finally, do we as developers end up spending less time maintaining this part of the stack if it's Google and not Solr?

Here's where we are right now. We've done the first three things: we've got the baseline metrics, we've got schema.org markup, and we've got Google indexing the materials, and we're working with the results. There's a lot of back and forth between steps two, three, and four; it's trial and error, where we try something and adjust. Hopefully we'll get to deploy on the site within a month or two, we'll be engaging other partners on best practices and talking about the vocabularies, and then obviously we'll keep assessing what we're doing and continue to refine it. So that's where we are. Hopefully by the summer we'll be able to share a lot more with you all about what we've found, and share some links so you can try it out and let us know what you think. I think that's all we've got, and we've got about 10 minutes. Okay, thank you all for coming, and for those who stood in the back.