My name is Daniel Pitti. I'm Associate Director of the Institute for Advanced Technology in the Humanities, and my colleague to the right over here is Brian Tingle from the California Digital Library. What we're going to talk about is Social Networks and Archival Context, which is currently a project, but I'm going to discuss a little bit about trying to turn it into an ongoing program — from R&D to a real program. The way we're going to go about discussing it this afternoon is: I'm going to talk about context and an overview for SNAC, as we call it, and a bit about the research and demonstration phase of it. Then I'm going to turn it over to Brian to talk about the historical research tool, which is part of the research and demonstration component as well, but is actually the more interesting part of what you'll see today, I hope. And then finally I'm going to talk a little bit about where we are with respect to turning what we're doing into a cooperative program. If I was assured that all of you were archivists I wouldn't do the following couple of slides — or I might anyway, and insult the intelligence of most people — but I think it's good to tell people what archival records are. The standard definition of archival records is that records are the byproducts of people living and working as individuals, in organized groups, and in families. Records frequently are tools; they're utilitarian. They're the things that we put together and use in the process of doing our jobs and living our lives. So if you're building a digital library, it's not the digital library itself — it's all of the email and discussion notes and everything else that goes into planning it and implementing it and the rest. It tends to be those kinds of documents. And so records necessarily document people living and working.
And since we exist in social, professional, and intellectual contexts — we know people here and now, we interact professionally, we have family relations, etc. — we also have intellectual relations: there are lots of people that we read that we've never met socially, but they still have an impact on us, and sometimes these people are long dead. So that falls into the context of who we are, and the records document that interaction among us as social animals. The way archivists go about describing records, and the way they've done so traditionally, is that they describe records in context. They may not always have called it that, but — and this isn't this project, it's something else I'm involved in — Records in Context will, I think, soon become a ubiquitous way of describing how archivists do their describing. At the core of it is the principle of provenance, which is to say that all the records that Daniel Pitti creates over the course of his lifetime are kept together and described together, and one attempts to maintain their original order — which is not necessarily their physical order, but the logical order, the interrelation they have to one another with respect to how they interconnect and are used by me. And then archivists also describe the context: if someone were to describe my records, they'd say something about me, so you would understand who I was, what professional activities I engaged in, etc., in order to understand those records. Archival descriptive practice to date — and it's likely to remain dominated in this way — centers on a single apparatus called a finding aid.
These things are pretty long, and they describe all of the records that are kept together for one creator in one document. They start by describing the whole and then descend, describing parts of the whole, and only rarely do they actually get down to item-level description — that's more the exception than the rule. In part that's because of the economics of it, but it's also because if you describe everything at the item level you lose the context within which those records sit, and if you lose that context you lose the sense of what the records mean. In this single apparatus they describe the creators of the records — as I mentioned a moment ago, a bit of biographical information, the name, what occupation or occupations they had, etc. And then many of the people documented in the records — which is to say, not the creator of those records, but people that the creator interacted with in one way or another — are also documented in those records. So these records become really primary evidence for understanding the social networks within which people exist — the primary evidence of what that social network was. And so I've come to think of this in terms of a hyphenated term, the social-document network, which I borrowed from Alan Liu, an English professor at the University of California, Santa Barbara, who's very much interested in historical social networks, but also their connection to contemporary social networks, and the way in which we're bound up together through the documentary evidence as well. So archives represent a vast social-document network connecting the past to the present to the future. If I had sound connected, that's when you'd have some sort of soaring music to go with this — I should have said it more dramatically in order to evoke chills or something along those lines. So what have we set about doing? I'll describe what we're doing in a moment.
We started in 2010 with funding from NEH, and then, overlapping that, we got money from IMLS. The first part was R&D from NEH; IMLS was more on the planning side — we were already beginning to think about how we could transition this to a cooperative program. And then there was additional money from the Mellon Foundation, both for the R&D and also for additional planning. The partners in this are the US National Archives and Records Administration; the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia; the University of California, Berkeley School of Information; and the California Digital Library. Again, there are two complementary activities: research and demonstration, and cooperative planning. The R&D objective is to demonstrate that the data describing people that exists in archival descriptions — what's embedded in these monolithic descriptions — can be extracted out, and that we can take this existing data and use it to address the challenge of finding, discovering, locating, and understanding distributed historical resources. And at the same time — it wasn't an initial purpose, but it arose as a purpose once we saw what the results were — we're laying the foundation for an international cooperative for maintaining this data that we extracted out, interrelated, and reconfigured. For the R&D we had 2.2 million WorldCat archival descriptions — "archival description" being loosely defined, because there's no precise way of getting these out of MARC records, so some of the things you'll find in SNAC are in fact not, strictly speaking, archival, but they hover around the edges. We had nearly 190,000 EAD-encoded finding aids, primarily from the US and UK — mostly from the US, about 30,000 from the UK — and a handful from France for experimental purposes, from the BnF and the CCFr, a national union catalog run by the BnF for the academic research archives in France.
And then 300,000 British Library authority records that are associated with the manuscript collections of the British Library, dating back to before the common era. And NARA authority records; we have agent descriptions from the Smithsonian Institution Archives and the New York State Archives, and a variety of other things as well, but that's the bulk of the data. So what are we doing in the R&D? We extract data out of those monolithic descriptions and assemble it — or migrate that data — into Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF), which is an archival communication standard. We're using all of the data I just described a moment ago as the source for this. Then, once we have those CPF records, we match them against one another — although that's not quite true; that hasn't really happened, because they've primarily been matched against existing authority records in the Virtual International Authority File. And from VIAF we enhance the descriptions that we have with normalized entries, adding alternative entries and titles gathered from VIAF, and also picking up links to Wikipedia, WorldCat Identities, and other sources. And then finally — and this is the part that Brian will describe — we're creating a prototype historical resource and access system. It's a resource because it's full of biographical information; you could just go there for that. And it's an access system because it also provides integrated access to all of the archival descriptions that we used as sources to extract these names. The primary challenge in all of this, from a technical point of view — I mean, at the back end, at the extract-and-assemble stage, there are lots of challenges, and Brian can describe the lots of challenges for the prototype research tool — but one of the big ones in between is that we pull out names, and there are different names for the same person and different people with the same name.
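Stepping back to the extraction step: a single extracted description, serialized as EAC-CPF, might look roughly like the following simplified fragment. This is hand-written for illustration — not an actual SNAC record — and omits several elements the schema requires; element names follow the EAC-CPF standard.

```xml
<eac-cpf xmlns="urn:isbn:1-931666-33-4">
  <control>
    <recordId>example-0001</recordId>
    <maintenanceAgency>
      <agencyName>Example Repository</agencyName>
    </maintenanceAgency>
  </control>
  <cpfDescription>
    <identity>
      <entityType>person</entityType>
      <nameEntry>
        <part>Washington, George, 1732-1799</part>
      </nameEntry>
    </identity>
    <description>
      <existDates>
        <dateRange>
          <fromDate standardDate="1732">1732</fromDate>
          <toDate standardDate="1799">1799</toDate>
        </dateRange>
      </existDates>
    </description>
  </cpfDescription>
</eac-cpf>
```

Records like this, one per extracted name, are what then get matched against VIAF and merged.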
It becomes the issue of identity, which ultimately becomes a rather profound epistemological challenge. It's most certainly a challenge for computers, and it's a challenge for people. And a lot of times it's really a challenge because of the sparseness of the available evidence, because name strings in and of themselves are weak identifiers. In that first part of the process we extracted 6.3 million CPF records describing persons, corporate bodies, and families — the bulk of those being persons, with corporate bodies lagging behind, and a fairly small number of families. And then after the merge processing against the VIAF records, about 3.5 million — but this is slightly out of date, because we're still loading these things through, and Brian's going to show you a number that's up to about 3.6 million at this point. So it's a whole lot of them. And so, over to Brian. (While I'm switching over for Brian — what was it I was going to say? I won't say it. Brian's there. In Acrobat, full screen is under View, I believe. You have full screen now.) Hello. If you want to follow along at home or on your phone, you can actually view the site — if you Google "SNAC" you can get to it without typing this big URL in. How many people saw the first version of SNAC? A few. Have people seen the second version? So you've seen somewhat what changed. This is the home page of SNAC. About one and a half percent of the records — about 40,000 — have Wikipedia thumbnails, and when you come to this page you can refresh and get random samples of those records. When you do a search, there's autocomplete against all of the names in there.
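The name-matching challenge Daniel describes — many strings, weak identifiers — is typically attacked by normalizing name headings before comparison. A minimal sketch, with hypothetical helper logic (real SNAC/VIAF matching is far more involved, weighing dates, titles, and co-occurring names):

```python
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Reduce a catalog-style name heading to a crude match key."""
    # Strip accents so "Bronte" and "Brontë" collide.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Drop trailing life dates: "Washington, George, 1732-1799".
    text = re.sub(r",?\s*\d{3,4}\s*-\s*\d{0,4}\.?$", "", text)
    # Lowercase, drop stray punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s,]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Two headings for the same person produce the same key.
assert normalize_name("Washington, George, 1732-1799") == \
       normalize_name("Washington, George")
```

A shared key is only a candidate match, not an identity — which is exactly why the sparse-evidence problem above remains hard even after normalization.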
This page has a lot of stuff going on, but we tried to color-code it and use icons to hopefully make it easier for a researcher to understand what's going on here. You can see we've got person, family, and organization icons, and we use those in this record type column, so hopefully that makes clear what they're looking at. Some of the records have biographies; some of the records have Wikipedia links. We also have limits now on that, so you can drill down to those records, and that's all color-coded and icon-coded in there. So there's a lot of stuff going on here, but hopefully it makes it easy to digest. Here we've done a search for George Washington — it looks like this first one's our guy here. I want to also mention: we had facets in the old site, but we redesigned how the facets work in this site, and we've added a facet for location. So if you want to track down where your records are showing up in the SNAC site, that's a tool to see where your records have ended up; you can apply that facet on any search. There's another trick on the site: if you search for nothing, you get everything. A blank search will bring you back all the records, and then you can also do the browse from there, if you wanted to look at location or the other browses. And we do have an A-to-Z browse of three million records — it's just a list that you can go through. Now, going down into a record: this one is a little abnormal because it's got a lot of "maybe same as" relations. Normally that's not taking up so much space, but this is the record where we've merged stuff together from all of these sources. We've got about 2,200 archival collections that have some sort of mention of George Washington. And this is one of the areas that, in the user research Rachael Hu did, she identified as one of the most useful sections to the researcher: these links to the collections.
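The location facet described above can be sketched as a simple count over each result's repository locations. This is a hypothetical record structure for illustration; SNAC's real facets are computed inside the XTF index, not in application code like this.

```python
from collections import Counter

def location_facet(records):
    """Count how many matching records have material held in each location."""
    counts = Counter()
    for record in records:
        # A record can be held in several repositories; count each location once.
        for location in set(record.get("locations", [])):
            counts[location] += 1
    return counts.most_common()

results = [
    {"name": "Washington, George",
     "locations": ["Washington, DC", "Charlottesville, VA"]},
    {"name": "Jefferson, Thomas",
     "locations": ["Charlottesville, VA"]},
]
print(location_facet(results))
# Charlottesville, VA appears in both records, Washington, DC in one.
```

Clicking a facet value then just filters the result list to records containing that location.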
This is somewhere where we still need to do a lot of work to make it really easy to use. Right now we've got 2,200 here; I think we've got one record where we're going to have 24,000 related collections — in the Henry record, is that it? — and 30,000 related people and organizations. Right now these are just big lists, and so there needs to be some sort of filter on them so that, with these larger records, they're really going to be useful to people trying to find stuff. Other things that we pull together in here: you can see the thumbnail from Wikipedia, which has a link to the rights statement — I'll talk about how we pull that in. (What is different about this?) Yes — in this view we've popped open the related external links, so we have links into ArchiveGrid, and if there was a hit in DPLA, and you can get to Wikipedia and other authorities in this section. Here's an example with 1,700 related names — it's just too much to browse, so you just click on that and it opens up. We are pulling in a lot of alternative names: if you look back here there's this alternative names link, and if you click on that it pops up all of the alternative names, which are coming in from VIAF, mostly. We've also added this collection locations pop-up, and this is something, too, that we've identified as potentially very helpful for researchers. You see a couple of issues here: it's an authority system, but we have no authority control on the locations right now, so there are some duplicates and some weird things going on. The other thing that we want to do is get the actual geographic locations coded for these collection locations.
The idea is that a researcher would want to be able to map where their research interests are collected, and use that to help plan a travel grant or something like that, and see if there are geographic clusters of their research interest. There's a graph that we had going in the first SNAC that is now going into the second SNAC. We found that this graph, while it is sort of cool to watch it animate, and you can click around in there, is probably not that useful to researchers, for a variety of reasons: they want to be able to see, color-coded, what the relationships are, and be able to do more things in it. With this particular toolkit, the JavaScript InfoVis Toolkit, the way it works with canvas makes it really hard to tweak what's going on. We're looking at another toolkit called Viz.js — it's a JavaScript version of Graphviz, which uses the DOT language — and the way it's written is much easier to customize. So if we go further into the graph visualization work, this might be a better toolkit to use; this is just Viz.js versus the InfoVis Toolkit. Another change that came out of the user research: before, in SNAC, when data was missing, the records sort of looked broken, so we added some hopefully really clear placeholders — the biographical notes are just not available; it's not that you've gotten to a broken record — so hopefully that's clearer for researchers. To talk a little bit about some statistics — this probably isn't readable — we've got a sort of stats page on the index. You can see here we've got 3.6 million records in the index currently; about three and a half percent of those have a Wikipedia link that we've pulled in through VIAF, and then a fraction of those have thumbnails that
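The kind of relationship graph described here is a natural fit for Graphviz's DOT language, which Viz.js renders in the browser. A small sketch that emits DOT for one identity's ego network, color-coding edges by relationship type as the user research suggested (the names and relationship labels are made up; the real SNAC graph is driven from the merged CPF data):

```python
def ego_network_dot(center, relations):
    """Emit a Graphviz DOT graph of one identity and its related names,
    coloring edges by relationship type."""
    colors = {"correspondedWith": "blue", "associatedWith": "gray"}
    lines = ["graph ego {", f'  "{center}" [shape=box];']
    for other, rel in relations:
        color = colors.get(rel, "black")
        lines.append(f'  "{center}" -- "{other}" [color={color}, label="{rel}"];')
    lines.append("}")
    return "\n".join(lines)

dot = ego_network_dot("Washington, George", [
    ("Hamilton, Alexander", "correspondedWith"),
    ("Continental Army", "associatedWith"),
])
print(dot)  # Paste into any DOT renderer (e.g., Viz.js) to view.
```

Because DOT is plain text, the styling tweaks that were hard with the canvas-based toolkit become one-line attribute changes here.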
we've been able to grab out using a SPARQL endpoint. What else did I want to mention? It was just sort of interesting here that only 83% of the records have a location, and the British Library, Australia, and another gateway that's in Britain are right up there at the top — so we have a lot of international repositories right at the top, for some reason. We've been keeping Google Analytics on the site since we launched. This is visits per day; somehow in January we started getting about a thousand a day, at least one day a week — I don't know what happened in January. We still have the old site running, and it's getting almost the same amount of traffic, so we have to do something there as far as figuring out how to redirect the old site to the new site. And then this — I found this really fascinating: on the newer site we enabled the Google demographics feature on the analytics, and according to this, over 30% of our users are 65 or older, and I'm just fascinated by what's going on there. (The Bill O'Reilly show? That makes me nervous.) I don't know why; most of the hits are coming from Google, so something about the words we have has great affinity with the searches of people in an older demographic — I think Medicare could advertise on our site, or something like that. So I'll talk a little bit about some of the technical challenges. The biggest thing was when we went from 128,000 records to now more than 3 million records in XTF. XTF is a system that we developed at CDL and have been using for probably 12 or 13 years now. It's an integration of Saxon, which is an XSLT engine, and Lucene, which is a search engine, and they're pretty tightly integrated. We'll probably never upgrade Saxon past 8 because of the license change between 8 and 9, and we'll probably not upgrade the Lucene, but it's something that CDL is using in production and will probably be using for another 10 years at least — we're committed to supporting it for our own uses. It's on
GitHub now; we accept pull requests, and we've gotten a few from the community of XTF users. There is a SNAC fork of XTF; the SNAC fork is pretty much the same as the regular one, except for the XSLT stylesheets, which are customized for EAC-CPF rather than the generic XTF. And then there were a couple of places in XTF where there was a series of eight nines hard-coded in the Java code, and I had to add a couple of nines to those strings of nines before I was able to get it to work with this number of records. We never figured out what that number meant — Martin, who wrote it, didn't remember — but we just increased it and it worked. I also had to add a lot of RAM: it takes about 14 gigs of RAM to run the Tomcat that XTF is running in, and we're at about 41 gigs of XML right now that it's indexing. In this next column here I put Bower and Grunt — it's actually Bower and Middleman that I was using on the front end. Bower is really cool if you've never used it before in front-end development: it's a packaging system, so you don't have to pull all of your JavaScript and CSS libraries into your repository; you just put in a list of all the libraries you use, and it knows how to go out and grab them and install them. In the new SNAC, to make it work, one other thing I had to do was change it so it only does 20 results at a time, and those smaller chunks come out as JSON, which feeds into SlickGrid, which lets you do the infinite scroll of three million records. And then TinkerPop is another piece of the stack; it has this thing called Rexster, which is the server serving the graph visualization, and it's what we used, for the first SNAC, to create the version of RDF that we published of all of the records. I'll skip this one. A couple of other notes about XTF: one thing is that at CDL we're moving from our data center to Amazon,
and as part of that, Martin, who works on XTF, was exploring how to cluster XTF. We don't have fancy clustering like Solr or Elasticsearch, where you can run a lot of instances that shard the index around, but he's developed a strategy where we index on NFS and then the web workers tie into that same index — so that's a clustering strategy we now have in XTF that we didn't have before. And then I found out about another trick that Martin uses that's not in the main XTF, which is something that we're going to need for SNAC: Rachael found that people want to see updates come through to the site pretty quickly, and right now, on the other XTF sites I've been working on, we do a daily batch reindex, but there's a queue-based mechanism that Martin is using in eScholarship that we're going to adapt into SNAC, and that will become available either in the SNAC fork or maybe back in the main XTF fork. With just the number of files we have, the way the indexing works, it takes a long time right now. So, the Wikipedia thumbnails: you might think, well, a thumbnail is probably fair use, but a lot of the thumbnails — or at least some big fraction of the thumbnails in Wikipedia — are under attribution licenses, so I wanted to find a way that we could at least link back to the attribution page. I couldn't get that out of the normal RDF dump, but I found I could write a SPARQL query that would grab both the thumbnail link and the link to the attribution URL. So there's a pre-processing step that runs and grabs this link data; it's also doing searches against DPLA to see if there are any hits. And as I was thinking about this pre-processing step, and about this page and all of the different sources of information that need to be updated and kept up to date in there, I had this realization that there are sort of two tensions that
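The pre-processing step described here can be approximated with a SPARQL query against a DBpedia-style endpoint plus a little URL arithmetic to recover the Commons attribution page. In this sketch, `dbo:thumbnail` is a real DBpedia ontology predicate, but the attribution-URL derivation relies on the `Special:FilePath` naming convention and is an assumption, not SNAC's actual code.

```python
from urllib.parse import urlsplit, unquote

def thumbnail_query(resource_uri: str) -> str:
    """Build a SPARQL query asking for a resource's thumbnail URL."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?thumb WHERE { <%s> dbo:thumbnail ?thumb . }" % resource_uri
    )

def attribution_page(thumb_url: str) -> str:
    """Derive the Commons file-description (attribution) page from a
    Special:FilePath thumbnail URL -- an assumed, convention-based mapping."""
    path = urlsplit(thumb_url).path        # /wiki/Special:FilePath/Name.jpg
    filename = unquote(path.rsplit("/", 1)[-1])
    return "https://commons.wikimedia.org/wiki/File:" + filename

q = thumbnail_query("http://dbpedia.org/resource/George_Washington")
thumb = ("http://commons.wikimedia.org/wiki/Special:FilePath/"
         "Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg?width=300")
print(attribution_page(thumb))
```

The query text would be POSTed to the endpoint; the derived `File:` page is what the thumbnail links back to for attribution.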
are related but not exactly the same, going on as we try to think about how we keep these pages updated going forward. One of those tensions is centralized control versus decentralized control — we heard a little bit about that in Brewster's talk today, I think. And there's another dimension of people directly editing things versus this idea of a linked data cloud, where data sort of cooperates together into this bigger entity. I think that SNAC really has to work in all four of these quadrants: we've got a lot of different philosophies and challenges, and a lot of data to keep up to date. So I think we've got all of these quadrants going on: a traditional model, some sort of crowdsourcing, pulling data in from linked data sources, and being an authoritative source that people can link into. One change between the first prototype and the second prototype is that we do have persistent URLs now for all of the records, which we didn't have in the previous version. So this is my attempt at a framework for thinking about these different modalities of updating and maintaining the records. I'll mention one little technology piece I've been playing with: Amazon Web Services has something called Amazon Cognito. It was written originally for people developing iPhone and Android applications, to provide a mechanism so they could use any OpenID Connect identity provider to authenticate in the app, and it also provides a syncing service, so that the app could sync between your iPhone version and, when the user checks it on their Android, sync the data back and forth. Amazon recently released a version of this that works in JavaScript: you can authenticate with it and then give a token to the JavaScript client that can access different Amazon services, and it can also sync
data back and forth. So, the demo of logging in: this might be something that is relevant to something like saving book bags for users, if they wanted to save a bunch of records and be able to get back to them later, and it's possible that this could play into the access controls as far as making the site editable by archivist contributors. On the SNAC website we've got some of the forthcoming features that we're working on. The long lists of archival things is, I think, the most critical — having some way to really work with those lists of 2,000 names. But if you have thoughts of other things that we should be working on, other features that we should have, or you notice problems with records, there's a little question mark in the lower corner of every page, and you can go on there and send us feedback or vote on features that we should work on. And with that, I'll give it back to Daniel. Thank you, Brian. So I'm going to go through this rather quickly. Fairly early on we decided, in conversations with people, that there was a real rationale for not just doing all of this and then moving on to something else, letting all that data we painstakingly put together go away — in particular because archivists for years have been talking about wanting the ability to do cooperative authority control and to share their work. So we thought, well, this has been a good seed and foundation for that; let's start seeing how we might do it. The rationale for this is really economic, on one hand: you'll see when you go into SNAC that, especially for the super-connectors, the super-nodes, you'll have 35, 40, 50, 80 different biographical statements about someone when one would do. So if someone went in and just made one good biographical description, most everyone else would be content with it, or might tweak it to add some detail — so there's an advantage there. And why that's possible is
because the way in which all of these documents are interrelated is that one collection's creator is referenced in another collection — all of the documents that archives have in fact blow through the boundaries of their archives and are interconnected to many other archives. And so what we have in mind doing here is really creating an international, internet-based, linked archival authority system — which, I think, anyone who knows what linked authority systems are will recognize is a non-trivial task, and you should wish us good luck. But I do think the technology is there to be able to do it, and it's probably more of a social challenge than it is a technological challenge. For research users, the rationale for keeping this around — and developing it, extending it, adding to it, making it more comprehensive — is the integrated access to distributed historical resources. We know from our own user studies and conversations with researchers — one scholar exclaimed to me, "My gosh, I'm looking at two years of research presented to me on the screen" — that that sort of research economy is extremely appealing. And there's also the fact that it gives help, for the first time, in trying to build these historical social networks. It's the kind of thing that historians do all the time, painstakingly putting it together: you're trying to see who's related to whom, who they talked to, who they were influenced by, who they collaborated with. We're surfacing that information, and that's of extreme value to anyone doing historical research, trying to piece together how something happened, who the people involved were, and so on. So the strategy with respect to this is that we continue to use algorithms — illustrated on the right side — to feed into this; this is supposed to be a cylinder into which we just keep pumping more and more identities. That
shade gradient there is to say that at the bottom is the under-identified — who is that? is that even a person or not? — and at the top you have lots of evidence, lots of certainty; it's been human-curated and verified. What we want to do is increase the volume of that cylinder, but also, increasingly over time, have within it an expanding core of more and more verified and reliable identities. I bring up the International Standard Name Identifier — ISNI.org — because I've been very much influenced by discussions I've had with people involved with ISNI.org. This is a group of people who are really looking at identity reconciliation for public identities, and not only within the cultural heritage sector: it still includes the cultural heritage institutions, but also rights holders, publishers, managers of people's rights, the music industry, etc. For artists, authors, and so on, they really need a way of being able to identify all of these people and keep track of them, and so there's a collaboration going between the cultural heritage repository communities and these other entities in the commercial, private sector to collaborate on building reliable identities and the ability to reconcile against them. There is some similarity between what we're proposing for the cooperative and the way ISNI currently works, which is to say a combination of smart algorithms on one hand and smart people on the other. There are differences, though: ISNI really is cross-domain, but a lot of the major stakeholders are interested in living people, or at least those people with whom there are associated money issues, rights issues. But it's not restricted historically, because there are archives beginning to join ISNI.org, and lots of libraries as well. And I like to think that the archival domain overlaps
considerably with these other domains, except for the fact — I've jokingly said — that archives are interested in the long dead, the recently dead, and the nearly dead. A lot of what ISNI deals with may be people that will show up in archives but haven't shown up there yet, so there is a sense of a continuum. So I would like to think that the cooperative could participate in ISNI at some point — to quote from one ISNI presentation — in "collaboratively consolidating identities at universal scale." The next steps with respect to the cooperative: we've done a lot of planning, and we have a proposal to the Mellon Foundation underway to launch a pilot phase of this. This is still outstanding, so we'll find out about funding in June. But in this proposal the host will be the US National Archives and Records Administration. They've gone through the legal work of establishing a charter to make sure that their participation as host falls within their legislative mandate at NARA, and they've assigned staff to work on this — they're pretty excited about all of it. The technology infrastructure, at least temporarily for the pilot phase, will be hosted by the University of Virginia, and we hope to launch this pilot in July of this year.
For the inaugural members of the cooperative, we kept this kind of small, because we have a whole lot of issues to work through, and getting really big too fast would make it really hard. So we have a fairly small number of keenly interested, enthusiastic institutions involved. We don't have any state archives or historical societies, which is sort of a shame, but we've got two museums — one natural history — and the Getty Research Institute and the American Institute of Physics, which gives us a nice, interesting group. And then you can see there are a lot of august places, or perhaps slightly less august but still important places, and we've got the three biggies on the national scene: the Smithsonian, the National Archives, and the Library of Congress. It may be unprecedented that they've cooperated in this way, so we're kind of excited about that. The inaugural director of this pilot phase will be Laura Campbell, the retired Associate Librarian and Chief Information Officer of the Library of Congress. She's in that role because she's done things like this at the Library of Congress, and she knows how to do them within a federal context, which has lots of challenges. The deputy director of the cooperative will be John Martinez at NARA; Jerry Simmons at NARA will be in charge of the governance side of it and organizing that; Worthy Martin will be the technology lead; and there's a whole lot of other players involved. So, quickly, closing thoughts on this.
I've discovered that trying to turn something that's R&D into a program, especially within the context of the federal government, is exceptionally complex in multiple ways, and I think the most complex part of it is the social part — but I think that's probably true for all of us in most everything that we do. Intellectually and technologically, the means are quite frequently there for the things we might want to do, but convincing other people to get on board, fund us, join in, and agree to work together is the most difficult challenge. And in the end, what we hope is that this will contribute a significant component to the international humanities research infrastructure, building an international community of collaborating archives, libraries, and museums.