My name is Michael Buckland and I'm at the University of California at Berkeley. Mostly I worked in university libraries, and then I taught, and then I retired, and now I just get involved in fun projects. My colleague is Ryan Shaw, who was a software developer who came to my school in Berkeley and became interested in what on earth a historical event is and what you would do with it in terms of software. He is the chief scientist on this project. I will provide just a brief introduction to the project, and then the meat of the presentation will be by Ryan. After that we'll talk briefly about what we plan to do next, and I hope there'll be some time for questions and answers. We should acknowledge that the work is funded primarily by the Mellon Foundation, with some help from the Coleman Fung Foundation.
Now, scholarly annotated editions of historically important texts are a major resource in the humanities and are needed for all kinds of purposes. They're very hard to prepare. It takes a lot of highly specialized expertise over a long period of time. They're expensive to produce, they're difficult to fund, and the work practices are still deeply constrained by the mentality of the print-on-paper codex.
That's the context in which we became involved with the Emma Goldman Papers project. Emma Goldman was a notorious anarchist a hundred years ago, barnstorming across the United States giving inflammatory speeches, a charismatic figure. The more we learned about the circumstances under which the editors prepared their material, the more appalled we were.
The way it works is like this. The editor finds an interesting letter addressed to Emma Goldman, and it's signed perhaps by Fred Schmidt. The first question is: who was Fred Schmidt? Probably that was not his real name, because the anarchists that she corresponded with didn't want to get deported, so they used pseudonyms.
So who was Fred Schmidt really, what was interesting about him, and what was his connection to Emma Goldman? And in the second paragraph it refers to "the recent happy event." Was that somebody's baby, or was that the blowing up of the LA Times building? These are the questions that the editors have to figure out, and they prepare lots of notes. Ultimately, they prepare a detailed explanatory footnote that makes the letter meaningful for the reader. A couple of years later, they get out a wheelbarrow and take the manuscript of the next volume down to the university press. The university press says, "That's wonderful, but you can't have that many pages. Just take the footnotes out and we'll publish without them." It's known as return on investment.
Meanwhile, in New York City, the editors of the Margaret Sanger Papers are doing their work. Now Margaret Sanger, the advocate for women's rights and birth control, moved in a similar social circle to Emma Goldman. They knew each other, and it's quite likely that the Sanger editors have a letter from Fred Schmidt. So their question is: well, who's Fred Schmidt? They don't know that the people at Berkeley have already been working on this. Who was he really, what's his connection to Margaret Sanger, why is he interesting, and that sort of thing. Two years later they get out a wheelbarrow, they take the manuscript for the next volume to the university press, and the press says, "Wonderful. Too many pages. Take the footnotes out." You get the picture: multiple, duplicative research, wasted.
So we came up with a really, really basic technological fix. When their aged desktops had warmed up and their obsolete version of WordPerfect had loaded and they wrote their notes: Save as HTML. Then carry on as normal. Hang it on a website, and it's immediately available in full, for free, for anybody. That was the basic idea. Now, a lot more has been done than simply that, but that is the fulcrum upon which everything else follows.
As it happened, none of them had a suitable website for putting the stuff on, so we built a website for them that they could share. Then last August, without any announcement, we simply removed the password control so that anybody could see it. Not everybody could change the stuff or add to it, but anybody could see it. Within a month, people from all around the world were using it. It had been indexed by Google and Yahoo and so on. Now, I won't say it was getting very much use, but it was really cool to see that this actually happened. It was what was meant to happen, but it actually did. And so now I'll ask Ryan to get into the substance of what happened.
All right, thanks. So Michael gave you a little bit of the background on what documentary editing is. I'll just illustrate a little bit of what he was saying, and then I'll describe the current state of Editors' Notes after the past two and a half years of us working on it. Then in the second part of the talk I'll be talking about what we're looking to do going forward. We just received new funding, so we'll be working on this for at least another two years. One of the things I'll be focusing on is our plans to exploit linked data, so consuming linked data rather than just producing it, and also some of our ideas about what we call hibernating scholarship, which Michael will explain a little bit at the end.
So the first part of the talk is what we've done so far, and the second part is things that we plan to do.
Michael mentioned the context in which we're working, which is documentary editing. Editors prepare collections of documents like letters and articles and diaries, and they produce printed volumes in which they try to provide context for these historical documents. The typical workflow in a project like this is that they gather documents from various sources and various archives. The particular scholars that we work with are working in this area of radical history, where many times the documents are not actually in official institutional archives but have been collected and maintained by activists and people sort of on the fringes. They select particular items that they want to try to contextualize, they publish the final product in the shape of these documentary editions, and then they repeat as funding allows. Many of these projects have been going on for decades.
Just as one example, take the Emma Goldman Papers. Their final product is this: a page from one of the edited volumes. We see a letter here from Emma Goldman, and footnotes at the bottom explaining the references and names and so on. This is the output of that process. The input is boxes and boxes of documents, file folders and cabinets, and also a lot of digital notes, including stacks and stacks of old floppy disks with FoxPro databases and whatnot on them, even tape backups from the early days of the project, and these days also the shared file directories and so on that they contribute to and take their notes in. A lot of the communication happens like this. This is an image of a note written from one of the editors at the Emma Goldman Papers to the student research assistant who is working for him.
It's addressed to Patrick. Lenin: had any of his family members besides his brother been imprisoned? What was the book he had written on political economy that was used in Russian universities? And then there's a reference to what the editor thinks was the New York Union Post editorial that presumably Lenin had written in 1918 on the IWW verdicts. It's hard to tell whether that was actually related to the other things or not, but this is the note that Patrick got. Patrick then took this, went into the archives, went to the library trying to answer these questions, and typed up his notes, including the sources he had used. This may have been in a Word document, it could be in a yellow notebook, or it could be in an email that the research assistant sent back. In this case he reached a negative conclusion to the question that the editor asked. But nobody's ever going to know unless they happen to break into the Emma Goldman office, go through that box, and find this document. Because it was a negative answer to the question, it's not the kind of thing that's going to go into a footnote, which generally contains some sort of positive information about something that was referenced in a document.
So when we first started this project we were interested in addressing a number of problems. The published volumes and the work that goes into producing them are very expensive, funding these projects is difficult, and there was a lot of great research being done that was not ending up in those volumes. So could there be better return on investment by making more of the product of these projects more widely available? There's a lack of space for all the footnotes and all the contextualization the editors would like to put into these volumes. If it were made available online, that wouldn't be an issue. Much of the research done is lost or not included at all, whether that's fact-checking or especially this kind of falsification or dead ends. It's often useful to know that something was not found despite looking in a number of different archives, and yet that's not the kind of thing that would get included in an edited volume. There are tangential biographical details that may be of interest to someone but are not the core focus of the editing projects, just something they turned up while researching something else. And then there's the preservation and legacy of these projects. If you have very skilled scholars working for 30 years, it's kind of a shame if the edited volumes are all that's left of all that work.
So enter the Editors' Notes project. If you want to look at the tool itself, it's at editorsnotes.org. There's a description of the project at this ecai.org address, including some of the publications we've written about it. Our aim was to provide a safe place for this kind of debris of the research that is conducted in the course of doing a documentary editing project, improving this return on investment like I mentioned.
We could have approached this by saying, let's go in and try to digitize all the stuff they've already done, all the notes they've already created. We didn't really want to do that. We wanted to focus on what we would need to do to actually change the way the editors and the researchers work together. So rather than just coming in and digitizing everything, we wanted to intervene in what they were currently researching and see if we could develop tools that they would find usable, tools that would enable some of this stuff to be captured as it was going on rather than digitized afterwards.
Early on we identified a number of principles that we wanted to follow. The researchers are already overworked and underfunded.
We didn't really want to slow them down any more than we had to, so we wanted to add a minimal amount of friction to their research process. We were working with a few different projects that had different work habits. We didn't want to force them all into the same exact workflow, but we did want consistency in the kind of data models that we use so that they could exchange information, since one of the big motivators for this was to allow one project to use the research that other projects had developed. We wanted to build on existing technology wherever possible and not try to build everything from scratch. And we wanted to adhere to web standards. We really felt that this should be of the web.
The underlying data model basically looks like this. There are the notes that the scholars and researchers produce. Notes can be divided up into individual sections, and each section may cite a particular document that was consulted. We have metadata about the document, and there may be scans of those documents and/or transcripts of the documents. The transcripts can be annotated as well, so we can do some of the footnoting that may end up in an actual published volume. Both notes and documents are indexed by topics, which may be people, places, organizations, or broad themes. These can be whatever the editors think is appropriate. Topics can be associated with each other, and a topic may have a summary, sort of a summary article.
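To make the data model just described a bit more concrete, here is a minimal sketch in Python. It is not the project's actual schema: the class names, field names, and the helper `notes_for_topic` are assumptions made up for illustration, and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Topic:
    name: str
    kind: str                      # e.g. "person", "place", "organization", "theme"
    summary: Optional[str] = None  # optional summary article for the topic


@dataclass
class Document:
    description: str                         # bibliographic metadata, simplified to one string
    scans: List[str] = field(default_factory=list)
    transcript_html: Optional[str] = None    # transcripts are kept as HTML
    topics: List[Topic] = field(default_factory=list)


@dataclass
class NoteSection:
    text_html: str
    cited_document: Optional[Document] = None  # a section may cite one consulted document


@dataclass
class Note:
    title: str
    sections: List[NoteSection] = field(default_factory=list)
    topics: List[Topic] = field(default_factory=list)


def notes_for_topic(notes: List[Note], topic_name: str) -> List[Note]:
    """Return every note indexed under the given topic name (hypothetical helper)."""
    return [n for n in notes if any(t.name == topic_name for t in n.topics)]


# Placeholder data, for illustration only.
sanger = Topic("Margaret Sanger", "person")
india = Topic("Birth control movement in India", "theme")
letter = Document("A letter consulted in the archive", topics=[sanger])
note = Note(
    title="Sanger's trip to India",
    sections=[NoteSection("<p>Notes taken on the letter.</p>", cited_document=letter)],
    topics=[sanger, india],
)
other = Note(title="An unrelated note")

matches = notes_for_topic([note, other], "Margaret Sanger")
print([n.title for n in matches])
```

The point of the shape is that the same document or topic object can be pointed to from many notes, which is what lets different projects share and regroup their research.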
A summary is something that the Emma Goldman Papers does a lot: they keep an up-to-date summary of what they know about a particular topic. So that's kind of like a special note. And then I'll talk a little bit in a moment about these factoids.
So here's an example: the interface for adding a document. As an example of building upon existing infrastructure, we interoperate with Zotero for the document metadata. Even though you can enter it directly into Editors' Notes, if you have a Zotero database or Zotero account, you can simply sync those documents and not have to re-enter everything here. We didn't want to reinvent the wheel for bibliographic metadata. We use a service from Microsoft called Zoom.it: we keep the actual scans, but Microsoft provides an interface that allows them to be easily zoomed in on. The transcripts of the documents are stored in HTML, and we have this interface to annotate passages of text.
Here's an example of entering a new topic. Topics are the primary method of indexing items. They're classified by type: like I said, we can have persons or places or organizations. It's just another layer of organization for the topics. Because topics may be created by different people over time, they may not realize that a similar topic has already been created. And because we're interested in having a consistent taxonomy of topics across the different projects that want to share information, we also have an interface for clustering and merging topics that uses Google Refine. Users don't actually have to interact with Google Refine, but our server talks to Google Refine and uses the tools that Refine has available for clustering similar strings and things with similar properties. It alerts people and says: you have two very similar topics here. Do you want to merge those or not?
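As a rough illustration of how similar topic names can be flagged for a human to review, here is a small Python sketch of a key-collision "fingerprint" approach, in the spirit of the string clustering that tools like Google Refine offer. It is a simplification written for this transcript, not the project's code and not a call into Refine itself; the function names and sample topic names are assumptions.

```python
import string
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Normalize a name so that trivially different spellings collide on the same key."""
    cleaned = name.strip().lower()
    # Drop ASCII punctuation such as commas and periods.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so that word order does not matter.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def merge_candidates(topic_names):
    """Group names sharing a fingerprint; only groups of two or more are suggested."""
    groups = defaultdict(list)
    for name in topic_names:
        groups[fingerprint(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Sample names for illustration only.
topics = [
    "Goldman, Emma",
    "Emma Goldman",
    "emma goldman ",
    "Sanger, Margaret",
    "Margaret Sanger Jr.",
]

candidates = merge_candidates(topics)
for group in candidates:
    # The system only suggests; a person decides whether to merge.
    print("Possible duplicates:", group)
```

Here the three Goldman spellings are proposed as one group, while "Sanger, Margaret" and "Margaret Sanger Jr." are left apart because their fingerprints differ. Nothing is merged automatically; the output is only a prompt for an editor's decision.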
So it's not an automatic process, because there are sometimes, say, a junior and a senior who have very similar names. We don't want to automatically merge those. We just want to alert people when it happens.
And then there are notes. The notes are the most difficult part of the project. This is the part that we spent the most time iterating on, trying to come up with a good way of modeling what notes are and reflecting the editors' thinking about them. Notes are messy, and purposefully so. One of the issues we had early on is that people felt that putting their notes into a system like this made them look too polished, and they didn't like that. They wanted the interface to reflect the messiness of their thinking at the time. So how do we model something that's so chaotic and idiosyncratic? How do we create something that's easy to use and flexible but has enough consistency that we can share information across projects?
After lots of iterations we finally settled on something that seemed to meet the needs of the projects that we were working with. Here's the interface for adding a note. As I mentioned earlier, notes can have these different pieces. An individual note may cover a broad theme or a broad question, like Margaret Sanger and the third ICPP conference, but then there are the individual documents that were consulted, and the notes specific to those documents make up the body of the note. Modeling things that way enabled us to meet the needs of groups like the Sanger Papers, where the notes were organized more around particular research questions, all the places they had looked to investigate each question, and what they had found, as well as the style of other projects, where notes were more centered on the documents they looked at. There the title of a note would be something like "Margaret Sanger's letter to Pearl S. Buck," and they would basically just be taking notes on that and then maybe using some keywords to group those together.
So notes have a description. They have a status, which indicates whether or not they're being actively researched. Open means there's a question that's been posed and they're still actively looking for information on it. Closed means they feel they've gathered all the information they need about that topic. Hibernating means they haven't gathered all the information they need, but they've either decided to work on other things for a while or they've reached a dead end. They're not willing to close it, but they need to put it off to the side.
They can assign users to a particular note. If we go back to the example I was showing earlier, the editor could create the note about Lenin's brother, or whether other people in his family had been jailed, with his initial questions, and assign that to Patrick. Patrick could then see that it had been assigned to him, take it over, and add his research to it. I already went over the sections. These notes are all stored as HTML, and every revision to them is stored as well, so there's full revision history for all the information that goes into Editors' Notes.
Just to show you some examples, here's one from the Margaret Sanger Papers. We see this one is still open. Here's the initial question.
What changed as a result of Sanger's trip to India? Lots of the books written later seem to describe the same situation as before Sanger went there, while others say that she was influential. Here's the actual body of the note itself, where we have each document that was consulted and then the notes that were taken on that particular document.
The notes are at a document level. That's a good question, though, because there is this issue about the granularity of the bibliographic metadata, and that's something that came up with our integration with Zotero. They sometimes wanted to model the fact that a document they had consulted was actually a page in a scrapbook, let's say, and Zotero doesn't necessarily provide the tools to model things in that fine-grained a way. So we actually have some ability to extend beyond the Zotero data model when it's necessary, to add some additional properties and things like that. We can go here, click on that, and get this one, which actually didn't come from Zotero, so it has no connected Zotero data. And I'm not sure why it doesn't show the note. This note should be showing up there, but okay.
And then here are the topics that were used to index this note: mostly the people that are mentioned here, but then also broad themes like women in India, birth control movements in India, and so on. And here's the history of the note, where we can see the revisions that have been made.
A slightly different style of using Editors' Notes is from the Stanton and Anthony Papers. They were a little bit different case, as Michael will explain a bit more in a moment. They are a project that was in the process of ending, so they didn't have as many active research topics.
They were using Editors' Notes more to publicize particular documents that they had collected. They've taken all these test cases during Reconstruction, scanned all the documents, and made them available, and the notes are more intended for public consumption than for their internal research. We can filter by particular publications. Here we can see an example of something that we have scans for. There's not much to this scan, but this is the zooming interface and so on. And here's an example of a document for which a transcript has been provided, with footnotes on individual sentences from the transcript giving some explanation. These are also modeled as notes in the underlying database but presented in a different way here.
Okay, so what changes as a result of using Editors' Notes rather than doing things the way these projects had been doing them? The notes themselves have changed from totally free-text documents to semi-structured blocks that can be rearranged in different ways. This is one of the benefits of using Editors' Notes most often cited by the projects. Before, even projects that mainly worked digitally would basically have a single RTF file or a single Word file in one folder on their shared server. Here they could have some notes about a particular document that was actually relevant to a number of different research questions, and they had the flexibility of including that in different places because of the way our database was structured. A lot of the information about the people, places, and events to which notes were relevant was implicit in their previous organization systems, and here we made those explicit, linkable entities that they could point to. And most importantly from our point of view, a lot of this stuff that was stuck in filing cabinets was now openly available on the web, something they could cite.
Early on in this project, and actually one of the things that motivated it, one of the editors from the Emma Goldman Papers had some of his edits to the Emma Goldman page on Wikipedia reverted because he couldn't point to any sources. He was citing research that they had done for the current volume, which hadn't been published yet, so he couldn't point to it and cite it. Now he can point to the page on Editors' Notes, and because it's external, on the web, that's enough to make the Wikipedia editors happy.
So, the benefits. These connections linking topics are freed from the minds of the editors and the researchers and are indexed and made available for anyone to see. These standardized records of the work can easily be revisited from within a project or from outside. A lot of the usefulness was simply that there's high turnover among research assistants, and new ones could come up to speed really quickly rather than having to wait until somebody told them, "Oh, it's in that box over there." It's a new way of seeing some of the outer edges of humanities research: the work that actually goes into producing these published volumes. It's also evidence of the work that's actually going on in these projects, which is important from a PR or funding point of view for a lot of them. For a project it's like: well, your last volume came out eight years ago.
What have you guys been doing? And you can actually point to this and say, well, we've been doing a lot.
And then the system itself. The source code is all available on GitHub. It's all built using open-source technologies: the Django web framework, with data all stored in a Postgres database, using its ability to store XML. We actually store XHTML for all the documents. We use Haystack for our full-text searching. I mentioned Zotero and Google Refine. And when we opened up to the public and started allowing other people to create accounts, we started using Mozilla Persona for ID management, which I highly recommend. It's a very nice way to avoid having to keep passwords and personal information about users, without requiring them to use Facebook or Google or some other privacy-invading method of logging in to your tools. Their personal information is actually stored in their browser, and yet they can still get the kind of one-password login to sites.
I'm not the person who actually implemented this, so I'm not sure if I can answer that question. Patrick Golden, who's our lead developer right now, could answer it. But from the point of view of the end user, they click on a button on our site and it pops up a window which is from Mozilla, but then it is storing information in your browser. So when I sign on, it stores it in the local database. I'm not sure about all the different models and how exactly that works, but it's worked pretty nicely for us.
Okay, so looking forward to what we're planning to do for at least the next two years. There are a lot of desired enhancements that came up in the process of getting to where we are now. One was a desire for better sorting, filtering, and aggregating of notes beyond what we can currently allow. I mentioned a little bit of the naming control in terms of the clustering and merging of topics, but there was also a desire for making things discoverable in other kinds of contexts. So if somebody has an archival finding aid that mentions some of these people, we'd make it easier to integrate some of the interesting related notes that have been taken about the people mentioned in that finding aid. And then there was also a wish for cool temporal, geospatial, and relational visualizations of the research data being collected.
All these things pointed to a need for more structured data. A lot of the reason we couldn't provide enhanced sorting and filtering was that the editors wanted to enter everything as free text. They didn't want to fill out forms, and yet without that structure it was a lot harder to do those sorts of things. There are three basic approaches to adding structured data to these semi-structured documents. One, which was a non-starter for us, is schema-based authoring, basically filling out more form-like interfaces. For these kinds of messy notes, the editors don't really want to do that. Another possibility is trying to automatically enhance these textual documents with metadata, using information extraction software to identify named entities and reconcile them automatically. That can work pretty well, but not when you have people who are really picky about things being right. If you're willing to accept mistakes and bad data, then that can work.
Okay, but again, we found that this was not something our users would be happy with. Something we did experiment with was human-in-the-loop reconciliation of the topics being researched to external data sets, from which we could import structured data that other people had been producing, particularly in the library community, but also sources like Wikipedia, and we're hoping going forward more and more from archives. So we started experimenting with linking our topics to external things like the Virtual International Authority File, DBpedia, the Deutsche Nationalbibliothek's information about people and the things they've done, the British national biographies, all these emerging sources of linked data.
As an early experiment in this vein we developed a linked data harvester. As people created topics, we sent their topic names to a service called sameAs, which would return candidate URIs for each of our topic names. Then we would go and get representations of those candidate URIs from the various sources and basically harvest a bunch of assertions that had been made about, say, Emma Goldman in those sources. So we get back all these statements, and we would store the information we got from different sources in separate graphs. Then we provided an interface to the editors where they could see the assertions that had been harvested and accept or reject them. They could do that at an individual assertion level; at a property level, as in "this is a property that I don't really care about, so don't show me this anymore"; or at a whole-source level, like "Freebase, that's terrible information. Don't show me Freebase anymore."
Or, "Don't show me Freebase anymore as it relates to Emma Goldman."
That was a pilot project that we did in trying to actually be a consumer of linked data. We learned that it was possible to build a system to automatically harvest this relevant linked data, and that we could provide this interface for interacting with it. But we also found that this editorial control needed to be better integrated into the note-taking process. It violated our principle of not adding friction, because there was this extra thing to do where you go in and accept or reject statements. We also didn't adequately demonstrate why you would want to do this. We started out by collecting data rather than immediately trying to deliver some of the benefits of having structured data that I mentioned. So we decided that to do this properly, we needed to not just aggregate and edit the linked data but really try to usefully exploit it.
What we're currently working on is trying to address some of those issues. One is in-process reconciliation: rather than separately creating topics and later reconciling them in a separate batch process, editors create, link to, and reconcile topics as they take notes.
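The accept-or-reject review of harvested assertions described above, with rejection possible per assertion, per property, per source, or per source for one topic, might be sketched in Python roughly as follows. This is an illustrative reconstruction under assumptions, not the pilot's code: the class names, the source labels, and all the values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    topic: str    # the topic the statement is about
    prop: str     # the property being asserted
    value: str    # the asserted value (placeholder strings here)
    source: str   # where the statement was harvested from


class ReviewRules:
    """Editor decisions about what should no longer be shown for review."""

    def __init__(self):
        self.rejected_assertions = set()     # individual statements
        self.rejected_properties = set()     # "I don't care about this property"
        self.rejected_sources = set()        # "don't show me this source anymore"
        self.rejected_topic_sources = set()  # (topic, source) pairs

    def shows(self, a: Assertion) -> bool:
        if a in self.rejected_assertions:
            return False
        if a.prop in self.rejected_properties:
            return False
        if a.source in self.rejected_sources:
            return False
        if (a.topic, a.source) in self.rejected_topic_sources:
            return False
        return True


def pending_review(assertions, rules):
    """Return the harvested assertions an editor would still be asked about."""
    return [a for a in assertions if rules.shows(a)]


# Placeholder assertions, for illustration only.
harvested = [
    Assertion("Emma Goldman", "birthDate", "placeholder-date", "viaf"),
    Assertion("Emma Goldman", "thumbnail", "placeholder-image", "dbpedia"),
    Assertion("Emma Goldman", "occupation", "placeholder-occupation", "freebase"),
    Assertion("Margaret Sanger", "occupation", "placeholder-occupation", "freebase"),
]

rules = ReviewRules()
rules.rejected_properties.add("thumbnail")                       # property-level rejection
rules.rejected_topic_sources.add(("Emma Goldman", "freebase"))   # source rejected for one topic

remaining = pending_review(harvested, rules)
for a in remaining:
    print(a.topic, a.prop, a.source)
```

Keeping the source on every assertion mirrors the idea of storing each source's statements separately, so that a whole source can be switched off without touching what came from the others.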
So we're trying to make interfaces that make that easier to do. And then, most importantly, we're motivating this structured data use. Whereas before we were just enabling them to store and edit the data, and not really providing any incentive for them to do so, now we're trying to have a system in which they immediately get new abilities to sort and filter their data, or to create simple visualizations, by doing it. So there's an immediate payoff for linking to structured data.
Where we already have structured data is the bibliographic metadata, so we already provide tools for sorting and filtering based on that data. What we plan to do now is extend that, so we can filter and sort notes not only using the dates of the cited documents but also information like the locations and birth and death dates of the people referenced in the notes, the locations and dates of existence of the organizations, and the locations and dates of the events that are referenced.
In terms of visualization, this is an example of a side project that we did, where we created a visualization based on notes from the Emma Goldman project. They maintain an internal chronology of where Emma Goldman was on each day, and we linked those entries to documents that they had scans of. Then we gave you the ability to narrow down on a timeline and search on the map to see particular places that she had been and follow her travels. This entire document is actually a self-contained graph of linked data. It's all RDFa embedded in a single page, plus the JavaScript interface for interacting with it. The editors really liked that, and they wanted to do more things like this, but it was a lot of work just creating it for this one chronological document.
So that's the kind of thing that we want to make easily available for any of the note-related data. We want a note on Rama Rao and the fourth International Conference on Planned Parenthood to no longer just be available as a textual note but also be visible as a map of the specific locations in Stockholm and Bombay that are mentioned in the note, a timeline of dates associated with the conference, and a network of relationships among the people and organizations. Basically, any topic which has geographic coordinates we can map, any topic that has time points or ranges can be put on a timeline, and any relationships among topics can be visualized as a network.
As for the benefits we expect: we hope that the working notes will become more repurposable, so that they're not just things we can read but things we can feed into these visualization tools. They can become more discoverable, because we've linked our topics to standard identifiers, so it becomes a lot easier for somebody else to come along and query these repositories of notes for everything that's relevant to a particular topic. And the last thing is shifting the focus of these projects from the one-shot product, the edited volume, to a continuous data curation process: seeing the projects as continuously collecting data from archives, adding to it and enhancing it, and then making it available again.
I think we're almost out of time, but the last piece of the puzzle here is what happens when these projects end. Can we build systems that continue to be available for others to pick up even once the funding is no longer available? That's the hibernating scholarship idea.
In addition to all that, the editors found their editorial projects easier to manage with this, because they could tell who had made how much progress with what.
It's also the case that curators of special library collections have similar knowledge and could make similar notes about the documents in their collections, and we experimented with that. It's also true that professional archivists know a lot about the documents under their care and could do a similar sort of thing, and we hope to explore that with the California State Archives in the second phase, which started this week.
In addition to exploring what archivists could do, there is what Ryan has just been describing, which I would describe in the following way. If you go to a big digital humanities conference, you get these presentations by really expert people who've spent hundreds of thousands of dollars developing some really spiffy visualization. How can you get some version of that functionality down to the level of these poor people who are working so hard and are not about to take the time to learn much? That's what is driving what Ryan was talking about.
But the hibernating scholarship issue is this: nobody seems to know what happens to these editors' notes when the projects end. The granting agencies give a grant to produce the manuscript for a publication. The moment the manuscript for the last volume goes to the press, the staff are laid off, period. Seriously, this is a soft-money business. If it's a founding father or a president, then probably the project will be sustained and the records will be sustained, but most editorial projects are not that.
Last September, the joint Elizabeth Cady Stanton and Susan B. Anthony project at Rutgers ended after 25 years as a large, well-managed project. The manuscript went to the press last September, and that was what the final grant was for. The staff were laid off. There was no discussion as to what to do with the large room full of all the kinds of material that were in the image. Not even any discussion.
The only thing that was certain is that the dean who owned the space was going to want it back. What's wrong with this picture? Their notes included extremely detailed, step-by-step notes on everything that had to do with every attempt to get votes for women in each state. Those records are not in the published documents, and they don't exist any other place. It's a terrible waste.
So a major part of the second phase, in the first year, is to work with Rutgers, where that project has shut down and fortunately been put under the School of Communication and Information, to do archival processing so that it's a ready-to-shelve archive. The lessons learned in doing that will then be put into the work practices at Emma Goldman and Margaret Sanger. This project is very much a professional work practices issue. That's why it has to be done slowly.
But it raises interesting questions about the relationship between the eventual published volumes and the research that's been going on. In effect, what has happened with Save as HTML and putting the notes on the website is that the boundary between what's published and what's not published has been moved. You see, the expensive eventual printed editions are not the sum total of what's published: the editors' notes are published too, including the problems they can't solve.
It's like the 19th-century Notes and Queries genre reinvented. Projects end, but scholarship goes on. Not only can they look forward to that frustrated, obsessive local historian in Possum Creek who has the membership list of the Emma Goldman Admiration Society in 1895 and doesn't know anybody interested enough to share it with, but it also invites a 180-degree turnaround in the relationship between the published volumes and the working notes. I think the implication is that instead of the eventual published editions being the sole product, which they are now, they should be seen as intermittent, desirable byproducts of an ongoing workshop defined as the working notes and the expertise.
It's what I like to call the Sleeping Beauty mode of archives. That is to say, if Prince Charming comes along with another grant, that work can come to life and be continued. Now, this isn't exactly conventional archiving, which produces a static collection. It's somewhere between a conventional archive, a library special collection, and I'm not quite sure what else, and that's a really interesting part of what we're going to address.
Thank you very much. We're stopping you from your