Hello, good afternoon. I think I'm going to begin, because I don't want to be the last thing between you and your first drink of the evening. We have until six o'clock. Please stop me at any time if I'm speaking too quickly, because I tend to speak quite quickly. My name is Barbara Taranto. I'm the digital program director at the New York Public Library. There are many people at the library whose title includes "digital"; when we chat during the break, I can tell you about all of them.

This is a very famous picture from the library, and I know Mr. Walker is in the audience. It actually hangs above Astor Hall, and it's a drawing of the way the stacks exist below the building at 42nd Street. You can see the eight stories of stacks that are actually holding up the Beaux-Arts building, which is a hundred years old this year. A very interesting philosophical position to be in, to have the books holding the building together.

So: from physical to virtual, by way of 2.1 million handcrafted, item-level metadata records. And you will note on the bottom that the rate of creation is accelerated for demonstration purposes.
So in fact making metadata records looks like it happens this quickly, but of course it doesn't happen this quickly. All 2.1 million of these metadata records were created in part to enable several large-scale projects: the first of these was the Digital Gallery, which is right here in front of you, with 800,000 images available to the public; In Motion: The African American Migration Experience; the National Digital Newspaper Program; and the digital object preservation and access repository. All of these things together were drivers for creating metadata records professionally.

During the six-year project I mentioned earlier, called the Digital Gallery — it was then known as Visual Treasures — the library employed 47 subject specialists to work on descriptive metadata. To the left, I'm sure you can't read their names, but they're up on the website: each one of them a professional, many of them with multiple degrees. As you can imagine, we call their activity OCD — if anyone heard me speak about this at DLF, it stands for "obsessive-compulsive descriptors": people who have never, ever put enough subject tracings on anything. I think rare book catalogers are possibly the premier case for that.

One of the first collections described was the Miss Frank E. Buttolph American Menu Collection, 1851–1939.
And "American" here refers to the collector, not to the menus, because the menus are from many places — many of them American, but many of them German, from ship lines coming over from Europe. On the left you can see there's a little collection up on the web, in the Digital Gallery, about them.

This is a view of the physical collection as it's housed in its new home, the rare books division. It's many, many linear feet. It used to be in the general research division, where anyone who could make an appointment could come in, request it, sit down, open the boxes, and go through all the menus manually. Once the menus were digitized, the collection was removed from public service, rehoused, and transferred to a monitored environment, and it now lives in Long Island City. That's not degraded — that's a step up.

The physical collection was used a lot, and there were a lot of very interesting stories. On the right you'll see "Catch of the day: see food trends." One of the most interesting stories we heard out of the physical collection was that a fish-population fellow — an evolutionary biologist, I guess — was interested in the number of dishes and references to fish in the menus at the turn of the century, so he could estimate the fish populations around Long Island. A very interesting application of looking at menus. So there's really no end to what you can imagine a researcher would be interested in.

The menu data began life as a list created by volunteers — and volunteers at the library can be any age, but generally tend to be people who do not need to work for a living — who sat at a desk in the reading room and literally handcrafted, typed, descriptions according to a protocol given to them by the librarians. Eventually the list was migrated to a spreadsheet, then to a FoxPro database, then to a FileMaker database, then to an Access database, and finally to HADES, which was our metadata authoring tool at the time. The data was munged, concatenated, and scrubbed — there were a lot of strange features in the data — then normalized, extracted, indexed, and it ended up in the Digital Gallery looking like this. I don't know if you can see it, but on the bottom there's actually a metadata record with some data in there — not professionally created, but professionally cleaned up. Lovely to look at, and chock-full of human-readable information.

Yet for all our best intentions, and the thousands of hours of work of metadata creation, our users were frustrated, and we were very frustrated. The kinds of questions you get: How come I can't search on the items in the menus? How come the personalities and place names are not in the index? Where are these places? Where is the Astor Hotel? Why can't I do a deep dive into the data? Can I have all the Roosevelt-related items? All things you can imagine someone being interested in — but we had no way to give them, because what we actually had were images.

The need was apparent and the will was there, but the money was gone. In 2007, faced with cutbacks and baseline figures regarding the true cost of metadata creation — about a thousand dollars a line — we took a deep step backward and looked at the program and the changing landscape of social media. What you're looking at is a picture of my son a few years ago, on two telephones and two computers, watching TV and playing a game. Among other things, we had an issue of scale: not enough staff, and possibly impossible to ever have enough staff to do the work we wanted to do.

So a rather senior manager at the library says, "Why not just OCR it?" And if you look at what's right in front of your face, it's pretty clear why you can't OCR it: it's either handwritten, or the fonts are so irregular, even when it's typed, that no OCR engine was going to pick up anything better than maybe 75 percent accuracy — which is pretty high, unless you're a user, and then 75 percent is a failure.

So we needed to deconstruct the digital workflow if we were going to manage the volume of new work that was required, and we changed tactics. Since we had passionate and engaged users, we thought: why not use them? Have a digital barn raising, a canning club. Our first, pitiful engagement in what we might call crowdsourcing today was Barb puts $25 into Amazon's Mechanical Turk and offers people the opportunity to transcribe menus for a buck a HIT. What I didn't realize is that I'd have to keep feeding the piggy bank — and I was feeding it with my own money — and a dollar per piece of work was a very high rate for Mechanical Turk; I think they paid a cent, or a tenth of a cent. Anyway, that was 2007, and that was our first attempt at crowdsourcing data.

We were saved from our own folly by the library's decision to write two grants — one to the National Endowment for the Humanities and one to the Institute of Museum and Library Services — to create an application that would in fact capture the imagination of the public and get them to participate in metadata creation. The resulting application is called What's on the Menu?, and it's at menus.nypl.org — a very simple URL. You can see it's a decent and interesting interface. The menus themselves sometimes are not exactly beautiful objects, because there are multiples and duplicates — or what look like duplicates until you start to transcribe them, and you realize that there are 365 menus from an ocean liner: each morning a new menu was written, so they look identical, but only on examination can you see that they actually have different content.

So this is what the interface looks like. We went through the process of creating JPEG 2000s for all the menu data, because it was impossible to read the text at the quality of derivative we were providing. You can see here.
You can blow it up and see something quite closely. We created a structured form so that data would go in in a structured manner — or at least we thought we did. It wasn't free text; everything had to be parsed ahead of time. We knew we couldn't just do a transcription, because the transcription would be a single string of text, and we would never be able to parse it back out. So in fact a whole data structure was created for capturing this data — including, I don't know if you can see the fields very clearly, how much a dish costs, et cetera.

There is a dashboard that allows you to see what's going on. There's also something else very interesting: we knew immediately that folks would be interested in exporting the data and being able to use it for their own purposes, so we built a service to allow people to download the data as it was created — either to download the data, add to it, and upload it back, or to just download the data set and do work on it separately.

We also created heat maps, so people could see exactly what had been transcribed and didn't have to come back and say, "Well, I don't know who's done it, or how many people have done it." This had an unforeseen consequence, which I'll talk about in a little bit. This lets you see exactly how many things have been transcribed and how many things are on the list to be transcribed. Then there's a variety of folks who are referencing it, and it shows up quite frequently in Google Images.

What's on the Menu? has some other interesting aspects: it has been connected to things in book readers, and there's deep linking to the data on the site — which also presents an issue as we go forward. As you can see, the graphs of who's doing what, and under what circumstances, are making reference to things.

So the question is: how do you engage the public in creating metadata?
That's a very interesting question. In the top frame, at the very top, you will see a blog entry by a woman by the name of Rebecca Federman. We were very lucky at the library to have a curator who is herself a culinary star, and because she was already a star and had a large following, she was able to engage her population, her community, in the project. Once her community was involved, the project took off like wildfire. And you will see on the bottom that we did our own marketing by talking about our own projects in blogs that are currently on the website.

So this social experiment — because it really is a social experiment, much more than it is a metadata experiment — raises questions. Can you harness the energy of the crowd? Well, yes, you can. But how do you keep it engaged? One of the things built into the site are gaming techniques. There are small rewards for transcribing numbers of menus: you can keep track of your score, you can keep track of the number of things that you do, and whole groups of people can compete on the site to transcribe data. There have been attempts to create flash events, where people say, "Let's go to a bar on Bleecker Street, sit down, spend an hour, and see how many menus we can transcribe in an hour" — and, believe it or not, people actually show up. And then there are compulsions: there are people who are actually addicted to doing this. They really like it. So the OCD I mentioned earlier is not restricted to catalogers; it is clearly part of other people's mental makeup.

Has it been successful? Well, yes, it's been very successful. As of Monday, December 12th, 2011, there have been 659,032 dishes transcribed from 11,112 menus, which is kind of remarkable, seeing as it's only been up for less than a year. On the right-hand side you can see what amounts to a tag cloud, which to me is kind of useless in this case.
There are so many different dishes, and they don't repeat that often — except Blue Point oysters, which seem to be on the menu in every restaurant in 1901. I don't know how good a navigation tool it is, but there are certainly a lot of dishes.

So this raises a lot of serious issues for libraries. I mentioned the heat maps earlier. What was meant as an indicator to users that work had in fact been done — "detracted" or "inhibited" isn't quite the right word; it didn't prevent, but it encouraged people not to go back and transcribe a menu that had already been transcribed. And therefore we have single transcriptions for menus — which is a feat, but is very unscientific, and we have no way to prove that the transcription is at all accurate.

The second thing that's very interesting here is: what's the public's appetite for doing this — and yes, the pun is intended. It was very successful; it got a lot of press. The question is, can you maintain that kind of activity? Can everybody in this room decide to go out and do a crowdsourcing project of such a size and get that kind of response? I don't have an answer for you. I'm not sure how long we can entertain people long enough to get them to participate in these activities. So I think another way of putting this is: are these projects sustainable as a model for metadata creation? I think they may be sustainable as part of a solution — as something you could do rarely.
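One standard remedy for the single-transcription problem just described is to collect redundant transcriptions of each menu line and take a majority vote. A minimal sketch of two-of-three consensus, assuming three volunteers per line (this is an illustration, not the project's actual code):

```python
from collections import Counter

def consensus(transcriptions):
    """Majority-vote one menu line across redundant transcriptions.
    Returns the agreed value, or None when no value gets at least
    two votes (i.e. the line needs professional review)."""
    normalized = [t.strip().lower() for t in transcriptions]
    value, votes = Counter(normalized).most_common(1)[0]
    return value if votes >= 2 else None

# Three volunteers transcribe the same printed line:
print(consensus(["Blue Point Oysters", "blue point oysters",
                 "Blue Pt. oysters"]))          # -> blue point oysters
print(consensus(["tripe", "trout", "toast"]))   # -> None: needs review
```

Lines with no two-of-three agreement would go into a queue for a professional cataloger to adjudicate, so the redundancy reduces, rather than eliminates, the cleanup burden.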
I don't see that it's possible to do this on a large scale, all the time, and replace professional catalogers — not only that, these things would be undiscoverable without the professional catalogers in the first place.

The point I raised earlier is: how rigorous is the data that we are collecting? It seems to me that if we were going to do this again, we would want to poll the community and create three transcriptions for every menu, and then do some computational work — polling and voting on which ones are accurate. There's nothing currently that prevents you from putting anything at all in those fields, which of course becomes another problem for a professional metadata person, who has to come back, clean it up, and review it. So did we just create ourselves a bigger problem?

The question that arises is: what measures are we going to use to ensure quality? As I just mentioned, possibly triangulating and having three copies — but what really amounts to quality in this case? Do we care about rigor? If we do, what kind of rigor? And if we don't care about rigor, why not? There's a lot of discussion about getting the public to work on creating metadata, and not a lot of discussion about what that means: what we would get, how we would control it, what kind of standards we'd apply to it — if we applied any standards at all — and how we would enforce those standards. On the other hand, if we raise the barrier too high, then we're not going to get any participation.

Some other issues have come up, such as: how do these new activities fit into the life cycle of digital content at the library? The library has for the longest time talked about digital content as either the reformatting of its analog content — keeping that as the permanent digital collection — or accepting born-digital collections. We're not sure where things like transcriptions fit into that schema: preservation metadata, or not preservation metadata? Is it disposable?
What's its disposition? Who owns it? Who is the curator of transcriptions of content? Is this part of the bibliographic record? Is it a mezzanine data store to facilitate discovery? I don't have any real answers; I'm just talking about the issues that we've faced.

So we've come to a Joycean moment, when interests collide — but they collide in a very special space. On one hand we have controlled vocabulary: standards, authorized lists, discoverability, quality, and management. On the other side we have tagging: personal, local, meaningful, aggregated, contextualized, serendipitous, and fun. And I didn't accidentally pick the Grim Reaper — I don't know who the other guy is; I guess he's a ninja or something, I can't quite tell — but I'm going to come back to that.

Recently in Harper's there was a very interesting piece by Andy Merrifield, from an essay called "Crowd Politics," and he says: nobody can know in advance when an epic historical-geographical performance will be enacted, nor are there preconceived formulas for what makes a successful encounter. What takes hold is what Joyce in Finnegans Wake termed a "collideorscape," and what forms will the Joycean everybody begin to express itself as it challenges the crisis-ridden — and here I put in "librarian" — order. Because I believe we are in fact challenging the librarian order with these activities.

So as much as I adore the right column, I find myself inclined to the left column, because it is part of my job at the library to ensure the long-term preservation of our digital assets. And if it's part of my job at the library, I don't know what to do with this stuff. We face a huge preservation challenge. Even if we make the decision to acquire and preserve the data set that has emerged from this initiative, we're a long way from having established policies and best practices regarding such activities and their outcomes. These are not new concerns, but these issues are now bearing down on us in a very practical and real way.

Our collections development department has talked on many occasions about the future, when we will have knowledge products created from our collections. What collections development hasn't even imagined is that the future is now: it is sitting on my doorstep, with a request to move it into the repository.

So I just want to end with the true-life story of waking up on a Sunday night — I must have been sleeping all afternoon; I guess I wake up to my email on Sunday nights — and having requests for the ingestion of no less than five digital humanities projects, the data outcomes from five digital humanities projects, with very little understanding or even discussion on the library side about what that would mean in terms of digital preservation; what it would mean in terms of the record, the historical record and the social record; and whether in fact this becomes part of our collection. These are all policies that haven't been investigated and haven't been explored, and ideas that really need to be fleshed out.

So there is data out there, and there is more and more data out there. But the question is: what part of this data belongs in the library? What part of it belongs in the digital preservation repository? What part of it is disposable? And fundamentally, who makes those decisions, and how do we implement them? Thank you. Any questions?

[Audience question]

Yeah — it's sort of similar, except the New York Times will come to the New York Public Library, give us their archive, and say, "Keep it." So it's not the creation of the data; it's what do you do with it, and where does it go? Yes, lots of people are creating data. The problem for the library isn't that we can't get people to create data. As wonderful as it is, and as wonderful a discovery tool as it is, the question is: where does this fit in our preservation strategy? What happens when the New York Times does come and leaves it to the library?
In fact, in that case it might even be clearer, because it's their collection. But in the case of ours — is this our collection? This is an issue that needs to be addressed.

[Audience question]

Yes, there's a disclaimer on the front of the site saying that you agree to this and that we can use it. That legal language covers the activity of typing in the transcription.

[Audience question]

Well, certainly at New York Public we have a very strict and very clear acquisition and deaccessioning policy. Once you take something into the library and claim it's part of the collections, it isn't so easy to get it out again, and then there becomes an obligation to do something about it and to preserve it. So my issues are really around that whole problem set: when does it become part of our collection? What part of our collection? How is it related to our collection? And how do we even begin to plan or make policies about what we do with it in the future? Right now it's a fabulously fun project, but digital humanities presents us with a very different set of issues — I mean, they might be the same issues, but there seems to be some understanding that the outcome of scientific work will produce data sets that are part and parcel of the work. I don't know that we've actually come to that point in the humanities yet, and we've not done the planning for it.

[Audience question]

Well, that's another set of data we don't have a clear policy on. Yes, absolutely, we could. There is a drive on the part of certain communities in the library to treat it not like that, but to treat it like part of the collection. So the issue isn't really about what I would prefer to do, or what I could do with it; it's a question of what the library, or collections development, wants to do with it. Do they consider it part of their collection? Will they take it in as their collection? What does that mean?
Not just as a navigation tool. Yes — as a navigation tool we can throw it away, right? As a data set it would clearly be part of the collection.

[Audience comment]

Right. It feels like that to me as well, but the issue that has been raised is that it is in fact more than that — it is in fact a data set in its own right, and so it should be acquired into the repository. I have all sorts of solutions — all sorts of technical solutions. I don't have many policy solutions.

[Audience question]

No, it's not archiving websites; we do not archive websites. It has a lot to do with the conception of what the digital preservation repository is for, and what the permanent collection is.

[Audience question]

Yes — we've got compulsive activity; we have a small group of very dedicated, very interesting folks, and then we've got people who've done it once and gone away, never to come back. So I would imagine it's the same with Wikipedia: people who are monitoring articles all the time, constantly updating them — that's their job, right? In this case we have collided with the world of foodies, which of course has its own following, and that's very advantageous for this kind of a site.
I don't know, though. Currently we're digitizing the Tilden papers, and I don't think we'll get the same rise out of the community to transcribe the Tilden papers — miles and miles and miles of correspondence. But you're right: we absolutely have a small core who are clearly dedicated. They might go on to the next thing, though — wait till they hit the next thing. And I think that's part of my concern about how sustainable these crowdsourcing projects are.

[Audience suggestion: multiple reviews]

Yeah, I agree; I think that would be a very good idea — in the same way that Wikipedia does it, where there are experts on, say, the biochemistry articles, and someone actually reviews an entry to make sure it's reasonably accurate and doesn't contain anything egregious. We can certainly computationally find the bad words, or the stuff the kids — or somebody — have put in. But I think that's a very good idea. They are doing version two: they're trying to put in more gaming technology to continue engagement, but also, I think, because when they started looking at the data as it came in, they realized there were tremendous issues with how they were ever going to monitor it and how they were ever going to clean it up.

[Audience comment:] ...had been submitted to the library — so in one case you had volume, whereas in the other case you had high quality, and I think maybe that's something to be considered in a case like this: perhaps, if contributions from some venues were of higher quality or higher value statistically than others, you could push to two groups, and then again competition would improve things.

Right — I think you raise some really interesting points. There are many steps forward from this to improve the transcription, the activity, and the quality. I'll pass that on. Thank you.

[Audience question:] Do you have any evidence that doing this has increased or changed the use of the collection?

No.
I have read many articles about it, and I've read many reviews about it, but I have no evidence that it has actually changed the use of the collection. I do know that when the menus were first published in the Digital Gallery as images, there was a large surge of people actually recreating the menus — actually cooking and serving them. Possibly that's happening again, but I have no evidence that that's the case. No, I don't. That would be interesting.

[Audience comment:] I was just curious what sort of trends you see in traffic, in terms of people looking at the menus. The other way you could think about this is the public-engagement angle: the number of people viewing the objects may be low, but the time spent is high, because you have to transcribe them. Maybe one of the things the library should do is publish statistics on it.

Yes — the statistic that so many dishes were created is interesting, but it's interesting more as a social experiment: to find out how many people were actually engaged, how many people spent time — all the things that you're asking about. And I'm sure that we could publish that. That would be much more useful: what are the trends here, and how do we work with those trends to move forward? Okay. Thank you.
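A footnote on the computational screening of "bad words" mentioned in the Q&A: a first pass over incoming transcriptions can be as simple as a blocklist plus a character sanity check. The word list and the regular expression below are illustrative placeholders, not the project's actual code:

```python
import re

BLOCKLIST = {"lol", "spam"}   # placeholder; a real list would be curated

def suspicious(entry: str) -> bool:
    """Flag a transcription for human review: blocked words, empty
    input, or characters unlikely to appear on a printed menu."""
    words = set(re.findall(r"[a-z']+", entry.lower()))
    if not words or words & BLOCKLIST:
        return True
    # Menus are letters, digits, spaces, and light punctuation;
    # anything else (markup, control characters) is worth a look.
    return bool(re.search(r"[^\w\s.,'&$()/-]", entry))

review_queue = [e for e in ["Roast beef au jus", "lol", "<b>hi</b>"]
                if suspicious(e)]
print(review_queue)   # -> ['lol', '<b>hi</b>']
```

A filter like this only routes entries to a human reviewer; it doesn't decide correctness, so it complements rather than replaces the multiple-review idea raised in the discussion.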