Welcome to the session called "Archiving Large Swaths" (swaths, that shouldn't be in a title) "of User-Contributed Digital Content: Lessons from Archiving the Occupy Movement." There are four talks, with three of us here giving them. I'll do a brief intro and then we'll go into each of the talks.
First of all, what's interesting or important about the Occupy movement? Well, here are just some statistics from Flickr. In less than six months, over half a million works were put up on Flickr with an #Occupy tag on them. And then here are just some other statistics. So clearly there's a lot of media online related to Occupy. Certain parts of the Occupy movement's discourse have entered public discourse. This happens to be an advertisement on a BART train in San Francisco: "I'm still the 99%." People are talking about the 99%.
But more importantly, I think, for us at CNI, the Occupy material resembles what we'll be facing in the future. First of all, it's a vast quantity of user-contributed material. There's no easy way for us to control for quality, file format, or metadata. If we're taking in, say, organizational records, that organization can at least try to enforce guidelines on the material that later gets handed over to the library or archive. Even when we get the works of a single individual, there's at least some degree of consistency on that individual's hard disk, whether it's file naming conventions or file formats. But there's no way we can get that consistency with this kind of vast quantity of user-contributed material. Much of this material can most easily be found on social networks, which raises issues like: do we have the right to harvest it? Is that a violation of terms of use? These are not open systems, so it's harder to harvest things from them.
So all in all, we need to find smart ways to harvest metadata and analyze files, as well as to influence the behavior of potential contributors. Those are the kinds of things that we're gonna be talking about today.
Now, in other ways, this material may or may not be what we face in the future. Certainly within the Occupy movement, there's a huge suspicion of conventional organizations, including libraries and universities. There's a serious do-it-yourself mentality. They wanna control their own story. There's a reluctance to sanction turning over material to an institution. So we've had to finesse things like this, where they just say that it's okay for anyone to use it. They don't say it's okay for this university to use it. And the consensus process becomes more difficult when each meeting attracts a different set of people. We've been to a number of meetings where we have to start from zero each time, explaining everything, because there are 20 people who were not at the last meeting. So that gets a little bit difficult.
Okay, so Occupy Wall Street in New York has its own archives working group. And in their mandate, this is from their website, in the very second paragraph, it says the group was created to ensure that the Occupy Wall Street movement will own its past. So they're very concerned about owning their own past and not having others recast the history of the movement.
Okay, so the last introductory point is just the kinds of things that I know from my previous work, most of which I think I've presented at previous CNIs. I've missed only two or three CNIs since 1989 and have given a lot of talks, though not in the last few years. So a project from the archival world, the InterPARES Project: one of its main discoveries, about 10 years ago, was that if we hope to preserve electronic records or other born-digital material, the archivists need to be involved early in the life cycle, long before the material enters the archive.
So this basically turns on its head what traditional libraries and archives have done, which is to start thinking about cataloging and applying metadata when the material gets in the door. That doesn't work when you've got an avalanche or tsunami of material coming in that was born digital 10 or 20 years earlier, in formats that you can't read anymore. The other project that I was involved in five or six years ago, part of NDIIPP, was Preserving Digital Public Television. We worked a lot on what we called pushing metadata gathering upstream into the production cycle: we tried to re-architect some of the television news production workflow so that we could grab what we needed for preservation and access early in the life cycle, when it was known. In general this information never made it into the archive, but people knew it early on. So with that project, we married a DAM system, a kind of tracking system, to a preservation gathering system so that we could save that material.
Okay, so that's the introduction. Now we're gonna have four talks. I'm gonna start out talking about a group called Activist Archivists. Then David Millman from NYU Libraries will be talking about a number of activities there. Sharon Leon from the George Mason University Center for History and New Media will be talking, and then David will be giving a talk for Kristine Hanna from the Internet Archive's Archive-It project.
So to start with, Activist Archivists is an organization composed almost exclusively of my former and current students (not just my students at NYU, but also one student from UCLA), who started gathering about the third week of the occupation of Zuccotti Park to think about what they could do to help archive the media related to the Occupy movement.
There are about a dozen people who come to regular meetings, an hour and a half a week, and there's about a second dozen who come irregularly. This is a sample of our meetings. I'm never in town so I'm always on Skype, but they put me in the meeting in various ways and I have a seat at the table there; I've attended the meetings from Brazil or England or various places. The Activist Archivists have done about a dozen projects. These range from trying to convince occupiers why it's important to archive, to helping them understand what the best practices are, to studying metadata loss when uploading to services, to coordinating discussion among different groups involved in archiving the Occupy movement.
So here's an example of what we will be distributing very shortly on a postcard. The idea was to make it very simple. One of the things that we've run into a lot in our kind of ethnographic study of Occupy is that lots of people say, why would you save that? That's just what I do. One of the characteristics of the Occupy movement has been the interesting creativity of the signs. When the archiving working group of Occupy Wall Street started saving the signs, and then started asking the organizing committee for some money in order to store those signs, there was a lot of opposition. People would sarcastically say, oh, you wanna store a sign where 10 minutes before I scribbled some witty remark on a piece of paper, and you wanna save that. So there was an awful lot of pushback. So it really was necessary to express in a concrete and succinct way why one should archive at all. And we spent a lot of time crafting this so it'll fit on a postcard. The idea is that it's for accountability. It's part of self-determination for the Occupy movement. Archives involve sharing and educating, and there is a continuity between past movements, present movements, and future movements.
As part of that, we were also encouraging people to record, collect, and preserve the material. We also came up with a very short one-page document that is featured prominently on our website, called "Seven Tips to Ensure Your Video Is Usable in the Long Term." It included things like making sure you collect details and metadata; keeping raw original footage unaltered and uncompressed; making your video discoverable; contextualizing it; making it verifiable; and allowing others to collect and archive it if you put it in a public space such as YouTube or the Internet Archive, or, if you don't do that, making sure that you have good practices for archiving it yourself.
Another project that we did was an empirical study of metadata loss when videos are contributed to various services. For this study, we recorded about 30 seconds of video on an iPhone, on an Android, and on a Canon camera. We uploaded those 30 seconds of video to three different services: to YouTube; to Vimeo, actually two versions of Vimeo, the Vimeo Pro version that requires a subscription and the consumer version; and to the Internet Archive. We then looked at hundreds of points of metadata, and we found that in general all the embedded metadata stayed with the file only at the Internet Archive; both Vimeo and YouTube essentially either corrupted, threw out, or changed the embedded metadata. So it was kind of mind-boggling. On our website you can look at the spreadsheet with all the data, and we have about an eight-page report analyzing what the problems are. It is not up yet; we're just doing a little tweaking and it'll be up quite soon. But for example, YouTube does things like replacing the date shot with the date uploaded. So there's a huge corruption of metadata, and that really bodes poorly for any collection that's trying to grab things off of YouTube, because the metadata is just all corrupt.
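As a rough illustration of the kind of comparison the metadata-loss study describes, here is a minimal sketch in Python. It is not the group's actual method or tooling. It assumes the embedded metadata has already been extracted into plain dictionaries by some other tool, and the field names and values below are hypothetical.

```python
def compare_metadata(original, downloaded):
    """Classify each embedded field of the original file as kept, changed, or lost."""
    report = {"kept": [], "changed": [], "lost": []}
    for field, value in original.items():
        if field not in downloaded:
            report["lost"].append(field)        # field was thrown out by the service
        elif downloaded[field] != value:
            report["changed"].append(field)     # field survived but with a different value
        else:
            report["kept"].append(field)
    return report


# Hypothetical values: metadata from the camera file versus the copy
# downloaded back from a hosting service.
original = {"date_shot": "2012-03-17", "gps": "40.7093,-74.0113", "device": "phone A"}
after_upload = {"date_shot": "2012-04-02", "device": "phone A"}

report = compare_metadata(original, after_upload)
print(report)
```

In this made-up case the date would be reported as changed (the pattern described for YouTube, where the date shot is replaced by the upload date) and the GPS field as lost.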
We've also created best practices for content collectors, organized by issue. For each of these issues, we describe what the concerns are and what the pluses and minuses of doing different things are. So we look at security, both in terms of sensitive material and in terms of scraping for content. We look at and analyze issues of content search, including the problems that we discovered with YouTube, and there are similar problems with Vimeo. We discuss receiving the content and how to get it, how to extract metadata, and copyright issues. For the content creators, we have a best-practices document that looks at security, including the laws some states have against hidden cameras in public spaces; other states have consent laws where any party that is videotaped needs to give their consent. We discuss capturing content, offloading, uploading, depositing with an archive, and copyright.
This next area we haven't finished yet; we're in the middle of it. We're creating an Occupy archiving kit that draws from the other things that we've done, but it's really oriented towards a local Occupy group doing their own archiving. So we have sections describing all the things that they have to think about, and of course one of those is partnering with an institution with some longevity and the ability to keep the material over time.
Two days ago, members of our group helped organize and participated in the Occupy Wall Street Archive Share Day at a local arts organization, Eyebeam, in New York. The idea here was again marshaling the wisdom of the crowd to try to figure out new things that we could do to preserve this material. A variety of tools were used that day.
One of our Activist Archivists, one of my current students, used the tool Bulkr, which allowed him to go through Flickr and do mass downloads based on the combination of the occurrence of a tag for Occupy and the occurrence of a particular level of Creative Commons license. So after setting the filters, for every matching item he grabbed the actual digital photograph and the EXIF metadata, and scraped the tag text metadata off the HTML page that the photo was mounted on. So here's just an example of using that tool. This was one of the images and this is what the downloaded EXIF metadata is. I don't know if you can see that very well, but it has basic info, the type of license, the date taken, the upload date, the geotag information, how many views, the title, a description, URLs, and the various tags that accompanied the image on the page.
So the day two days ago was similar to other activities that Occupy in New York has done: Archive Share Day, hackathon activities. They've included things like remixing older footage, creating a visual timeline from the various still and moving images that have been taken over time, and some things that we're not particularly happy with because we think the methodology is pretty poor. They did a project on Saturday of mining material for data. They were looking at co-locations of an officer's name with the words "pepper spray" to try to determine that officer X used pepper spray more than other officers. But I don't think their methodology was very good for that.
We've been really pushing the people creating Occupy records to follow Creative Commons guidelines and to declare Creative Commons licenses on their material. This is one of the documents that we've created and distributed, trying to explain what it means and that you don't give up your copyright, but also that police can use this material and get people in trouble too.
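The tag-plus-license filter described at the start of this passage can be sketched as follows. This is only an illustration of the selection logic, not how Bulkr works internally and not a call to any real Flickr API. The record layout, the license labels, and the sample records are all assumptions made for the example.

```python
def select_photos(records, tag, allowed_licenses):
    """Keep records that carry the given tag AND an acceptable license."""
    wanted = tag.lower()
    selected = []
    for record in records:
        # Compare tags case-insensitively, since contributors tag inconsistently.
        tags = {t.lower() for t in record.get("tags", [])}
        if wanted in tags and record.get("license") in allowed_licenses:
            selected.append(record)
    return selected


# Hypothetical harvested records (id, tags, and license label are made up).
sample = [
    {"id": "p1", "tags": ["Occupy", "nyc"], "license": "CC BY"},
    {"id": "p2", "tags": ["occupy"], "license": "All rights reserved"},
    {"id": "p3", "tags": ["parade"], "license": "CC BY-SA"},
]

chosen = select_photos(sample, "occupy", {"CC BY", "CC BY-SA"})
print([record["id"] for record in chosen])
```

Only the first record passes both tests here: the second has the tag but not an acceptable license, and the third has the license but not the tag.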
We also have another document about how you actually mark the Creative Commons licenses on your material. Because of the concerns, and again some pushback from some of the occupiers, we've been promoting a tool called ObscuraCam. It's called a visual privacy app, and it lets you block the identity of people. It puts a box over their eyes or over part of their face. One of our key partnerships is with the human rights group WITNESS. A graduate of my program is the archivist for WITNESS. They've been working with the Guardian Project on tweaking ObscuraCam because they're using it for people in repressive countries who are trying to distribute videos of things like public whippings of women. So basically ObscuraCam actually has a face sensor, and as the subject moves around, the black box follows the face and blocks it out.
Then the last set of things I'm gonna talk about that the Activist Archivists group is doing is our collaborations, both with different Occupy Wall Street groups, helping them store and manage their media streams, and with the Tamiment special collection at NYU Libraries. One of the things we noticed early on is that a lot of the Occupy sites do live streaming. We had some very good people trying to hack these live streams to turn them into downloads that we could grab, and over a period of months we still couldn't hack into them. We've had discussions with people. I had a long discussion with one of Occupy Oakland's live streamers. He uses the streaming service as his file server. He purposely doesn't record onto any media in his camera so that, if the police arrest him, in his mind they will not try to confiscate his camera, because there's no evidence in it. So he sends it directly over the internet to a live stream. And when he wants to edit, he downloads from the streaming service, edits, and then uploads it again.
And he trusts that the streaming service is gonna be there forever. I don't, but that's one of the perspectives that we've been trying to deal with. With the New York group, a group called Global Revolution, which is related to the Occupy media working group in New York, they actually keep masters of their streams. They keep their masters on MiniDVs, but they're having trouble managing the files. They want a digital asset management system. So one of the things we did was broker a deal to put copies of their content on a reliable archival service. And we helped them select open source tools to use for cataloging. We actually are using Omeka, which is one of Sharon's products. One of our group's volunteers has been cataloging their MiniDV collection of footage from all the general assemblies and spokes council meetings from the very beginning of Occupy Wall Street, from September 17th. So here's an example of her cataloging workstation with the MiniDVs. And here's the spreadsheet she's using before it goes into Omeka. This is using a kind of standard that they developed called Occupy Wall Street Core. And here's their Omeka.
So we're in the latter stages of getting agreement for the Global Revolution streaming collection, this very large recording of all the official meetings of Occupy Wall Street, to go to NYU's Tamiment collection. But they will not sign a donor agreement, because they say that that's conferring exclusivity on a bureaucratic organization. So instead, we have email verifying that they will execute a Creative Commons license letting anyone else use the material. And then some of our people will hand-carry copies of the material to Tamiment to make sure it gets transferred. But they will not do this officially, because that would be an exclusive agreement.
Okay, and then the last sub-project I'm gonna talk about is YouTube.
The Tamiment collection at NYU has been selectively browsing through YouTube Occupy videos, trying to choose which ones to keep, then cataloging them with these fields of metadata. Here's one of the cataloging templates. But this is not gonna scale. The last time I checked, there were 169,000 videos with the tag #Occupy. Someone is not gonna go through each of those and download them and catalog them. It's just not possible. So we've suggested, and we're starting to build a tool for, an alternative approach to the selection process. The idea is that we would first have a set of categories, like celebrity visits, internal workings, the library, the kitchen, the media tables, confrontations with police, labor actions, housing actions. And then we will have the occupiers fill out an online form listing the five most important videos in each category. That way, this kind of collaborative filtering is scalable and manageable. It's consistent with Occupy ideas of inclusiveness and of managing their own story. Tamiment can still choose to be selective in collecting only a portion of what's voted in, but the total set for review will be of a manageable scale, not in the 200,000-plus area.
But of course, one of the problems that we face with YouTube and these other different types of social networks is that the social networks, if you read the news, are starting to enforce their terms of use. This is from a week or so ago: this was the brouhaha with employers wanting job applicants to give them their Facebook login, and Facebook weighing in saying that's a violation of its terms of service. So the terms of service issue is a serious issue. The YouTube user agreement, section 5B, says you shall not download any content unless you see a download or similar link displayed by YouTube on the service for that content. But of course, there's plenty of stuff on YouTube that's on Creative Commons channels.
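The collaborative selection idea described above (each occupier lists up to five important videos per category, and the archive reviews only what was voted in) could be tallied along these lines. This is a hedged sketch, not the tool the group is building: the ballot format, the category names, and the video identifiers are hypothetical, and the tie-breaking rule is an arbitrary choice made for the example.

```python
from collections import Counter, defaultdict


def shortlist(ballots, per_category=5):
    """Tally ballots and return the most-nominated videos in each category.

    Each ballot maps a category name to a list of video ids. A ballot counts
    at most `per_category` distinct videos per category.
    """
    tallies = defaultdict(Counter)
    for ballot in ballots:
        for category, video_ids in ballot.items():
            distinct = list(dict.fromkeys(video_ids))  # drop repeats, keep order
            for video_id in distinct[:per_category]:
                tallies[category][video_id] += 1

    result = {}
    for category, counts in tallies.items():
        # Most votes first; ties broken alphabetically by id (an assumption).
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result[category] = [video_id for video_id, _ in ranked[:per_category]]
    return result


# Hypothetical ballots from three contributors.
ballots = [
    {"kitchen": ["a", "b"]},
    {"kitchen": ["b", "c"]},
    {"library": ["x"]},
]
print(shortlist(ballots))
```

The review set then grows with the number of categories rather than with the number of uploaded videos, which is what makes the approach manageable for a selective collector.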
This is a 1916 movie that's clearly public domain because of its age. There's no download button, so even though it is public domain, you're still violating the terms of service if you download it. YouTube offers the ability to give a Creative Commons license to material that you post on YouTube, but their idea of Creative Commons is to remix. They don't have any real concept of reusing it, archiving it, or anything like that. If you actually look at what they say, it's all about remixing. So what we've pretty much decided is that, even though different parts of YouTube's user agreement conflict with one another, clearly YouTube has a video editor that they let you remix things with. So if you use that video editor and you just grab those things, you're in the clear, because it's legal for you to grab them and make something new. You can make something that's identical to the original. So that's legal, but otherwise it's kind of murky, and so we have to rely on other things, and David will talk a little bit about some of these other things. So here's David Millman from NYU Libraries.
Hi, thanks, Howard. We're tight on time. Kristine's slides may not make it, but they'll be available somewhere else, I think. Okay, I'm from the NYU Libraries and I wanna talk about what one of our special collections is doing. Howard set this up, so it's not gonna be a big surprise, but I wanna show you literally what it means for us to be doing this, kind of step by step. First of all, the Tamiment Library is a collection primarily of left and labor materials, and it's a whole mix of stuff: print, posters, media, increasingly born-digital stuff, oral histories, and we've been doing web archiving there for the last few years. In terms of this Occupy Wall Street collection, there is gonna be one finding aid, that's the vision anyway, and they are bringing in material from several sources. I'm gonna focus on a couple of them. YouTube.
I think you just heard a lot about how that's being collected in that library and what the plans for that are. Web archiving we are doing with the CDL Web Archiving Service. We're currently archiving just a handful of sites. Meeting notes: we have established a pretty tight relationship with one of the Occupy working groups, and I'll talk mostly about that in a sec. Ephemera, like posters, that kind of thing. A couple of formal oral histories, which include rights statements. And we are not currently collecting any of the Google or Facebook groups, but for completeness, they exist. There's a lot of material there. I'm gonna skip this YouTube part because I think that Howard covered that really well.
Let me get into the group that we've had the most activity with, and I can show you some of the stuff that we've done in detail. I don't know if you can read this, but somewhere at the bottom of this screen, describing this group called the Think Tank, it says that they are working with the Tamiment Library at NYU. There's certainly nothing signed with them, but we've spent a lot of time developing a relationship with a set of characters that, as Howard said, changes routinely. Their mission is kind of strategic, I guess I'd say, although it varies. It's a pretty open forum; it's not about what actions are gonna be taken on Saturday, it's more theoretical, although the initial ideas for a lot of the actions come out of this group. They meet every day for at least two hours, but when you hear some of the audio, you'll realize that it can go on quite a bit longer than that, just because they're very process oriented.
So we've provided this group with hardware to capture their meetings, and the meetings can be of different sizes, whoever shows up. We've given them one of these Zoom recorders, or actually three or four of them, and they rotate in and out. And we collect chips from them every couple of weeks.
You can't read this, but maybe I can send you a copy, or if you go back and look at the slides when they're posted, you'll be able to read some of what we recommended to them as guidelines for how to collect stuff. We asked them to read approximately Dublin Core metadata onto the front of every session, so that we didn't have to collect multiple things and the metadata would be right in line with the audio. And let me just skip to what we did with it. So this is just a couple of quick screenshots of our backend repository system. This is the collection for the Tamiment Library. We're using the BagIt format. So there's the bag info for a chip that was delivered Friday afternoon. We run validation scripts; those are not websites, that's command-line backend stuff. And here's a sample manifest with checksums, that kind of stuff. There are just a couple of things I'd point out from this. One of them is that we told them to send us WAV files, and you can see they sent us a whole mishmash of stuff. In the guidelines we also asked them to use certain kinds of naming conventions, and they didn't follow that. And if you look at the path name on top, it's Crucial003. After that we've appended our own UUID just to keep things unique, because we cannot rely on any kind of technical metadata in a consistent way.
So before I talk about the oral histories, let me just play a little something so you get a sense of this. This first clip will give you a sense of the facilitation. "A platform, and if so, what should that platform be? Should we have a national message, you know? So we'd love to hear everyone's thoughts on this, and to do that we're gonna have a facilitated conversation to make sure that." So we don't have time to go through all the hand signals, but that's pretty cool. Let me just skip ahead. "Talking loudly enough for everyone to hear. People are gonna go like this; that means just speak a little bit louder."
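For readers unfamiliar with the BagIt manifest check and the UUID suffix mentioned in this passage, here is a minimal sketch using only the Python standard library. It is not NYU's validation script. It assumes an MD5 payload manifest named manifest-md5.txt whose lines have the form "checksum path", and the file names and separator used for the UUID suffix are illustrative choices.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return the payload paths that are missing or whose checksum differs."""
    bag_dir = Path(bag_dir)
    failures = []
    for line in (bag_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)  # "<checksum> <path>"
        target = bag_dir / rel_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(rel_path)
    return failures


def unique_name(delivered_name):
    """Append a random UUID so names stay unique even if contributors reuse them."""
    return f"{delivered_name}_{uuid.uuid4()}"


# Build a tiny throwaway bag to show an intact and a damaged payload.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    (bag / "data" / "session.wav").write_bytes(b"fake audio")
    checksum = md5_of(bag / "data" / "session.wav")
    (bag / "manifest-md5.txt").write_text(f"{checksum}  data/session.wav\n")
    intact = verify_manifest(bag)    # nothing should fail here
    (bag / "data" / "session.wav").write_bytes(b"altered")
    damaged = verify_manifest(bag)   # the altered file should be reported

print(intact, damaged, unique_name("Crucial003"))
```

The checksum comparison catches files damaged or swapped in transit, while the UUID suffix addresses a different problem: contributors who do not follow the naming convention and may deliver the same name twice.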
"We are recording these things. The recordings are being filed at the NYU Labor Archives just for the purposes of information, and eventually they'll be transcribed and we'll probably try and post them on the website, okay? And we're gonna send ideas that are relevant to different working groups." So they're at least mentioning us. Let me just play you another one. This one has a little better message. "So it's around 12:30 on Monday. Today's the 28th? Yeah. The very 28th? All day. All day. Yeah. So we're having a People's Think Tank on the discussion topic of how we should be educating people about our message and what we should be educating them on." So lots of Dublin Core there: title and date.
So let me stop soon. I wanna leave time for Sharon. We have a couple of formal oral history projects. One thing that's interesting to collect, and we're doing this with our web archiving, is the minutes working group, because until something appears in the minutes they won't release funds for activities. So that's a pretty valuable archive, and again, we're getting that from our CDL web archiving service. And again, there's different ephemera that we're collecting that's not digital. Let me just stop there so that we have time to get maybe a question, and we'll find a place to put Kristine's slides.
You'll be treated to my interpretive dance here in several seconds. So my name is Sharon Leon and I'm the director of public projects at the Roy Rosenzweig Center for History and New Media at George Mason University. We began a sort of sweat equity project to try to archive born-digital materials that were being created by Occupy groups who were not in New York, because we worried that the small Occupy groups were not going to get covered. And so the Occupy Archive was born out of that effort. We have done a lot of digital collecting projects in the past, probably the most famous of which is the 9/11 Digital Archive.
We also did a project collecting the stories of Hurricanes Katrina and Rita. So we had some practice doing this, but those were both projects that were funded by the Sloan Foundation. The Occupy Archive isn't funded at all. It was an effort of staff and graduate students from the Center for History and New Media, and then eventually a collaborator from our art school, to do something that we thought Roy would like. In honor of the anniversary of Roy Rosenzweig's death, we launched the Occupy Archive on October 11th, which was the date of his death, to do something positive to celebrate his life. Since then, we have collected almost 3,200 items in a variety of ways, and it allowed us to test a little bit of a workflow amongst our tools. What we did was take the Occupy Together list of about 500 websites and Occupy sites and divide them up into groups of 50; everybody took one or two, and we started to use Zotero to web snapshot everything that we could find related to those individual sites. Then from Zotero, we imported those materials into Omeka. It took us maybe an afternoon to get the Omeka site up and live, and that was not too big a hurdle. We do that on a regular basis. We like to tout the five-minute install of Omeka, and it turns out that on most days it's true. And then one of our designers made us a front-end design. So as I said, we've gotten about 32,000 items based on this sort of free time that graduate students and staff members had around the edges since October. And you can see that we're getting sort of a mishmash of things between images and stories and things like that. We also built a little feed scraper that goes out to Flickr and grabs stuff with certain tags on a regular basis. So we're pulling in some of the Flickr items. We're pulling in all of the metadata and the thumbnail, and if it's got a Creative Commons license on it, we pull in the full-size image.
We also have some partners out in the world. I have a friend who works at the University of Utah, and he sent his students out in Salt Lake to do oral histories, which they then uploaded. So we have a handful of collections from collaborators around the country as well. But the thing that makes the Occupy Archive like our other collecting sites is that there is a contribution portal where anyone can upload a story, a file, anything like that. Normally in a funded project we would have done an awful lot of outreach to publicize this archive. We have not done as much of that because we just didn't have the boots on the ground or the time or those sorts of things. But we are getting some contributions that way. You can see that this is the beginning part of the form. It's not a very complicated form, but it does allow folks to geolocate their stuff, and that's kind of nice. So we get to pin materials to particular places. And we have gotten a little bit of press. There was a nice little Economist piece about why you would wanna archive this stuff. I think that, as Howard said, it's not necessarily self-evident to most people out in the world why we would wanna save this material. We've also done a little bit of work on the archive ourselves. Our friends at MediaCommons, who run In Media Res, spent a week thinking about Occupy as a site of creation of media, and so we did some thinking about the graphic media that's in that collection. And so that is where we are here in April. That's about as fast as I can do that.
Questions, comments? We'll post Kristine Hanna's slides on Archive-It and what the Internet Archive is doing for archiving websites. But questions, comments, no one? No comments? No? Trisha, who's from Oakland, California, and who said she was gonna set up a tent in the back of the room. Yes?
Hello. So a quick question. We were wondering, I think it's great that you can black someone's face out.
What about someone's voice, if you're doing recordings? Is there a tool that you know of that can disguise somebody's voice, like helium talk or something?
Yeah. WITNESS, the human rights organization that we work closely with, actually has a set of guidelines for doing interviews that include, if it's a video interview, having a heavy backlight so that viewers can't make out the person's face. They have some recommendations as to voice distortion as part of that. We haven't yet started examining that, but we've certainly talked with the WITNESS people, and they actually do have something like that because they need it in human rights issues.
Are those open source tools?
All of them.
Great.
Yes, we only use open source tools. It wouldn't be very Occupy of us if we did not. Thank you, thanks for feeding it to me. Other questions? Are we out of time? Lane. Only your Oakland friends are asking questions.
I was curious, when you were trying to make the case to these groups about doing all of this, and you know you tried to use their language, their sort of framework and all that, did you ever have occasion to show them archives from the past for similar movements, like the Free Speech Movement or anything like that, and what was the reaction?
Yeah, absolutely, and that is a highly effective thing to do, except when you're in a screaming match in the dark, you know, and people are saying, yeah, you archivists, you know, you freeze history. Except in exceptional cases like that, that's highly, highly effective, and we do talk a lot about previous movements. One of the first things we do is say, well, do you know about May '68? Do you know about the Free Speech Movement in Berkeley? Do you know about the sit-ins at lunch counters in the South? Yes, yes. Well, how do you know about that? Where did you find out about that? Oh, I read a book. Well, the person who wrote the book, where did they find out about that?
They looked in archives.
There's another, oh, I can do it from here. Another interesting process thing we did with them when we were working on guidelines and those kinds of things was using the same consensus method that they use among themselves. So we actually agreed on metadata fields that way, and it was pretty fun. We all read, you know, their philosophy the night before, and so we kind of got ourselves to a place of consensus in this really strange way, but it worked pretty well. It's too bad they didn't collect those fields, but we got the consensus.
Yeah, I mean, in terms of what David just said, there's a big difference between what people say they're gonna do and what they actually do. You know, when the first batch of Think Tank audio recordings came back, when we used MediaInfo and looked at the dates on them, they were all like January 2011. There was no Occupy in January 2011. They hadn't set the clock, or they'd removed the battery, and, you know, there are just endless issues like that where someone forgets to do something. So we've really worked very hard, I think, between David's group, the Tamiment people, and my people, to have redundant ways of collecting these things. So, you know, there's the script they're supposed to read on the audio. There's the file naming convention that has some information. So it's redundant information, because they're just not gonna follow all the guidelines in all cases.
Well, thank you all. Thank you.
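The redundancy point made in the final answer (embedded dates that read January 2011 because a recorder clock was never set, backed up by the file name and the spoken script) can be illustrated with a small sketch. This is an assumption-laden example, not the workflow either group actually runs: the source names, their order of preference, and the sample dates are hypothetical. The plausibility floor uses the September 17th start of Occupy Wall Street mentioned in the talk, in 2011.

```python
from datetime import date

# Recordings of Occupy Wall Street meetings cannot predate the occupation.
OCCUPATION_START = date(2011, 9, 17)


def resolve_date(candidates, earliest=OCCUPATION_START, latest=None):
    """Return (source, date) for the first plausible candidate, else None.

    `candidates` is a list of (source_name, date_or_None) pairs in order of
    preference. Missing or implausible values are skipped so that a later,
    redundant source can supply the date.
    """
    for source, value in candidates:
        if value is None:
            continue  # this source was not provided
        if value < earliest:
            continue  # e.g. a recorder whose clock was never set
        if latest is not None and value > latest:
            continue  # a date after the delivery date is also implausible
        return source, value
    return None


# Hypothetical recording: bad embedded clock, naming convention ignored,
# but the facilitator read the date aloud at the start of the session.
candidates = [
    ("embedded", date(2011, 1, 1)),
    ("filename", None),
    ("spoken_script", date(2011, 11, 28)),
]
print(resolve_date(candidates))
```

Because each source fails in a different way, checking them in sequence recovers a usable date in cases where any single source would have produced a wrong one or none at all.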