Hi everyone, thank you for attending. This is the 11 o'clock session, "After the Harvest: Preservation, Access, and Researcher Services of the 2016 End of Term Web Archive." We'll talk about what that is, and we'll have plenty of time for questions and discussion at the end. There's been a good bit of attention on the preservation of government and presidential web content due to the transition, so we'll talk about some of the issues around that at the end, and we encourage folks to comment and give feedback.

All right. Hi everybody. I'm Abby Grotke from the Library of Congress, and we have Jefferson Bailey from the Internet Archive and Mark Phillips from the University of North Texas Libraries. We are three of the many partners, many thousands of partners; that's an inside joke. We have about eight partner institutions, which we'll talk about, helping build this archive. We're going to talk a little bit about the history of the archive, what we're doing now, and also our plans for thinking about researcher access to the collections.

We like to say it began a long time ago, far, far away in Canberra, in 2008 I think. We were meeting at the International Internet Preservation Consortium meetings at the time, and we had just gotten word that our colleagues at the National Archives, who are in the room, were not going to do an end of term crawl of the .gov domain as they had done in 2004. A number of us sitting around the table were already doing archiving of this nature in our own efforts, and we just looked at each other and said, well, somebody's got to do it. So at that time the Internet Archive, the Library of Congress, the California Digital Library, the University of North Texas, and the Government Printing Office, now the Government Publishing Office, gathered together to form the End of Term archive. At the time we were all IIPC members, and we were also all NDIIPP partners and NDSA members as well, so we were already working together on a number of projects; this just seemed like a natural fit.

Like I said, we were already doing a bit of government crawling on our own. I won't go into all the details, but basically there is a lot of activity from a number of institutions around archiving .gov. We like to say the U.S. government's web presence is not easily defined, and neither is the .gov domain, so various organizations have taken different slices of it depending on their research interests and needs. By coming together we could really build a community around this and expand our collections. There have also been other community efforts, and in more recent years we're part of the federal web archiving working group, and there are other research projects and citizen projects to document .gov.

So the goals were to work collaboratively to preserve the government web, to document the federal agencies' presence during a time of presidential transition, and to enhance our own research collections. Part of this is that we're all doing collecting, but then we bring the archive together, make preservation copies, and share them with partners, so there's more than one copy of the archive available. We also wanted to raise awareness, and we can talk a little more about this; this year we've raised the most awareness in terms of outreach, making people aware of the importance of preserving government information and of web archiving.
We also wanted to use the data to engage with researchers who are involved in web archiving, and with other subject experts.

The distribution of work has changed depending on the year. We started in 2008, we did another archive in 2012, and now we've gathered again for this End of Term transition project. Over time we've adjusted who's doing what and what pieces of the web we're each crawling, but you can get a sense of the things that need to be done. Mostly we've been focused on collecting and preserving, but we are looking more at access and making things available, which Jefferson will talk about a little later. We all pull together to collect content; some of us work on the access side, some of us coordinate volunteers, which we'll talk about in a moment, and other people are extracting metadata, archiving social media, things like that.

So we have a lot of funding. No, no, we have no funding for this project. There are no grants currently supporting this activity; we're all contributing our own institutional resources as part of our own web archiving programs. We like to joke that it would be nice if we had a lot of funding coming in, but actually there's only a handful of us really committing resources at our own institutions for this. That's not to say we won't eventually get some funding for some aspects of it, and we can talk about that in the Q&A.

Defining the web presence has been one of the biggest challenges. We started the 2008 archive with a few data sets. One was the crawl that NARA had done in 2004, so we had access to that list. We had a couple of other lists of seed URLs, the starting points for the crawler, and we've grown that list over the years. So we start with bulk lists of domains; there's not one big list that we can consult that has everything, so it's been a bit of a mishmash trying to find all the sources. It's gotten a little easier this time around. We have access to more from data.gov and USA.gov, and for the first time this year we have received a list of URLs from Google. They provided, is that how to articulate what they gave us? Yes, a list of URLs for anything .gov that they were aware of. All of this together becomes a starting point, but then we also rely on other folks to help us identify specific things that are of concern to subject experts and the community at large.

So a big part of what we're doing is, we have this bulk list, but we also engage the community, and we've done a really great job of that this year. We're really excited that we got some press in the New York Times a few weeks ago. One of our partners, Debbie Rabina, who participated in 2012, got a class of hers at Pratt to identify social media content for us. She gathered together a group of subject experts in New York a couple of weeks ago, and today, I think right now, she's holding a second event. We're also seeing other events sprouting up that we really haven't done anything to create; these are coming out of community interest in the project. UPenn, UC Riverside, Toronto, and other organizations are holding little hackathons. It's been really fun and exciting for us to see the community engage around the materials this year. We also got some more press today, so we expect more interest: there was an article in Library Journal and also one on Motherboard, a Vice publication, today.
So we're getting a lot of interest this year; we sometimes wish for the quieter times, when we didn't have as much attention on the project. As for volunteer contributions, we've had them since the beginning: in 2008 we had about 500 nominations from 26 nominators, in 2012 that increased greatly, and we've just updated our numbers. We expect this to increase as we get closer to the inauguration, and we'll talk a little bit about the schedule later on.

Government websites can be pretty much anything, we've discovered in this project. We have .gov; that's an easy one. But there's also a lot on .com and .edu that we discovered in some of the earlier projects that isn't really showing up on the lists we have access to, so that's where it becomes really important that the community helps us identify some of this content. Jefferson has good lingo for these slides, and I can't remember what it was; we sometimes jump around on who does the slides. You want to jump in? "A web of ephemerality," sorry. Some websites we don't have access to in the public lists; they're not listed anywhere, so having users who are familiar with this content has been really important to help identify it. There's certain content we can't archive, things behind passwords and logins, and databases are a problem, and there's been a lot of interest this year particularly in those and in data sets. There's just a lot of government content, so trying to figure out what we need to be preserving during this time, and trying to ensure that we're doing a good job, is tricky.

The End of Term archive, the 2008 and 2012 archives, are available through a portal that the California Digital Library is hosting: the Internet Archive hosts the archived content, and the CDL interface provides the metadata, search, and browse. We recognize that this is probably a bit outdated and we need to do a little better. Our old partner Tracy Seneca is here, and she helped us set this up, but that was a number of years ago. Particularly now that there's a lot more interest in this content and in making it accessible for research use, we need to do a better job on the access end. Just to mention a couple of other efforts: we have a Twitter account you can follow to see updates from the project, and there have been papers published around the work that our volunteers have done, so it's really exciting to see, and to think about what might come out of some of these datathons or hackathons or nominate-a-thons, whatever we're calling them. There are also a number of other related efforts around this time. The Internet Archive has done a crowdsourced election collection, and other institutions have related efforts; at the Library of Congress we also archive campaign websites, which is separate from this archive but all related. And I hear the Internet Archive is also doing some work with the White House in terms of documenting the web and social media data as part of the transition process, so I'm sure Jefferson can answer questions about that later.

So, handing it off to Mark now to talk a little bit about what we know about the archive we have collected so far. One of the things that is kind of interesting about this collection is that we tend to get together for the identification and the capturing, and we do a big data transfer.
Any of the institutions can take a full copy of all of the data: LC takes a copy, the Internet Archive takes a full copy, and UNT takes a full copy, and then we kind of wait for three years and go about our normal lives and come back. But one of the things we kept getting asked was, well, now that you have longitudinal data, you have 2008 and 2012, surely you've learned a lot and you can tell us what's changed, and we didn't really have a lot of those numbers. We did a little bit of exploratory research on this for a paper this summer at the Web Archiving and Digital Libraries workshop at JCDL, and that's a bit of what this presents.

Just some numbers on the archives: 2008 is probably the set that we've done the most work with. There was a research grant that UNT got to look at classification of the web archive, and we actually got to dig into it a little. But the full 2012 collection was actually quite a bit larger than 2008. One of the questions we get asked a lot is who coordinates all the crawling so that you aren't duplicating effort, and when we say no one really does, we just do our own thing, everyone says, but you're wasting all of this effort. One of the things we looked at was just the PDF content we had between 2008 and 2012. The big pie chart in red and blue is the PDF content: the red is content that was present in 2008 but not present in 2012, so missing from the later archive, and the little blue slice is the PDF content from 2008 that we found again in 2012. So it's a pretty small slice of the pie, and this wasn't based on URLs, it was based on content hashes of the files, so files that migrated to different parts of the federal web still count. Even so, there was quite a bit of PDF content that was no longer available; a small sketch of this kind of digest comparison follows below.

Then, when we look at the overlap and how much we crawl: the blue pie chart is the amount of content that is unique to the Internet Archive across all of the crawls, and it's only a small slice that actually duplicates the other archives, and you see that with all of ours. So the more we crawl, the better picture we get. There's overlap, but it's the good kind of overlap; it reinforces content that we're not getting from others.

We also got asked, when did you crawl? And I think this gets into more important ideas, especially for these really large collaborations on collecting: you have a researcher who wants to use the collection and they have all these questions, like when did you crawl it, how did you crawl it, what did you do, and a lot of this we didn't really have great numbers on. But this is a nice little graphic of the 2008 crawling activity and the 2012 crawling activity. 2008 was different from 2012: you see the big green spike, which was content captured right after the inauguration, and then in 2012, when there wasn't a change in office, we had a somewhat different crawling pattern, because we knew there wouldn't be the same kind of volatility that we expect in years with a transition. I'm sure our graphs this time will look more like the green, with quite a big spike after the inauguration; we've been doing kind of a slow-burn crawl going into it, and then there will be a spike of activity as we go. But that's a little bit of the breakdown looking between the two.
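A minimal sketch of that digest-based comparison, assuming each crawl has been reduced to a CDX-style listing of PDF captures with a content-digest column (the file names and the column position are assumptions, not the actual EOT data layout):

```python
# Rough sketch: compare two crawls by content digest rather than URL,
# so a file that moved to a new location still counts as "still present".
# Assumes whitespace-delimited CDX-like lines with the digest in a known
# column (column index is an assumption; adjust for your files).

def load_digests(cdx_path, digest_col=5):
    digests = set()
    with open(cdx_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) > digest_col:
                digests.add(fields[digest_col])
    return digests

eot2008 = load_digests("eot2008_pdfs.cdx")   # hypothetical input files
eot2012 = load_digests("eot2012_pdfs.cdx")

still_present = eot2008 & eot2012   # 2008 content found again in 2012
missing = eot2008 - eot2012         # 2008 content with no 2012 capture

print(f"2008 PDF digests: {len(eot2008)}")
print(f"still present in 2012: {len(still_present)}")
print(f"missing from 2012: {len(missing)}")
```

Because the comparison keys on digests rather than URLs, a PDF that simply moved to a new path is not counted as lost.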
So the 2008 collection and the 2012 collection have roughly the same number of top-level domains, the .coms, .govs, .mils, but there are only 225 that the two have in common, so there are some outliers that are kind of interesting and weird. When we start to look at the domain names themselves, you see that we obviously have more domain names in 2012, and there's actually less overlap than I always expected there to be; when you go down to subdomains you get even more names and, surprisingly, still not as much overlap. So you really do see, as the web evolves and as the federal government continues to use the web as a vehicle to communicate its goings-on, more and more content out there.

We looked a little at the biggest changes, and some of this was planned. One of the things we did in 2012 was really start to capture social media, so there is a lot more .com and a lot more of the non-.gov, non-.mil content. We also said we really needed to do a better job of capturing the .mil content; we had gotten some of it in 2008, but we focused on it quite a bit more. So you'll see we captured 484 percent more .com content in 2012 than we did in 2008, we actually went down in .gov content, which is weird, and then we had quite a bump in the amount of .mil content. And then .ly: we had a huge jump in the amount of that content. We saw a lot of evidence of something that happened between 2008 and 2012, which was link shorteners; we hadn't really had much evidence of them before, and then suddenly in 2012 you see them all over the place. Our largest-changing sets are .ly, .me, and .gl, which are all variations on link shortening, and that's an interesting phenomenon: something we know about the web, but don't necessarily think we're going to see in all of this content.

When we look at just the .gov and .mil content, the biggest change we had was in the House and the Senate; sorry, not the biggest change, but those are among the top sets, and you can see the change over time. The biggest ones you see are osd.mil and navy.mil, and this goes along with the idea that we said we would capture more .mil, and we can show that yes, we did improve that. And then we gained quite a bit in the House and Senate because of some of the different ways we were crawling those over time as well.

Some interesting things to look at are also domains that existed in 2008 but not in 2012, and not just domains, but domains that had lots of content. Usually this is really well planned; there have been initiatives to reduce the growth of domains in the federal government and make things simpler, and you see some of that effort here. You see that geodata.gov, for example, got folded into some other work, and there are a number of others. These are the numbers of URLs that were present in 2008 for domains with no trace anywhere in the 2012 content, which is kind of interesting.
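As a rough illustration of the TLD, domain, and subdomain overlap comparisons just described, here is a small sketch that works from two flat URL lists. The file names are placeholders, and the naive "last two labels" domain split is only adequate for suffixes like .gov and .mil; a library such as tldextract would handle registered domains more carefully:

```python
from urllib.parse import urlsplit

def hosts_and_tlds(url_file):
    hosts, domains, tlds = set(), set(), set()
    with open(url_file) as fh:
        for line in fh:
            host = urlsplit(line.strip()).hostname
            if not host:
                continue
            parts = host.split(".")
            hosts.add(host)                    # full subdomain, e.g. water.epa.gov
            domains.add(".".join(parts[-2:]))  # naive registered domain, e.g. epa.gov
            tlds.add(parts[-1])                # top-level domain, e.g. gov
    return hosts, domains, tlds

# hypothetical flat URL lists, one URL per line
h08, d08, t08 = hosts_and_tlds("eot2008_urls.txt")
h12, d12, t12 = hosts_and_tlds("eot2012_urls.txt")

for label, a, b in [("TLDs", t08, t12), ("domains", d08, d12), ("subdomains", h08, h12)]:
    print(f"{label}: 2008={len(a)} 2012={len(b)} shared={len(a & b)}")
```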
The other thing that happens is content that's new, so these are the new parties to the set. You have a lot of the .mil content that we weren't getting before, and then you have a lot of the initiatives; if you're familiar with the Obama administration, as I am just slightly from looking at this, you see things like transparency.gov and some of the initiatives we associate with that administration coming in and becoming more visible in the archive. So as we go through and look at the 2016 collection, once we're all said and done with that, we'll be able to start looking at this longitudinally over three different sets. We wish we had more sets in between, but at least we have these four-year data sets as we go forward. I will turn it over to Jefferson now.

Hi. So one of the things that we have been trying to do, especially because the End of Term collections are this longitudinal, snapshot-ish thing with some organic character, is figure out how to support researcher access in a variety of ways: not just computational analysis, which obviously is growing in popularity, but also just insight into some of the things that Mark mentioned, like provenance, capture information, and information about the crawl beyond just replay of the web pages.

One of the projects we did, started about two years ago or so, was working with a third party, Altiscale, which is a sort of Hadoop-in-the-cloud service provider; the people who created Hadoop at Yahoo spun off and formed Altiscale, and they do a lot of support of data mining projects and computational infrastructure for research use. We extracted a lot of the .gov content, including a good bit of the End of Term collections, put 100 terabytes into their research cluster, and then waited for people to show up. The build-it-and-they-will-come model is maybe not yet quite mature enough for data mining research, at least in the academic social science community, but we did have at least three or four users. The most successful ones were probably the political science department at the University of Washington, which does a lot of PoliInformatics work and is already data mining government content, legislative and digitized materials, at scale, and Rutgers, which is looking at the communications and social aspects of government content and change over time. They have both published papers and looked at things like term frequency changes over the course of the two End of Term collections, and how government properties around specific topics emerge, disappear, and change, content drift. So there are a couple of publications out there if you look for End of Term or .gov data mining, and there are good examples of this, but we haven't necessarily gotten the uptake we would have expected, especially since this was a fully subsidized project: people could basically log in and use the data, and all the compute was also subsidized by Altiscale, so it was basically come and use the infrastructure. I think there are still a lot of hurdles for all of us to get over as far as providing the support mechanisms, the educational aspects, the training, even just the account administration of using data mining platforms, so I think there's work to do there.
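To give a flavor of the term-frequency-over-time work mentioned above, here is a toy sketch. The real analyses ran as jobs on the Altiscale Hadoop cluster over the full collections; this assumes you already have one extracted plain-text dump per collection, and the file names and example terms are made up:

```python
import re
from collections import Counter

# Toy version of tracking term frequency across the two collections.
# Assumes one plain-text dump per collection (hypothetical file names).
def term_counts(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            counts.update(re.findall(r"[a-z]{3,}", line.lower()))
    return counts

c2008 = term_counts("eot2008_text.txt")
c2012 = term_counts("eot2012_text.txt")

for term in ["transparency", "recovery", "broadband"]:
    # normalize by total tokens so the two collections are comparable
    f08 = c2008[term] / max(sum(c2008.values()), 1)
    f12 = c2012[term] / max(sum(c2012.values()), 1)
    print(f"{term}: 2008={f08:.2e} 2012={f12:.2e}")
```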
But we've certainly seen a lot more interest, and we've gotten a number of researcher emails already, and we've really only started crawling maybe a month or two ago, so I think there'll be a lot more of this type of research in the future.

Another way we're trying to give access to researchers is through derived data sets. For people who don't want to deal with 100- or 200-plus-terabyte collections, or don't have the infrastructure to take a copy, we will do data mining internally on our own cluster and try to create data sets that can inform specific types of research. So we extract certain metadata from each individual resource, like page title, links, anchor text, things like that, and put it in a much smaller file that still contains useful information for data mining across the whole collection but is much easier to deal with. We also do link graph files; network analysis is quite popular, so this is what links to what, with the timestamp, and you can of course track that over time. There have been a couple of efforts around this with Canadian government content, which we've worked on with some folks in Canada; if you go to webarchives.ca there's a good network analysis of Canadian political parties, it's a little more political parties than government, but you see how departments and agencies link to each other, and then don't, and then get in fights and stop linking to each other, and so forth. So graph analysis is super popular for web resources over time. And then we also extract named entities, that's people, places, and organizations, and all of these are of course associated with timestamps and with the URL they came from, so you can do data mining like that. These can be delivered, usually as JSON files, that people can interact with, and we've also been building APIs on top of them.

The other angle of research support is the more hackathon, community-oriented model. We work a good bit with Archives Unleashed, which is a mostly NSF-funded project to get researchers to use web archives more. We had an event at the Library of Congress in June and one at the University of Toronto, I guess that was March, so two last year, and we'll have at least two coming up next year. At a lot of these we use the .gov data sets, not all 100-plus terabytes, but smaller portions, like just whitehouse.gov or things like that, that people can work on locally on their laptops, with engineers and archivists and collection people there in the room with them. Those have been super successful. There's webarchives.ca, and we've also supported similar-style hackathons using .gov data in Europe. The .gov content is public domain, so it's very easy to give to people, and it gets around a lot of the issues with trying to get permissions or rights, so a lot of hackathons working with web archives end up using the .gov stuff.

Semi-affiliated, but I figure I'd mention it: a lot of the site search capabilities we're putting on the Wayback Machine at the Internet Archive, at web-beta.archive.org, are driven by the same desire to give information, insight, and provenance about captures, and that is useful for .gov. One thing here, on the left, is APIs around capture information at a host and domain level.
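As a rough illustration of that kind of capture-level lookup, here is a minimal sketch against the Internet Archive's public Wayback CDX API; the endpoint and summaries behind the beta interface may differ, so the parameters here are an assumption about how you might reproduce the per-year, per-MIME-type counts described next:

```python
import json
from collections import Counter
from urllib.request import urlopen

# Count captures per MIME type per year for a domain, using the public
# Wayback CDX API. The limit keeps this toy query small.
CDX = ("http://web.archive.org/cdx/search/cdx"
       "?url=justice.gov&matchType=domain&output=json"
       "&from=2008&to=2012&fl=timestamp,mimetype&limit=5000")

with urlopen(CDX) as resp:
    rows = json.load(resp)

counts = Counter()
for timestamp, mimetype in rows[1:]:   # row 0 is the field-name header
    counts[(timestamp[:4], mimetype)] += 1

for (year, mimetype), n in sorted(counts.items()):
    print(year, mimetype, n)
```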
For instance, for justice.gov or whitehouse.gov you can hit this endpoint and look at the number of captures per MIME type per year, as well as information like whether a capture is new or a revisit, meaning something we already had that hadn't changed. That's super useful for people interested in looking at how a domain changes over time, and there are similar visuals on top of each domain: graphs of how much content was images, how much was media, and so on, so there's a good bit of that.

We've also been doing extraction projects at a specific domain, host, or subdomain level, around specific types of content that people would be more interested in. So we have a PowerPoint and PDF search; it's basic faceted search, you can look by the domain, the year, the MIME type, and then do a sort of keyword search, which mostly runs on anchor, meta tag, and title text, so it's not necessarily full content text, but once you start mining across domain and year and content type you can get pretty good search results. There are other extracted special collections; this is one we did for our 20th anniversary, which was extracting every PowerPoint from the .mil web domain. So if you're interested in looking at military PowerPoints, there are about 50,000 of them in this collection, all associated with the hosts they were captured from, and you get some pretty insane graphics. There's a great article about military PowerPoints, if you search for Paul Ford military PowerPoints, and it's this mind-boggling mix of war bureaucracy and PowerPoint culture, and I have no idea where that ends up, but it ends up somewhere very weird.

So this year, for EOT 2016, we actually had a number of new partners join. The core partners of course remain, but George Washington University Libraries, which is here in town, has built the Social Feed Manager tool that was talked about a little in the last session. They are using it as an API-based capture method, so they're not crawling; they're hitting Twitter, and Weibo, though not for government, and Tumblr and a couple of others, and doing social media capture via API as part of the project, so that's super cool and exciting. Stanford University Libraries, which has a big government documents program, is coordinating a lot of other FDLP and GovDocs libraries to contribute to the crowdsourced seed nomination effort. And the federal government web archiving working group has also been contributing; they have helped us on occasions like when the Senate blocks our crawlers, which happens quite often. They have good connections to the congressional bodies, so they sometimes help when we encounter technical difficulties on the server side.

As for our time frame, we kicked off in the summer; we have been having regular calls, regular meetings, and lots of outreach, especially around the public nomination of URLs, and we've all been crawling for a good while now. The plan for the crawling is that we will stop the day before the transition and start again the day after, so that there's a little segmentation between the pre-administration-change crawl and the post-administration-change crawl, and it will run a couple of months after that. So there should be good documentation of change, at least immediate change, once the administration turns over.
It will take a little while to do the full-text search and the metadata and get it all into the public access portal, but all of that should happen sometime in the middle of next year, and I'm excited, at least, to explore other research opportunities.

So what have we thought about this year? Expanding acquisition: I mentioned the API-based harvest, so we're using a number of different crawling mechanisms, not just regular link-hop crawling but also browser automation, browser-based crawling, and API captures. There is definitely a panoply of crawling strategies that we're doing a lot more of, which a lot of us are doing in our day-to-day capacity anyway. We're exploring things like new search tools and new indexing tools, so we'll probably do some language analysis to see, for example, the volume of Spanish-language content on the .gov domain, which is also something we're working on. We've seen a lot more community engagement around nominating seeds, but we've also had a good bit of interest from the federal government itself in helping us get seeds, so we've been using things like the U.S. Digital Registry API to get all the social media accounts, and a number of other government APIs, some of which have just given us dumps of their content. So between the seed nominations, how we discover things, and, as Abby mentioned, Google being interested in the project and helping out, we've been getting much more stuff to crawl, which is exciting. We've also seen more educational opportunities; we mentioned the Pratt class, but there will be a couple of other classes at other LIS programs, or undergraduate programs, that are going to be talking about .gov and the End of Term project and looking at government change and how it's represented on the web. And then researcher engagement, which I've mostly already mentioned, so I'll tie it up so that we can chat a bit now.

We've talked about all the positives, so what are the challenges? Content is complex: there's much more dynamic content on the web than there was four or eight years ago, and some of that is challenging to capture with scalable crawling technologies. There's obviously just much more content on the web; the social media registry alone had over 10,000 registered social media accounts for government agencies, and those are just the people who bothered to register with the GSA registry, which is probably not all of them, so that's a pretty mind-boggling number in itself. And obviously it's a best effort, what we can do with our time and resources, so there's not necessarily a high degree of QA beyond the automated QA methods that a lot of us use already. There are a lot more partners, so a lot more partner management and project management work, and a lot more seeds. And there are technical and time limitations; the presidential change will happen regardless of our crawling strategies. So there are some challenges.

As far as the seed list, I think we have over 200,000 unique domains that we know about that are in the crawler. Google gave us a very big index dump that we haven't finished parsing yet, but that's URL-specific, not domains; the domains are basically set up to scope in subdomains, scope in directories, and all that sort of thing.
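As a toy sketch of that scoping idea, domain seeds that pull in their subdomains plus directory seeds that pull in everything beneath them: the seed values here are purely illustrative, and real crawler scoping is configured quite differently, so treat this as a simplification rather than how the EOT crawls are actually set up.

```python
from urllib.parse import urlsplit

SEED_DOMAINS = {"epa.gov", "osd.mil", "house.gov"}      # domain seeds: subdomains scope in
SEED_PREFIXES = {"https://www.gsa.gov/open/"}           # directory seeds: everything under them

def in_scope(url):
    host = (urlsplit(url).hostname or "").lower()
    # host matches a seed domain exactly, or is a subdomain of one
    domain_hit = any(host == d or host.endswith("." + d) for d in SEED_DOMAINS)
    prefix_hit = any(url.startswith(p) for p in SEED_PREFIXES)
    return domain_hit or prefix_hit

print(in_scope("https://water.epa.gov/drink/index.html"))   # True: subdomain of epa.gov
print(in_scope("https://example.com/epa.gov"))              # False: not a seed host
```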
So I think we'll have a very definitive list. We also mine a number of search engines, the global Wayback historical data, as well as third-party sites, and we get donated lists from domain registries, so we're not lacking in seeds, I'll say that.

How you can help: judicial branch websites, for one; we've also heard from a number of people about environmental and climate data sets this year. We are getting a very good capture of data.gov, but sometimes there are FTP servers or other directories that the crawler might not discover because they're hidden or their URLs are generated in a really wonky way. So if there are subdirectories that are very hard to find, or many hops away, that have rich content in them, even if that content is not directly linked, those are very useful to identify. We've had a lot of back and forth about government sites on non-.gov domains, and that's a judgment call; some of those, like the FRBs, the Federal Reserve Banks, are not .gov but are clearly government entities, and national parks are similar. The more nominations the better, and we'll probably just crawl it all anyway, so you might as well nominate it. There is a nomination form, here's the URL, and you can probably Google it too. It does allow for dedupe, so if somebody has already nominated something you'll know, and you can add some metadata, which will be useful when this all ends up going into the portal.

So what's our plan? We're crawling as much as possible, and we'll probably crawl even more as January 20th comes up. On access, there's obviously the portal for looking at the captured pages and sites, but I think there are other access methods we'll think about for this year. We do share all our data across the partners, and that's open for sharing with others as well. And then there's regular communication and outreach. We did get some press today, the Motherboard article and the Library Journal article, so it's been really interesting to have media entities reach out to us.

Can I say one of the challenges? One of the challenges is that some institutions may have political sensitivities around framing this project entirely as a hysteria-driven rescue job due to imminent changes. This project was predicated on the idea that web content does change drastically at every presidential administration change; certainly this upcoming change, given past statements, has even more potential for volatility and immediate disappearance, politically driven ephemerality instead of just benign-neglect ephemerality. But talking to the media about that has been interesting, because I think it's a valid point, but it's also one that informs the project rather than driving it outright. So I think that will be an interesting point for discussion, hopefully. So that's it. Thanks.