All right, I think it's about time to get started. I'm Cliff Lynch, the director of CNI; let me welcome you to the second day of project briefings at the CNI virtual fall member meeting. I'll just say a couple of very brief things and then we'll get started. You have a chat box, which I invite you to use to introduce yourself if you wish, or to make comments as we go along during the session. There's also a question-and-answer tool at the bottom of your screen; please use that to pose questions for our speaker. We'll deal with all the questions at the end, when Diane Goldenberg-Hart from CNI will beam into existence and moderate the Q&A part of the session. I'll also note that closed captioning is available; please use that if it's helpful to you. One other thing I should mention: we are recording this, and the recording will be available through all of our usual mechanisms pretty soon after the session.

With that, let me introduce the topic very briefly. Our speaker today is Shawn Jones. Shawn has multiple affiliations: he's a member of the Los Alamos National Laboratory, where he works with colleagues such as Martin Klein, who is well known to our community, and he's also in the Web Science and Digital Libraries Research Group at Old Dominion University, where he works with Michael Nelson, who will again be well known to members of our community. Shawn is going to talk about a very interesting topic today. The way I'd frame it is that it has become increasingly easy to create web archives and to do crawls; it's really hard to sift through them and to understand what you've got, and Shawn is going to give us an approach to that that I think is quite novel. With that, welcome, thanks for joining us, Shawn, and over to you.

Thanks, Cliff, I appreciate it. Yes, as Cliff said, my name is Shawn Jones, and I work on the Los Alamos National Laboratory Research Library prototyping team. I also work with the Web Science and Digital Libraries Research Group at Old Dominion University. Today I'm going to be talking about summarizing web archives through storytelling with the Dark and Stormy Archives project. This is key to my dissertation research, and I'm happy to share it with you; I really appreciate you all coming out and listening, because I think this is an exciting concept. I'd also like to thank JCDL, SIGIR, and the IMLS for funding parts of this research.

Let's start with the abundance of data in web archives. Web archives contain numerous documents. The Wayback Machine now has 898 billion URLs, as reported by Brewster Kahle on February 5, 2020; that was up 17 billion from the previous month. The UK Web Archive collects millions of websites with billions of individual assets; as of 2017, they had approximately 500 terabytes of data, and it was increasing. I'd also like to highlight one particular Archive-It collection, Government of Canada Publications, which has more than 339,000 results if you just open the collection's page. Mementos are the documents in web archive collections. Web archive collections consist of seeds and crawls of those seeds; the mementos from those crawls are versions of pages from the time of the crawl. That's true whether the collection was created with Conifer (formerly known as Webrecorder), Archive-It, the UK Web Archive, the Internet Archive, the 30 other public web archives that exist, or the many private web archives in the world.
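To make the seed-and-memento model concrete: archives that support the Memento protocol (RFC 7089) publish a TimeMap listing every memento they hold for an original URL. Here is a minimal Python sketch, assuming the Wayback Machine's link-format TimeMap endpoint and the `requests` library; the line parsing is deliberately simplified.

```python
# Sketch: list the mementos a Memento-compliant archive holds for one page.
# Assumes the Wayback Machine's link-format TimeMap endpoint; other archives
# publish TimeMaps at their own URIs.
import requests

def list_mementos(original_url):
    """Fetch a link-format TimeMap and return the URI-Ms it lists."""
    timemap = "http://web.archive.org/web/timemap/link/" + original_url
    response = requests.get(timemap, timeout=30)
    response.raise_for_status()
    urims = []
    for line in response.text.splitlines():
        # Memento entries carry rel="memento" (or "first memento", etc.);
        # each looks like: <URI-M>; rel="memento"; datetime="...",
        if 'memento"' in line:
            urims.append(line.split(";")[0].strip().strip("<>,"))
    return urims

if __name__ == "__main__":
    for urim in list_mementos("http://www.admissions.utah.edu/")[:5]:
        print(urim)
```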
Mementos are the documents in the collections, and web archives keep many versions, many mementos, of the same page; this is a feature. It allows us to do things like view the University of Utah's Office of Admissions over time and see the changes in their organization, or follow the evolution of a movement, as with this Black Lives Matter blog. Different versions, different mementos, also allow us to see an unfolding news story. This particular article, from BostInno, covers the Boston Marathon bombing manhunt. We have a memento from April 19 at 17:12, where the article shows that they're searching for suspects and the city is on lockdown. We have a memento from April 19 at 17:59, where we see that an officer is in the hospital, the lockdown has been loosened, and people are wondering if the Red Sox game is going to be canceled. And we have a memento from April 24, where the suspect has been found, an officer has lost his life, and Obama is going to speak. These different versions are important for historians, first responders, and a variety of other researchers who are trying to understand not only what happened during the event, but what we knew at various points in time.

Tools exist that easily allow users to add content to web archives: in places like Archive-It, we can control crawls, add seeds, and add metadata; in places like Webrecorder, we can manually create mementos from the comfort of our own browser; and the Internet Archive's Save Page Now and archive.today's on-demand web forms allow us to easily add new URLs to existing public web archives. These collections are used by other researchers; they live a life after their curator has stopped adding to them. The collection curator is not the only user of the collection.

Consider Archive-It as an example. There are many collections on the same topic: 31 Archive-It collections match the search query "human rights." If I want to conduct a study, if I'm starting a project on human rights, which one is best for my needs? How are these collections different from each other? Metadata is there, but it doesn't always help, because it's optional. It's not always present, and metadata on Archive-It collections comes from many different curators at different organizations using different content standards and different rules of interpretation, and thus it is inconsistently applied. This means that a user cannot reliably compare metadata fields to understand the differences between collections. We did a study in 2019 where we discovered that the more seeds exist in an Archive-It collection, the less metadata is available. This is paradoxical, because the more seeds exist in a collection, the greater the user's need for metadata. And reviewing mementos manually is costly: some collections have thousands of seeds, each seed can have many mementos, and in some cases this can require going through hundreds of thousands of documents to truly understand the collection. In addition to more mementos in web archives, more archived collections are added every year; there are more than 15,000 collections at Archive-It. The problem is that there are multiple collections about the same concept, the metadata for each collection is nonexistent or inconsistently applied, and many collections have thousands of seeds with multiple mementos. Thus you get this multiplier effect of many collections, many seeds, many documents that one must go through.
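The question of "what we knew at various points in time" maps directly onto the Memento protocol's datetime negotiation: a TimeGate, given a preferred moment, redirects to the closest memento it knows about. A hedged sketch, assuming the Wayback Machine's TimeGate endpoint:

```python
# Sketch: ask a Memento TimeGate for the version of a page closest to a
# given moment (RFC 7089 datetime negotiation). Assumes the Wayback
# Machine's TimeGate at http://web.archive.org/web/<URI-R>.
import requests

def memento_near(original_url, when):
    # 'when' is an RFC 1123 datetime, e.g. "Fri, 19 Apr 2013 17:12:00 GMT"
    timegate = "http://web.archive.org/web/" + original_url
    response = requests.get(
        timegate,
        headers={"Accept-Datetime": when},
        allow_redirects=True,
        timeout=30,
    )
    # The TimeGate redirects to the best-matching memento (the URI-M);
    # the Memento-Datetime header reports when that copy was captured.
    return response.url, response.headers.get("Memento-Datetime")

urim, captured = memento_near("http://www.boston.com/",
                              "Fri, 19 Apr 2013 17:12:00 GMT")
print(captured, urim)
```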
Human review of these collections, of these mementos, for collection understanding is thus prohibitively expensive. We can look at existing solutions and see what people have done. Ideally, a user would be able to glance at a visualization and gain an understanding of the collection: take a web archive collection, pipe it through an automated solution, and produce a visualization that conveys understanding at a glance. But techniques that apply to every page are hard to scale up, and they're difficult to understand. Now, my first thought on this was timelines: just create a timeline. We found that timelines provide an overview of the amount of time covered and where things are, but they do not provide understanding at a glance; users still have to drill through the timelines to look at the individual mementos in some form. Thumbnails work well for small collections or for studying changes in an individual page, but they do not scale, and they are often hard to read because they're too small. But what if we could apply a visualization technique that users already know how to process? Users would require no training and could instead get information immediately; they could quickly decide which collection meets their needs. And that brings us to social media storytelling.

Web surrogates provide a visual summary of a web resource, drawn from the content of the resource. We have an example here: Dr. Michele Weigle's homepage as returned by Bing as a search result. It has a title and text drawn from the content of the resource. Another form of surrogate is the browser thumbnail: screenshots of how these mementos looked at the UK Web Archive. And on the bottom right we have social cards, this example from Facebook. Social cards are summaries of web resources. We have a URL like this one here, and then we represent the same URL as a social card; suddenly it's not just opaque, and we can see that it provides directions from Old Dominion University to Los Alamos National Laboratory. In addition, social cards have titles, text, and images, and they carry attribution to where the card's content came from, thus summarizing the underlying resource. We use these cards all the time in Facebook, Twitter, and Tumblr; this is the same page represented in five different services, and I haven't even brought in other things like Apple Messages.

Social media storytelling uses groups of social cards to provide a summary of summaries: each social card summarizes a web resource, and each story groups the social cards, summarizing a topic. Social cards contain the same information in the same place on each card, and this is neat: it allows people looking at the story to compare images, compare titles, compare sources. We want to use this technique to summarize web archive collections, because users are already familiar with this visualization paradigm. Social media storytelling uses an interface that users already understand how to process; users require no training, instead getting information immediately, and they can quickly decide which collection meets their needs. So our solution is to apply social media storytelling to web archives, moving from something like this, an archival collection about the Boston Marathon bombing consisting of 2,400 mementos, to a social media story of around 28 cards describing the same thing, but drawn from the collection.
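Because every card presents the same fields in the same places, a card can be modeled as a small, uniform record, and the side-by-side comparison property falls out of that uniformity. A minimal sketch; the field names and sample values are illustrative, not any particular service's schema:

```python
# Sketch: the anatomy of a social card as a small record. Field names are
# illustrative; each service (Facebook, Twitter, Embedly) has its own schema.
from dataclasses import dataclass

@dataclass
class SocialCard:
    title: str         # from the resource's content
    snippet: str       # representative text drawn from the resource
    image_url: str     # representative image drawn from the resource
    source: str        # attribution: who published the content
    resource_url: str  # where the card points

cards = [
    SocialCard("Directions to Los Alamos", "Take I-25 N ...",
               "https://example.org/map.png", "example.org",
               "https://example.org/directions"),
]
# Because every card has the same fields in the same place, a story
# (a list of cards) supports side-by-side comparison of images,
# titles, and sources.
for card in sorted(cards, key=lambda c: c.source):
    print(card.source, "|", card.title)
```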
This storytelling with web archives has multiple use cases. Archivists could promote their collections, making others aware of them. Individuals could explore aspects of collections: they can explore a collection and expose specific sides of a news story, or focus on particular people or places that exist in web archives. And finally, summarization: because web archive collections are too large for manual review, we need a summary to understand what they contain.

But some services equate stories with ephemera, which is not what we want. Facebook, Instagram, and Snapchat all delete user stories after 24 hours; this is the opposite of archiving. Existing social platforms do not work well with mementos. We took mementos and tried to put them into a Twitter Moment, as you see on the left, and Twitter did not produce cards for any of them. Twitter may occasionally produce cards for some mementos, but it is not reliable in producing these summaries, and Facebook, as we show here, is also not reliable in producing these cards. Existing card services create confusing experiences for mementos. We took a memento from an Archive-It collection about the Egyptian Revolution and rendered it here with Embedly, a card creation service. Embedly looked at the underlying content, pulled out a very beautiful image about the Egyptian Revolution, extracted a wonderful text snippet describing the underlying content, and was able to incorporate the title. But this is confusing, because the card simultaneously attributes the content to CNN and to Archive-It, and this is a matter of trust, because different sources are attributed. So current social media storytelling solutions don't work with archival content: our stories should not be temporary, and our summaries should not present users with confusing attribution. Archived collections are vast and hard to understand, understanding them is difficult with existing visualizations, and existing social media storytelling does a poor job with archived pages.

We can solve this problem with the Dark and Stormy Archives (DSA) project. Storytelling with web archives has three basic steps: one, select the pages for your story; two, gather information to summarize each memento; and three, summarize all mementos together and publish the story. Now, if you remember, neither social media services nor card services like Embedly were reliable for web archive storytelling, so we created MementoEmbed. MementoEmbed is an archive-aware surrogate service: it is able to separate the information about the archive from the information about the underlying content. Thus we can take that memento about the Egyptian Revolution in Archive-It, put it through MementoEmbed, and produce a card like this, which contains information from the memento's content (the title, text, and image, as is present on so many other social cards) but also provides correct attribution to CNN and tells us that it was preserved at Archive-It. And because it's archive-aware, it can pull in other information: that this particular memento is a member of the collection "Egypt Revolution and Politics," and that it was captured in February 2011. It can also provide links to other resources, such as other mementos for this resource, and link to the current version of the resource if it still exists. And we can summarize individual pages with MementoEmbed in several different ways.
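Part of what "archive-aware" means is that a memento's identifier already interleaves the two sources of information: a Wayback-style URI-M embeds both the capture timestamp and the original URL. The sketch below separates them; it is a simplification of what MementoEmbed actually does (the real service also consults Memento HTTP headers and archive APIs), and the example URI-M is illustrative.

```python
# Sketch: recover the original resource (URI-R) and capture time from a
# Wayback-style URI-M. Real archive-aware tools also use the Memento
# protocol's HTTP headers rather than relying on URI structure alone.
import re
from datetime import datetime

URIM_PATTERN = re.compile(r"^(https?://[^/]+)/.*?/(\d{14})(?:[a-z_]{2,3})?/(.+)$")

def split_urim(urim):
    match = URIM_PATTERN.match(urim)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + urim)
    archive_host, timestamp, uri_r = match.groups()
    memento_datetime = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return archive_host, memento_datetime, uri_r

archive, captured, original = split_urim(
    "https://wayback.archive-it.org/2358/20110211072257/"
    "http://edition.cnn.com/2011/WORLD/africa/02/10/egypt.protests/"
)
print("preserved by:", archive)             # attribution for the archive
print("captured:", captured)                # February 2011
print("original content from:", original)   # attribution for CNN
```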
MementoEmbed provides cards like the ones I showed, but it can also produce browser thumbnails and word clouds. And we're experimenting with new types of surrogates, like the image reel, where MementoEmbed has gone through the images of the page, analyzed them, scored them, and ranked them; this shows the top five, and it produces an animated GIF suitable for sharing on Twitter or another place where the amount of content you can post is limited. We extended the idea of the image reel to the doc reel, where in addition to extracting the top five images, we also show the top five sentences. Now, image reels and doc reels are still experimental.

So we can produce cards with MementoEmbed, but how do we create stories? We created Raintale to tell stories with mementos. It publishes to file formats like HTML or to social media services like Twitter, and it supports templates for custom output. We can publish many mementos with Raintale in a variety of formats, such as the Twitter threads I mentioned on the previous slide, or GitHub pages with cards, and so much more, or videos, as you see on the right; this is another experiment, where we're extending the doc reel concept to entire collections, with the top-ranked images and sentences, suitable for YouTube or Twitter. Raintale uses templates to let you customize the look and the destination media of your story: whether you want to put MementoEmbed cards into Blogger, customize the Twitter threads it outputs, produce output for MediaWiki pages, or brand your own stories for your institution. Because we already had to develop Raintale for research projects and to explore different visualization types, it was not that difficult to make it so that you could brand it your own way; this was always part of Raintale's design. You can include thumbnails in your stories, or not; you can include various images, perhaps the second-ranked image as well, or the fourth; you can include titles, if that works best, and leave out snippets; and of course you can include attribution.

So now we can create stories with Raintale, but how do we select mementos? And thus we get to Hypercane. Hypercane provides intelligent sampling of web archive collections. It works with Memento-compliant archives to discover mementos, but it can also create new mementos if necessary. It also reports on collection metadata, named entities, terms, image analysis, and more. And it provides summarization steps like clustering, filtering, scoring, and ordering. Thus we can move from a web archive collection consisting of thousands or millions of documents to an intelligent sample, a manageable number. The DSA toolkit provides a solution for each stage of the storytelling lifecycle: one can select pages for the story automatically with Hypercane, gather information to summarize each memento with MementoEmbed, and summarize all mementos together and publish the story with Raintale.

And here are some examples of the DSA toolkit at work. We automatically summarize public Archive-It collections, and this is where many of our experiments lie. We can move from something like this on the left, an Archive-It collection about the novel coronavirus, about COVID-19, that consists of more than 23,000 mementos, run it through the process, and produce this: a sample of 36 mementos visualized as cards, phrases, and images.
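To give a feel for what clustering, filtering, scoring, and ordering look like in practice, here is a toy sampling pipeline in that spirit. This is not Hypercane's code, and its real algorithms (such as AlNoamany's) are more sophisticated; the point is the shape of the computation, where each step takes a list of mementos and returns a smaller or reordered list.

```python
# Toy sketch of an intelligent-sampling pipeline in the spirit of Hypercane:
# each step consumes and produces a list of memento records, so steps
# compose in any order. Not Hypercane's actual algorithms.
from collections import defaultdict

def filter_thin(mementos, min_words=50):
    # Drop near-empty captures that cannot represent the collection.
    return [m for m in mementos if m["word_count"] >= min_words]

def cluster_by_week(mementos):
    # Group captures by ISO (year, week) so the sample spans the crawl period.
    clusters = defaultdict(list)
    for m in mementos:
        clusters[m["memento_datetime"].isocalendar()[:2]].append(m)
    return clusters

def score(m):
    # Toy quality score: favor text-rich captures with images.
    return m["word_count"] + 100 * m["image_count"]

def sample_story(mementos, k_per_cluster=1):
    story = []
    for _, members in sorted(cluster_by_week(filter_thin(mementos)).items()):
        story.extend(sorted(members, key=score, reverse=True)[:k_per_cluster])
    # Order chronologically so the story reads as an unfolding narrative.
    return sorted(story, key=lambda m: m["memento_datetime"])
```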
On the left, we have the COVID-19 collection with the metadata that's been painstakingly provided by the archivists. On the right, we have people in masks, pictures of the virus, and maps showing the virus's spread across the world, all drawn from the content that was already in the collection. And if users want to explore the collection further, we of course link back to it, so they can look at this summary and then search through the Archive-It collection for more content. With the DSA toolkit we can also compare Archive-It collections about mass shootings, such as the one at Virginia Tech against the ones about Norway or El Paso. With these summaries, we get pictures of the victims, the perpetrators, and the places, and of course the sources these documents come from, the sources that exist in the underlying collection, along with links back to that collection so people can explore it further.

But the DSA toolkit is not just for Archive-It collections; we're very concerned about interoperability, and the toolkit works with Memento-compliant archives. We can do other things, like automatically producing daily news summaries. We can extract news articles with tools like StoryGraph, feed them through the process, and produce this: a summary, pulled from the mementos created that day, of the biggest news story of the day. In this particular case, it was the death of civil rights and voting rights activist John Lewis; we see the people who surrounded him, the people who were concerned about him, the people who may carry on his legacy. We see the sources these articles come from, and of course it's all available at archive.org, linking back to the original content. This allows us to review news from the same day in different years: on August 8, 2017, the news was consumed by nuclear provocation from North Korea; on August 8, 2018, the biggest news story was about US primaries and special elections; and one year later, we were reeling from the aftermath of the shootings at El Paso and Dayton. Someone could come back, explore this, see what people were talking about on a given day, get to the mementos, find more information, and continue their research.

We can also automatically summarize Wikipedia references. We can take a Wikipedia page whose references are mementos, thanks to a partnership between the Internet Archive and Wikipedia, run it through the process, and produce this: a story from a sample of the references, providing insight that was not visible in the article itself. Someone gains additional insight, can still get back to the Wikipedia article to read it, and can see what the references are saying and follow them to their mementos. And we can summarize a scholar's grey literature from the web archive collections of the Scholarly Orphans project, previously presented at CNI. We can move from 1,000 mementos of grey literature from the Scholarly Orphans project, feed them through the process, and produce 20 mementos representing a scholar's work, so we can see what the scholar has been doing, the different places where they've put content, and possibly colleagues and other information about their activities while this was being recorded by the web archive.
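As an illustration of the Wikipedia use case: finding an article's memento references is straightforward, because rescued citations point at the Wayback Machine. A hedged sketch using `requests` and `BeautifulSoup`; the `ol.references` selector and the web.archive.org filter are simplifying assumptions (Wikipedia also cites other archives):

```python
# Sketch: collect the Wayback Machine mementos cited by a Wikipedia article.
# A list like this is exactly the kind of input a storytelling pipeline
# needs. The filter is a simplification: it keeps only web.archive.org
# links, though Wikipedia cites other archives too.
import requests
from bs4 import BeautifulSoup

def memento_references(article_url):
    html = requests.get(article_url, timeout=30,
                        headers={"User-Agent": "dsa-example/0.1"}).text
    soup = BeautifulSoup(html, "html.parser")
    urims = []
    for anchor in soup.select("ol.references a[href]"):
        href = anchor["href"]
        if "web.archive.org/web/" in href and href not in urims:
            urims.append(href)
    return urims

refs = memento_references("https://en.wikipedia.org/wiki/Boston_Marathon_bombing")
print(len(refs), "memento references found")
```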
And you can automatically summarize your own web archive collections: because we had to make all these tools flexible for research projects, it was no work at all to make it so that you can add your logos, your branding, your content, and make your own stories. For more information on the Dark and Stormy Archives project, we have a variety of websites. The main website provides an overview of the project and the research behind it, and links to the other resources as well. We have example visualizations at the DSA Puddles website. Hypercane is used for intelligently sampling mementos, Raintale is used to tell stories with many mementos, and MementoEmbed summarizes individual mementos. I encourage you to follow some of these links and let us know what you think. And with that, I will open myself up for questions. Thank you for listening.

Thank you, Shawn. What an exciting project: really fascinating, and very hopeful for those of us who wish to make use of these vast collections. Really terrific work. Thank you so much for sharing it with us here at CNI, and thank you also to our attendees for taking time out of your day to join us here at the fall 2020 membership meeting. The floor is open for questions, so please do share questions or comments with us using the Q&A tool. While we're waiting for questions to come in, Shawn, my question to you is: how accessible are these tools? Do you need special skills to use them? I was looking at the website, which I chatted out to our attendees, and I see that all the tools are available there, but is this something an average person could pick up and start using?

Well, MementoEmbed is relatively straightforward: we have a demo website where one can just put in the URL for a memento, a URI-M, and it will generate a card or whatever surrogate type you choose. Raintale and Hypercane are command-line applications, largely because we were primarily looking to allow archives to incorporate them into their own workflows, and that seemed like the best way. It was also easier for us, right? Writing command-line applications is much simpler than creating GUIs. So that's where they are right now. We are, of course, open to funding and other things to try to improve this. But for right now, yes, they are command-line applications. I'm providing documentation for Hypercane, and I have pretty extensive (I don't know, you can tell me if it's good or not) documentation on how to use Raintale. There is some complexity there. Raintale is definitely the easier to use, but it's also older. A user interface for end users would be nice; we just haven't gotten that far.

Okay, got it. Thank you very much. Really interesting. And I guess that's part of the future plans for the tools? Yeah, definitely. Like I said, the short path I've been taking has been to make these useful for our research, and of course for my dissertation, but yes, the future for these tools is to provide user interfaces. We also want to know what works best. We've been summarizing collections like these Archive-It collections with an algorithm developed by a former ODU student, Yasmin AlNoamany. She created this algorithm, we implemented it in Hypercane, and you can easily execute it from the command line on an Archive-It collection.
You can also execute it on other collections for which you have lists of mementos. We're exploring other algorithms that may work even better for certain types of collections. One of the things we discovered in some other research was that there are different types of Archive-It collections, and this is probably the case with other tools as well: the majority of people are archiving their own content rather than creating collections about events. Only 4.2% of collections are about things like the El Paso shooting or the Japanese earthquake. So we're exploring additional Hypercane algorithms that could be useful to people, but we already have some in there for people to use today.

Great. Interesting. Thanks. Cliff has a question now, so I'm going to hand it over to Cliff. All right. This is really fascinating, and I was very interested when you pointed out that these were command-line applications so that they could be integrated into the workflows involved in capturing and curating web archives; I totally understand that, given the scale of the activities that take place here. What I'm trying to get a handle on is that it looks to me like these do sort of automatic story extraction from a given archive, and I can easily envision scenarios where a curator will want to steer that automatic extraction in various ways. Do you have any thoughts on that? How realistic is that scenario, and to what extent is it readily accommodated here?

Yes, yes. Where Raintale allows you to customize how things will look in the end, Hypercane has complexity that allows an archivist, anyone who's creating a story, to customize the selection. You don't have to use AlNoamany's algorithm. You can do a simple search with text matching: say you want everything that contains "Barack Obama" in the Boston Marathon bombing collection. You can do that step first and then feed those mementos through other steps in the process. And it's intentionally complex, because it allows one to do all of these different things: preprocessing, removing duplicates, finding off-topic items (maybe the off-topic items in the collection are what's interesting to you). All of these are things Hypercane can do; you just have to know which steps to run, and you can run them in whatever order you want. We also made sure that Hypercane can incorporate its own output into other steps. Hypercane mostly works with lists of URI-Ms: if you run one step, you get another list of URI-Ms that you then run through another step, and you can reduce it or expand it depending on what that step needs. So yes, there is a high level of customization.

Okay, thanks for clarifying that for me. Appreciate it. Shawn, I just wanted to let you know, Michael Nelson just chatted in that Raintale will also take a manually created list of mementos too. Yes, yes, that's true. And part of that's because Hypercane outputs a list of mementos. I didn't mean to overlook that, but yes, that is how Raintale primarily works. This is Michael. To customize Raintale even further, to Cliff's point: if an archivist or somebody said, you know, these are the 30, 50, whatever pages I want to make sure show up in the story,
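That "list of URI-Ms in, list of URI-Ms out" contract is what makes the steps composable in any order, and it is easy to picture as plain functions over files. A toy sketch of the contract; the file names and the keyword filter are illustrative, not Hypercane's actual command-line interface:

```python
# Toy sketch of the "lists of URI-Ms in, lists of URI-Ms out" contract
# that makes sampling steps composable in any order. File names and the
# keyword filter are illustrative; this is not Hypercane's CLI.

def read_urims(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def write_urims(path, urims):
    with open(path, "w") as f:
        f.write("\n".join(urims) + "\n")

def keyword_step(urims, fetch_text, keyword="Barack Obama"):
    # Keep only mementos whose text mentions the keyword; the result is
    # itself a list of URI-Ms, ready for the next step (dedupe, cluster...).
    return [u for u in urims if keyword in fetch_text(u)]

# Each step narrows (or reorders) the list, and any step can come first:
#   urims = read_urims("boston-collection.txt")
#   write_urims("obama-subset.txt", keyword_step(urims, fetch_text=my_fetcher))
```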
Raintale would just take that as an output, or excuse me, as an input. And Shawn, to further stress that: the first step here, selecting pages for the story, can be done automatically with Hypercane, or by people. So if there are some mementos that you really, really want to include, just throw them in the list. I'll also point out that Yasmin's work from a couple of years ago proved this: comparing manually generated summaries with automatically generated summaries, people could not tell the difference. So essentially the algorithm does as well as the domain experts at down-sampling from, you know, thousands of pages to 25 or whatever the number was. They showed the summaries to human reviewers, who could not distinguish the computer-generated summaries from the human-generated ones, but who could distinguish both from randomly generated stories.

Thank you. Thanks, Shawn; thanks, Michael, for adding a little bit of context to that. And my apologies, Shawn, that I didn't give you a heads-up that I had unmuted Michael there, so he just sort of popped in out of nowhere. I do see that we're past the hour now, and I want to be mindful of that. I appreciate this spirited discussion, and I want to invite any attendees who have the time to stick around to please do so. I will turn off the recording now, and we'll end the public portion of this presentation, but if you have more questions for Shawn, please stay on. Thank you again, Shawn, for a really wonderful talk on a fascinating topic, and thanks again to all of our attendees; we look forward to seeing you again at the CNI fall meeting. Take care, everyone.