I hope everybody can see me here behind this. I'm Yasmine Nomeni, a postdoctoral fellow at the University of California, Berkeley, and today I'm going to talk about how to combine storytelling with web archives. Hopefully we can generate something cool, in the spirit of the conference, out of these boring web archives.

I'm going to start with a story. If you recall, in January 2011 something big happened in Egypt: the Egyptian Revolution. At that time I was here in the USA starting my PhD, but I was able to follow everything that was happening because it all started on Facebook, social media, and the web. So I was able to witness everything that was happening there. My son Yusuf, who is in this picture, was two at the time. He had a very short memory, and he wouldn't remember anything of what was happening. So I started to worry: how is he going to know what happened, the way I was living it then? I wasn't in Egypt, and I was very sad that I wasn't there, but I was able to see all the stories from all the sides, because we had different sides over there, like what's happening here in the election. I won't explain this very much; you understand what I'm talking about.

So I started to think: how can I help him and his generation know what happened when they grow up? Honestly, I started some initiatives myself for archiving material about the Egyptian Revolution, and I also found other initiatives for documenting it. So I was fine with that. Several studies and books appeared afterward and cited these repositories. But these repositories were gone.
Whole repositories of videos and images of what was happening in Tahrir Square and everywhere in Egypt are totally unavailable now on the live web.

How many of you have heard about the Internet Archive? Okay, great. And Archive-It? Okay. For those who don't know, Archive-It is a subscription service from the Internet Archive that allows people to build archived collections that follow a specific theme or cover specific events. For example, there is a collection at Archive-It about the Egyptian Revolution, and luckily I found that some of the repositories that were gone from the live web are archived on Archive-It. So yes, archived collections are important for future generations.

But I know, because some of you have told me this, that using Archive-It, or web archives in general, is not easy. I have actually done a study on this, and I discovered that people don't navigate web archives easily.

So suppose that Yusuf, after some years, knows about Archive-It and goes there and types some terms like "Egypt revolution" to look for something about the Egyptian Revolution. This is the interface of Archive-It, and here I just get a list of collections. He will be presented with different collections about the Egyptian Revolution; there are three or four of them. I'm going to pick the one that seems most relevant based on its description. By the way, some of these collections don't even have a description of what is inside them. So I picked one of them, and then I picked the first URI. Suppose I'm using this after many years, and I wasn't there during the Egyptian Revolution, so I'm going here just to understand what was happening. I click on a page, and now I am presented with another list, this time of dates.
These are the different copies of the URI I chose. Suppose I click any one of them at random, and then I'm presented with this page. To understand what's inside this collection, I actually have to go through all of these pages. So understanding the content of these collections is not easy for regular users.

After I started my PhD, I first tried to apply conventional visualization methods to these collections. But when you have thousands of seed URIs, and thousands of copies of those URIs, and you want to visualize all of them together to understand what's inside the collection, it's really hard. So we had to think about other ways to understand the collection.

We thought about storytelling. Here I'm not talking about stories in the literary sense; I'm talking about the term loosely, as it's used in the context of social media. I actually have stories here for all the previous talks, so this is what I'm talking about. Storytelling has become very popular in social media, and it is used now because of the sheer volume of information on the web. For example, Facebook Look Back can summarize your year in a one-minute video, and there are many services that let you select representative videos or tweets to generate narratives or stories.

But storytelling has limitations. Let me show you. This is an Egyptian Revolution story on a platform called Storify. Storify is one of the most popular storytelling services. It allows people (sorry, can I get the water?) to generate narratives mainly out of social media content, or out of any web pages.
I think most journalists use Storify right now, because it's a very easy framework that lets people drag and drop resources and build big narratives about a specific topic. Suppose this is a story that I'm interested in learning about. I'm browsing the story, I come to one of the URIs in it, and I really want to know more about it. I go to this URI and I get a 404. I'm sure all of us get this most of the time. So this is the problem with storytelling: the resources are not persistent. And web archives are not easy to use.

So we thought: what about combining them to generate stories, but persistent stories? We're going to use storytelling services (thank you) for visualizing the stories that we generate from archived collections.

This is an example of the output we targeted. I went to the Archive-It Egyptian Revolution collection. I know this collection very well, because I have been studying it since it started in 2011, and it's still running now. I tried to select specific pages that tell the story of the 14 days of the Egyptian Revolution, so hopefully my son can find something when he grows up. It took me hours and hours to generate this, going through all the URIs. Even though I have background on the topic and I know the collection very well, it took me hours to select the pages that best represent the story. So how can we generate this automatically? This is the main point of my talk.

Before I take you through the steps of how we generated this automatically, let me first show you the types of stories that can be generated from archived collections. I know that there are many journalists here today, which I didn't expect, but I think this will be interesting for you. An archived collection has two dimensions.
It has a list of URIs, and each URI has copies at different times. The URIs and the times can each be fixed or sliding, and based on this we came up with four types of stories that can be generated.

The first type of story is a fixed page at a fixed time. If I request CNN right now from the browser on my desktop machine, it gives me a different representation of the page than if I request the same page from my mobile. So these are different representations of the same page at the same time. This can also happen with location: if I'm here in the States and I type Google.com, it gives me a different representation than if I'm in the Middle East. Unfortunately, this is not supported right now by web archiving, so we can't really generate this type of story.

The second type is when you fix the page and slide the time. For example, say you want to know everything about the Egyptian Revolution, the key events, but from a specific website. I want to know everything as it appeared on the BBC, because I trust the BBC more than Fox News, for example. Or I want to see how a specific page, like my personal homepage, evolved over the years. So here I specify a page, and it gives me the same page at different times.

The third type is when I slide the page and fix the time. This is very interesting for humanities scholars and historians, because there are many studies that went manually through the newspapers and tried to compare the opinions of different newspapers at different times of the Egyptian Revolution. For example, how did the newspapers react to Mubarak resigning?
Think of the pro-Mubarak newspapers, for instance. There are many case studies showing the importance of this type of story, and it is really important for journalists to compare different newspapers and their coverage of a specific event.

The fourth type is a sliding page at a sliding time. Here I just want the broadest possible coverage of a specific event.

This is the framework that we proposed for generating these four, or rather three, types of stories, until the archives support the first one. The framework has three main components. First, we establish a baseline of what human-generated stories look like. Then we reduce the candidate pool of archived pages. And then we select the best representative pages from these collections. I'll walk you through some of these steps very quickly, because there is a lot of detail, and here I'm citing all the works that have the details so you can go and read them.

First, establishing a baseline from social media stories. We grabbed thousands of stories from Storify, and we defined the popular ones based on their views over the time they had been on the web. Then we measured the lengths of these stories and the types of resources in them, to get a template of how people generate stories on social media. The summary is that 28 pages is a good number for a story: most of the popular stories on Storify have around 28 resources, and people tend to include more images among the resources they collect on Storify.

The second part is to reduce the candidate pool of archived pages. We started by detecting the off-topic pages, because the collections have off-topic pages, and I'm going to show you this.
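As a rough illustration of the two dimensions described in the talk (page and time, each fixed or sliding), here is a minimal Python sketch of the three story types that archives can currently support. The `Collection` structure and all function names are hypothetical, chosen only for illustration; they are not the speaker's code or any archive's API.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical in-memory model of a collection: each seed URI maps to its
# archived copies, given as (capture time, archived-copy identifier) pairs.
Copy = Tuple[datetime, str]
Collection = Dict[str, List[Copy]]


def closest_copy(copies: List[Copy], target: datetime) -> Copy:
    """Return the copy whose capture time is nearest to `target`."""
    return min(copies, key=lambda c: abs(c[0] - target))


def fixed_page_sliding_time(coll: Collection, uri: str,
                            times: List[datetime]) -> List[Copy]:
    """One page followed across several points in time."""
    return [closest_copy(coll[uri], t) for t in times]


def sliding_page_fixed_time(coll: Collection, when: datetime) -> List[Copy]:
    """Many pages, each taken as close as possible to one moment."""
    return [closest_copy(copies, when) for copies in coll.values() if copies]


def sliding_page_sliding_time(coll: Collection,
                              times: List[datetime]) -> List[Copy]:
    """Broad coverage: for each point in time, the nearest copy of any page."""
    all_copies = [c for copies in coll.values() for c in copies]
    return [closest_copy(all_copies, t) for t in times]
```

The fixed page at a fixed time type is left out because, as the talk notes, archives do not yet capture multiple representations of the same page at the same moment.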
This is one of the pages, belonging to one of the most famous figures at the time of the Egyptian Revolution. He was a presidential candidate, and this page is part of the Egyptian Revolution collection at Archive-It. The page has thousands of copies through time, and as you see here, many of those copies are off-topic, for many reasons. For example, it went off-topic because of a database error, then because of financial problems (this is the text in Arabic, and believe me, I'm translating it correctly here), then it went on-topic again, and then it went off-topic again because of hacking or because the domain was lost. So there are many reasons for a page to become off-topic.

Archive-It provides its partners with tools to specify the frequency and the depth of crawling, but there is no control and no notification when pages go off-topic. So we evaluated six methods to discover these off-topic pages and exclude them from the collection, because I'm sure that anyone who wants to see a story wouldn't be happy to be presented with any of these pages. So we exclude them and select only the on-topic pages. There are a lot of details here; you can go to my publication and read more about it, and I'd be happy to talk about it later.

A page can also have many duplicates through time. The crawl frequency can be weekly, daily, or every few minutes, and the content of the page may not change in that time, so there can be duplicates. So we also remove the duplicates. Then we select the best representative pages, and we define "best representative" based on quality metrics.
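To make the pool-reduction step concrete, here is a minimal Python sketch of one plausible approach. It assumes the first copy of a page is on-topic and compares every later copy to it with cosine similarity over word counts, then drops exact text duplicates. The talk does not name the six methods that were evaluated, so the similarity measure, the `threshold` value, and the function names are all illustrative assumptions.

```python
import hashlib
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_copies(texts, threshold=0.15):
    """Return the indices of copies to keep.

    `texts` holds the extracted text of each copy of one page, in capture
    order. The first copy is ASSUMED to be on-topic; `threshold` is an
    illustrative value, not one taken from the talk.
    """
    if not texts:
        return []
    reference = term_vector(texts[0])
    kept, seen = [], set()
    for i, text in enumerate(texts):
        if cosine(reference, term_vector(text)) < threshold:
            continue  # too dissimilar to the first copy: treat as off-topic
        normalized = " ".join(text.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text as an earlier kept copy: duplicate
        seen.add(digest)
        kept.append(i)
    return kept
```

A copy that shows only a database error or a parked-domain notice shares almost no vocabulary with the original page, so its similarity falls near zero and it is dropped.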
We came up with different quality variables. Don't look at the equation; don't worry about it. You don't have to go deeply into it. I'm going to... this is much better.

This is one of the metrics: the quality of the page itself. When pages are crawled by the Internet Archive, many of them end up like the one on the right. I want to ask you: which one do you think is missing resources, the one on the right or the one on the left? The one on the right? No? Okay. The one on the left? Okay. And neither? Okay. Well, you can see this for yourself: the one on the left is much better than the one on the right. This can cause a problem when we put the page into any visualization, because the missing resource can be an image, and if you want to extract this image, it won't show up. So we used techniques for calculating the damage of a page, and based on this we give a higher weight to the one on the left.

We also know that people love visual things. People prefer images, and they prefer a better snippet. We chose Storify to visualize our stories, so we tested the pages and chose the ones that give us better snippets on Storify. For example, the CNN page, because it's a deep URI, gives us a better snippet: Storify succeeded in extracting the image, unlike with the other page.

Now, visualizing the stories on Storify. For this, we extracted the metadata of the pages, and we used the Storify API, which is amazing and very easy to use. We override some of what it extracts automatically, such as the favicons.
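The quality variables mentioned above (page damage, snippet quality, URI depth) could be combined into a single score along the following lines. This is only a sketch: the equation on the slide is not in the transcript, so the `Candidate` fields, the weights, and the linear combination are hypothetical stand-ins rather than the speaker's actual metric.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str
    damage: float            # 0.0 = no missing resources, 1.0 = badly damaged
    has_image_snippet: bool  # did the storytelling service extract an image?
    path_depth: int          # number of path segments in the URI


def quality(c, w_damage=0.5, w_snippet=0.3, w_depth=0.2):
    """Weighted score in [0, 1]; the weights are illustrative assumptions."""
    depth_score = min(c.path_depth, 3) / 3  # deeper URIs score higher, capped
    return (w_damage * (1.0 - c.damage)
            + w_snippet * float(c.has_image_snippet)
            + w_depth * depth_score)


def pick_best(candidates):
    """Choose the highest-quality candidate among the copies in one slice."""
    return max(candidates, key=quality)
```

Under these assumed weights, a lightly damaged article page with an image preview outranks a heavily damaged homepage with no preview, which matches the preference described in the talk.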
We also add the date of each page, because this is also not extracted automatically. So this is an example of an automatically generated story on Storify.

Now, evaluation. This is research, in academia, and you have to evaluate everything you do to be able to publish. You can't just say, "Okay, I generated these stories." This was very hard for us to evaluate. But we thought about the Turing test. If the stories that we generate automatically are indistinguishable from what a human can generate, I think we can say that we succeeded. And we also want both the human-generated and our automatically generated stories to be better than randomly generated stories. These are the things we were targeting.

So we used human evaluators, and we had the help of experts from the Internet Archive and its partners. We asked them to generate stories from collections they already knew, and we gave them the criteria for generating them. I actually went there and showed them how to generate the stories, based on the template we got from the stories we studied on Storify. We were able to get 23 stories, and then we evaluated them based on about 1,000 comparisons made by humans.

And this is... oh, sorry. This is how we presented the stories, two at a time, for example, the one on the left. Actually, I would like to do a small test here. Which one of them, A or B, do you think was generated by humans? A by humans: raise your hand. Okay. B by humans: raise your hand. Okay. Neither? Because I see some people didn't vote. Okay. So actually, story A was generated automatically, and story B was generated by a human. And this is the result of the evaluation: it was about 50/50. And this is actually the result I have seen here in the room. It's almost 50/50.
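One way to read the roughly 50/50 result: if evaluators cannot tell automatic stories from human ones, the share picking either should sit close to one half. The Python sketch below tallies pairwise choices and runs an exact two-sided binomial test against a fair coin. The talk does not say which statistical test, if any, was used, so the test and the function names here are assumptions for illustration only.

```python
from math import comb


def share_preferring_automatic(choices):
    """Fraction of pairwise comparisons in which 'automatic' was chosen.

    `choices` is a list of the strings 'automatic' or 'human'.
    """
    return sum(1 for c in choices if c == "automatic") / len(choices)


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value for the null hypothesis p = 0.5.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Suitable for modest `trials` (around a thousand).
    """
    probs = [comb(trials, k) * 0.5 ** trials for k in range(trials + 1)]
    observed = probs[successes]
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)
```

A large p-value means the observed split is consistent with evaluators guessing, which is the outcome a Turing-test style evaluation hopes for when comparing automatic and human stories.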
It was also great to see that the automatic and the human stories are much better than the random stories.

So this is Yusuf now. Hopefully this is something that he will use when he grows up, and after many years he will thank me for generating this, and he won't struggle to understand what happened during the Egyptian Revolution. So that's it. We have the code, the papers, and the slides. I have put them all together in this blog post, which has all the links to the code and the data sets that we have. And I can take questions now.