Hi, this is Edward McCain. Welcome to our presentation, Endangered but Not Too Late: The State of Born-Digital News Preservation. I'm part of the research team that completed this report, which we released in April. I work at the University of Missouri, in a dual position between the Donald W. Reynolds Journalism Institute and the University of Missouri Libraries. One of my main goals is to preserve born-digital news content: not just digitized material, but content that was originally created in digital formats. Until lately, I don't think a lot of attention has been given to it. So that's me. We also have another member of our research team with us today, Neil Mara. Neil, do you want to tell us a little bit about how you got mixed up in this research? Sure thing, Edward. And hi, everybody. My name is Neil Mara. I've been a journalist in the industry for 40 years. I've been a consultant and researcher on this project as well as a fellow at the Reynolds Journalism Institute. My background includes a lot of experience with digital and print publishing workflows, technologies, asset management, archiving and preservation. Most of that was with McClatchy. I led a group that integrated digital workflows for McClatchy's 31 newsrooms, and in the process I took on the task of preserving the content that had accumulated at all of those newsrooms. So that's really where I got a lot of experience. And then I got to meet Edward and got involved in this project. And we're really glad that you did, Neil. All right, we're going to get on with it. First of all, what's the problem? What is the state of our digital news preservation? The problem is this. When we shifted from analog to digital, the news industry was focused on many other things and didn't really address the issues that come with digital preservation. So we say it forgot to ensure preservation.
I think it was partly that they forgot, but there are also a lot of other issues that we identify and talk about in the report. You probably know right off the bat, if you know anything about technology, that disk drives are physical objects. They don't last forever. Databases can be corrupted. There are a number of different ways things can go wrong. Backups don't always happen the way we had hoped they would. Sometimes they don't happen. Sometimes they happen differently. Link rot is a huge problem when trying to access information on the web over time. And then metadata: even basic metadata can get lost or break down over time, and that makes it pretty much impossible to find a needle in a haystack. Neil, anything you want to add there? Yeah, well, these are issues that I think many people in this organization are familiar with. We're talking here about the news media because of its importance, and we'll talk about that in a minute, but this really is a wider problem, a society-wide problem that we've just begun to grapple with. But our focus here is on news as part of the public record. Right. Vicki McCargar, in one of her reports back in the early 2000s, said this, and I really like it: digital preservation is like life support. Once you start trying to preserve something digitally, you really can't let up. You have to keep it monitored. You have to keep migrating it to formats that you can access. It's a very long-term and intensive activity, much more so, I would say, than the analog preservation process. So as Neil was saying earlier, we think it's really important that news content gets preserved, and digital news content is the primary form these days. It's part of our democracy. We often call it the first draft of history.
It's a fundamental accountability mechanism for government, for large institutions, for corporations, and so on. It's an antidote to fake news, which erodes trust not only in the news industry but in other institutions as well, and in society as a whole. And it's a primary source of information for journalists, certainly, but also part of the scholarly record, for scholars as they try to discern what's been happening in our society over time. There are commercial uses as well. It's valuable for a variety of purposes in terms of revenue streams. And it's part of the public record. This is often our community history: the best source of that for a particular community can be the news record, the news content of that area. Neil? I just want to emphasize that last point, Edward. The record of the past and what's happened, as reported in local news organization channels, is really unique in every community. And that's why we think it's so important that this news be preserved and that we figure out new ways to preserve content now that it's all moving in the digital direction. So briefly, and people are probably pretty familiar with this: in the analog world, archiving was often basically grabbing a copy, keeping it cool, dry and dark, and keeping bugs away from it. The timeline we were looking at was decades or, in some cases, centuries. Benign neglect, although not optimal, was often tolerated or accepted as part of the deal for archiving content. And this goes for news content as well. So for instance, when preserving a newspaper or a reel of film or some audio tape, there was time. It wasn't necessarily urgent that it get refreshed every year or even every decade. And long-term access is a big thing. You can open a newspaper from 100 years ago and access that content. I don't think we're going to be able to say the same for digital content.
Even in a decade or a couple of decades, we can lose access to digital content pretty darn quick. I think that's a really good point there, Edward. And in the case of the image on this slide, what we're seeing is a picture of a film can at GBH in Boston, taken when we visited. As everybody says, you can open up that can, pull up a frame of that film and see what was in it. Do the equivalent today with a disk drive, and it's completely indecipherable. It's entirely dependent on the software that reads that data. Right. So in terms of digital, I'm going to switch over and let Neil take the lead on these slides. How is digital different? So these are just some examples of things that we know and that we've learned through our research. Digital is different in so many ways, in its very fundamental nature but also in the way that it's accessed. Subscriptions are no longer ownership; a subscription is really just a license to access. You don't actually own something and hold it physically. The news cycle has changed entirely from the past. It's a 24/7 operation, and news is constantly being updated. There are no distinct editions. There are no demarcation lines like there used to be in the print world, or in the broadcast world when a broadcast was done. There are so many channels in the digital environment, they're proliferating, and we expect them to continue to proliferate, making this so much more complex. The very atomic unit of news, you might say, is changing. It's becoming much more granular than the article or the page or the newspaper or the broadcast of the past. And that's what makes the linkages between elements so critical. We now have many new digital news startups. This is a great trend in the news industry these days: new ways of delivering news, new kinds of organizations, some of them nonprofit.
But what we're seeing is that the process of creating news in many of these organizations is not taking preservation into account, and in some ways is making the problem worse. And then we have issues with the memory institutions, which were the backstop and in many ways the kinds of organizations that helped the industry preserve its content in the past. That backstop is no longer as reliable because of resource issues, both in the memory institution field and in newsrooms. Yeah, I was going to say we were kind of surprised by the fact that the digital startups were not preserving as well as the older, more established organizations. But we'll get into that a little bit later. That could largely be a matter of culture. Yeah. Like you say, it was not what we expected. Right. Well, let's go back to the study to give a little more background on that. So this was, again, the Reynolds Journalism Institute, the University of Missouri Journalism School and the university libraries here. We received a generous grant from the Andrew W. Mellon Foundation that allowed us to do this research. And just a little bit of background: we had a significant loss of news content back in 2002, which kind of got this whole thing started and got people at the journalism school very upset, because about 15 years of mid-Missouri news content and cultural history in digital formats was lost. But it got people thinking, and they said, we really need to address this. We need to raise the profile of this issue, because this is going to happen again. So in the study itself, we wanted to look at the news industry. We got into a lot of newsrooms. We talked to a total of, well, probably more than 115 but around that number, across 40-some organizations. At the beginning, which was pre-COVID, we were visiting newsrooms around the country, talking to people and seeing how the organizations were functioning.
And then COVID hit, so we had to switch gears and did a lot of the rest of the research by video conference and that sort of thing. I should also mention that we talked with some memory institutions that are involved in preserving news content, including born-digital. And then we talked with some of the technology firms that produce the content management systems and other parts of news technology, which was really helpful. They're very interested in what's going on as well. So in total, we spent about 18 months, from 2019 to 2021, researching this and pulling together this report. And I would just emphasize the piece Edward mentioned: quite a bit of this information was gathered on site at news organizations and newsrooms, along with memory institutions. That was extremely valuable, and we really thank the people who were involved. They're busy people with a lot going on, but they're also interested in the long-term survival of the products they're producing, the news content. So, what we found. We found that there was interest, as we said, in the news organizations. All of them wanted to do something to preserve news content. They often didn't know what to do or how to do it. And as a result, there are some big chunks of news content that still today are not being preserved. And this is across channels. As we talked about, there are so many digital channels and that number is expanding: websites, mobile apps, certainly images, video, audio, interactive elements, special projects, and then social media, which is a really huge problem. Yeah. In fact, social media, I think, is a good example of what we found. News organizations rely on social media to get the news out. Many readers and consumers rely on social media to receive news and to direct them to news.
But of all the news organizations we spoke with, interviewed and met with, only one was doing anything significant to preserve its social media content. And then in terms of our findings, we had a lot of information to convey from our research, but we broke it down essentially into these four areas: content, tech, practices and mission. Neil, would you elaborate a little more on what we found in these areas? Sure thing, Edward. So in the content area, one of the key things we found, and this was again kind of a surprise, was that the public media organizations we met with and interviewed were in general doing a better job than private news organizations. Although, of course, private news organizations far outnumber the public organizations. By public media, we're talking about, say, NPR stations, PBS stations, those kinds of enterprises. In technology, we found a couple of very troubling trends. One of them is that because of the financial difficulties news organizations are facing these days just to stay afloat, especially in the former newspaper industry, we're seeing cutbacks across the board, not just in reporters and editors but in support areas such as technology staff and systems. We saw a number of cases where systems used for preservation are simply no longer there because an organization couldn't afford to keep them going or to keep the staff to run them. In addition, we saw this largely in the absence of the news librarians that we used to see in news organizations across America. That was a very troubling trend, and we'll talk more about it soon. As far as practices go, what we saw was that almost no matter what kind of organization you're talking about, there was an overreliance on the web CMS as a platform for preservation, kind of an assumption that the web CMS was sufficient to preserve news content.
When in fact, as many of you know, a web CMS is not designed for preservation. It's designed for delivery on the web. It's designed for speed and performance. And it achieves that by reducing the quality of content, especially visual content: the size and quality of images, video, sound, etc. So this is a big issue. As far as mission, we found that very few news organizations actually had any kind of policies or written practices concerning preservation. And this is one of the clear distinctions we found, because many of the public media organizations did in fact have clear policies. That links back to the first finding, public media doing a better job, and we think a significant reason for that is these preservation policies. And I just want to reiterate what Neil was saying about content management systems. Over and over again, we would hear, that is the archive. We were like, well, that's not really designed to do that. So, challenges. Neil, you can go ahead. So we wanted to dive a little deeper into some of the preservation challenges that emerged through this research. Many of them actually stem from this first point, that the transition to digital for the news industry has meant this multiplicity of products and channels, kind of an explosion, you might say, of outlets and streams of various kinds, from the web to mobile to social to all sorts of news feeds. I mean, for heaven's sake, you can even see news on the screen on your gas pump. So this is one of the major factors that makes technology so much more central to the issue and to the process, and makes the changes in the structure of news itself far more important. What we found is that this whole question of the links between news elements is much more critical than it ever was in the past.
When you had a single product and everything was assembled, and you could see an entire package of an article or a page or an edition or a broadcast, it was a finite thing. But that's not the case anymore. Now it's the things that connect the photo to the story on the web or in the mobile app or in the news stream, or connect the video to the story. Those things become absolutely essential. And we found that in most cases, those connecting links are not being preserved. As for the news cycle, we talked about this: the lack of distinct editions makes it very difficult to standardize preservation processes. And then the loss of newsroom librarians. This was really a striking finding. In many cases, especially in the newspaper industry but in other types of news organizations too, there used to be news librarians everywhere. These were the experts. These were the preservation people. These were the ones who cared about this and were in the room when decisions were made. And in many cases, they're not there anymore. I'll give you one example. At McClatchy, the company I worked with for many years, there used to be anywhere from a handful up to several dozen librarians in some of the larger newsrooms, in Miami or Sacramento or Kansas City. Now, across all 31 newsrooms in the entire organization, there's only one news librarian, at the Miami Herald. These are all results of the struggles in the industry. We're just seeing impacts left and right from the struggles of the news industry and the change in business models. So again, diving deeper into this, we see this proliferation of news content types. Each of these has unique challenges in structure and technology, in publishing systems and broadcast systems, and those have to be taken into account for preservation purposes.
And one of the things we saw as a way of dealing with this is an expansion in the number of systems that are part of the publishing and broadcasting process. So again, a proliferation of systems, often systems that are specific to one particular channel, a video channel, an audio channel, podcasts, for example, not just the web and mobile apps. This fragmentation has made the problem much more difficult, because you have different things going on in different systems, all drawing from, say, an original content source. This point about the de-emphasis of print is really critical, and it's something we want to draw attention to, because many of the practices of the past that we still see in place rely on print editions as the archive method of record for newspaper content. What's happened in the digital era is that there's a gap that has just continued to grow between the amount of content that is published digitally and the amount that makes it into print. At the best news organizations that still have healthy print products, print may carry maybe 70 or 80 percent of what's published digitally. But at others, it can be as little as 40 or 50 percent. So that has a big impact on processes that rely on print. What it means is, for example, that if the e-edition version of a print product is being used as a preservation mechanism, it's going to miss a lot of content that was never published in that print product. So these things are very important. We should probably clarify that this e-edition is kind of like a PDF, or a variation of a PDF. And that PDF is derived from the print edition, not from what's online, not from what's on the web.
So in many cases, at the Library of Congress and other memory institutions, what's being preserved is the PDF of the print edition, or some version of it, which is not a complete record of what the news organization was really publishing. It's just a portion of it. Sorry. No, good point, Edward. One other thing we want to talk about here is this last bullet on system migrations. Because of the complexities we're touching on here in workflows and systems and metadata, what we saw in talking with newsrooms is that many of them report significant issues with the loss of content during migrations from one system to another. And in this time of upheaval, with growing numbers of channels and systems, this kind of migration problem proliferates. It's happening everywhere. Translating content from one system, comprised of various structures and databases, to another often results in the loss of content, the loss of parts of the content, or the loss of metadata. It's a major red flag for news organizations to be careful in those transition periods. Yeah. I just want to give one example of those content losses. We saw a situation where systems were being migrated and they decided that any publication date prior to a certain point, 2008 or something like that, was just going to be ignored. Well, if you don't know when something was published, it really diminishes the information quality. It makes it really difficult to figure out when it was published and when it was created. Right. So, recommendations. We had three general areas. The first one was immediate actions. We tried to pull some ideas together for how news organizations could do this without spending a lot of money or a lot of time, things they can just jump right into.
So Neil, you can elaborate on at least some of what we suggest for immediate actions. Right. So here, a preservation policy is really the number one item, because it correlates so closely with news organizations that are doing well. We see it as very important because of the uncertainty and the lack of that news librarian role in many newsrooms. It's more important than ever to have a policy that states what you're going to preserve and how that's going to be done. And one way to get there is to tap somebody to be responsible. With the loss of librarians in news organizations, there has got to be somebody paying attention, and we found a number of cases where there is no longer anyone paying attention. So that's a key recommendation. And then establishing a plan for unpublishing. This is a new area that's a result of digital proliferation across society, where many people are concerned about privacy and are hiring reputation management companies to pressure newsrooms to remove content that might be embarrassing. Many news organizations really aren't aware that these kinds of things are happening, because a request might hit a low-level person and somebody will say, sure, go ahead and delete it. A lot of newsrooms really don't have established processes for this. And because it affects the actual public record, it's very important that this be done intentionally and with clear policies and procedures. Yeah, unpublishing is really gaining more and more attention in the news industry. I'll say that it's really important to look at. So our second set of recommendations is medium-term actions, as we say, that will take longer to accomplish. They may need investment: they may take some money or pull in some tech staff or something.
But they're really essential to making progress toward digital preservation. Neil, do you want to give the details on that? Yes, thanks, Edward. So where the previous slide's recommendations were kind of minimum immediate steps, here we're talking about things that are much more fundamental and permanent. You've got to have control of your metadata. So we're recommending that news organizations really dive deep and review it, because in many cases processes can be changed, modified and configured to preserve more metadata if somebody is just working the process and paying attention. Clarifying ownership is very important because, as we look to the future and this continued proliferation of channels, you need to understand which content you actually own so that you know what you're able to reuse. And this is more difficult than it may seem. A news organization's content comes from all sorts of places. It comes from your own staff members: reporters, photographers, videographers. It comes from freelancers. It comes from wire services, from syndication services, from the general public, from institutions and businesses, all kinds of sources. And you've got to know what the ownership and the rights are on all that content. We also suggest that news organizations run a self-assessment. Without somebody paying attention, it may not be clear that these things are going on. So we're recommending that leadership in newsrooms just run a little test: go back a year, pick the biggest news story of the previous year, and see if you can get hold of the original content elements for all of the news coverage you published on that big story. The original photos, original videos, original materials, the full-length text for all channels. Can you actually put your hands on that? That's what this is about.
Put your hands on it and, as you were saying, make sure that you have the rights to use it. That you have the rights to it, exactly. And then this last point is really the most important forward-looking step. We found that in order to do this properly, news organizations have to have some sort of system dedicated to preservation. The reliance on the web CMS, again, just isn't going to work, because a web CMS has a separate purpose. Its purpose is speed of delivery, and the way content is structured for the web is not the way it's needed for preservation. So an asset management system of some sort is going to be essential if you're looking to preserve content permanently going forward in the digital era. Essential, exactly. So, talking about asset management, some best practices. Neil, do you want to jump in on this again? Right. So we thought we'd share some of the best practices we saw in asset management, because again, this is such a critical area. I'll just jump right in. It's very important that you pay attention to workflows and build them in a way that reduces or, if possible, eliminates duplication of content. This is a big issue with the web CMS, because people take the same photo and throw it into the web CMS today, and then again when the story runs tomorrow, and next week, and a month from now. For that same, let's say, crime story or murder or election or whatever, you've got numerous duplicates of content. Which one's the original? In many cases, nobody knows. So you've got to set up a workflow that avoids that, and it's really possible to do. As a best practice, you want to preserve the full text in all channel manifestations. You want to capture those differences, because content may need to be modified for a certain channel, mobile, for example, or social media. You want to preserve all that.
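[Editorial illustration, not part of the talk.] The duplicate problem described above, where the same photo is uploaded again each time a story runs, is commonly handled by identifying a file by a hash of its bytes. The following is a minimal sketch under stated assumptions: the `AssetRegistry` class, its method names, and the story and channel labels are all hypothetical, and a real asset management system would use persistent storage rather than an in-memory dictionary.

```python
import hashlib


class AssetRegistry:
    """Toy in-memory registry (hypothetical): one master record per unique file content."""

    def __init__(self):
        # Maps a SHA-256 hex digest to the master record for that content.
        self._masters = {}

    def register(self, data: bytes, story_id: str, channel: str) -> str:
        """Record one use of a file and return the ID of its master copy."""
        digest = hashlib.sha256(data).hexdigest()
        # Identical bytes always hash to the same digest, so a repeat upload
        # reuses the existing master record instead of creating a duplicate.
        record = self._masters.setdefault(digest, {"size": len(data), "uses": []})
        record["uses"].append((story_id, channel))
        return digest

    def uses(self, digest: str):
        """List every (story, channel) pair that used this master copy."""
        return list(self._masters[digest]["uses"])

    def __len__(self):
        return len(self._masters)


# Example: the same photo is submitted for two stories on two channels.
registry = AssetRegistry()
photo = b"raw bytes of the original photo"
first_id = registry.register(photo, "election-night", "web")
second_id = registry.register(photo, "election-followup", "mobile")
```

Both calls return the same ID, so the registry holds a single master while still remembering each place the photo ran. Note that this only catches byte-identical files; a resized or recompressed copy would hash differently and would need the workflow itself to point back to the original.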
For visual information, it's really important to have systems that allow you to preserve the highest quality of your still images, video, graphics and interactive media. There are so many more kinds of interactive media these days, maps and interactives and so on. Those are very difficult to preserve, and they're going to continue to grow, but keeping these kinds of content permanently at their highest resolution and lowest compression levels is becoming critically important. The linkages we've already talked about. If you preserve the linkages between content and the packaging information on what went on the web and in mobile and so on, and you have that original content, you can recreate something close to the way that content was published in each channel. And that's really the goal, we think, for proper digital preservation. And then lastly, we think it's important that preservation be integrated with the publishing workflows rather than being a process after the fact. The higher the bar to making preservation happen, the less likely it is to happen with busy staffs and very low-resourced newsrooms and preservation groups. Yeah, we think this should be integrated as sort of an automatic thing, so that people don't even have to think about it. They don't have to be doing it themselves individually. Or manually, right. Or manually, yeah. Additionally, and some of these may seem obvious, you've really got to have the best possible metadata. So we wanted to provide some detail on that. Not only do you want the origin of your content, where it came from, the authorship, the institution; you also want the full rights information and licensing terms. But we think you also need multiple kinds of descriptive metadata that allow you to identify content by what's in it, and not necessarily in ways that are specific to a particular channel.
For example, not necessarily the web navigation that is often associated with news content, because that changes so often. You want descriptive metadata that's going to be permanent. There are standards out there that can be applied. And the use of multiple taxonomies is actually a good idea, because they may be needed and applicable for different purposes in the future. And then, of course, change tracking. Here we're talking about post-publication: not version management as things are being created, but what happens to content after it's committed to a repository or an asset management system. It's critical that this be tracked so that everybody knows what happened whenever content is changed, including through these unpublishing processes. It may be okay to change something, but you want to be able to know who did it, when it was done, why it was done, and what exactly was changed. Yeah, and dwelling on unpublishing, it would be best to be able to go back and see everything that happened and not have certain things deleted from the system entirely. You want to be able to go back and see, oh, this is what we did. This is the name we removed, or this was the charge that brought this up. So, in terms of workflow. One last set of details on best practices. What we saw in asset management at the best news organizations, in terms of what they do for preservation, are these kinds of things. What we saw tends to point toward a modified or improved architecture that is set up to ensure that content gets preserved, and again, as Edward mentioned, as automatically as possible, regardless of where content originates, on what device or in what system, because that's going to continue to expand and grow. You want a process that at a certain point takes all that content and preserves it as a master copy in your repository or your asset management system.
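[Editorial illustration, not part of the talk.] The metadata and change-tracking points above (origin, rights, multiple taxonomies, and a record of who changed what, when and why) can be pictured as a single record with an append-only history. This is only a sketch under stated assumptions: every field name, the `AssetRecord` and `Change` classes, and the taxonomy labels are hypothetical and are not drawn from the report or from any particular asset management product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Change:
    """One post-publication change: who did it, when, why, and what was changed."""
    who: str
    when: datetime
    why: str
    what: str


@dataclass
class AssetRecord:
    """Hypothetical master record for one preserved content item."""
    asset_id: str
    creator: str   # authorship
    source: str    # e.g. staff, freelancer, wire service
    rights: str    # ownership and licensing terms
    # Several taxonomies side by side: taxonomy name -> list of terms.
    subjects: dict = field(default_factory=dict)
    published: bool = True
    history: list = field(default_factory=list)  # append-only change log

    def log_change(self, who: str, why: str, what: str) -> Change:
        """Append a change entry; existing entries are never edited or removed."""
        change = Change(who, datetime.now(timezone.utc), why, what)
        self.history.append(change)
        return change

    def unpublish(self, who: str, why: str) -> Change:
        """Hide the item from public channels but keep the master and its history."""
        self.published = False
        return self.log_change(who, why, "unpublished (master copy retained)")


# Example: a staff photo described under two illustrative taxonomies, later unpublished.
record = AssetRecord(
    asset_id="photo-0001",
    creator="Staff photographer",
    source="staff",
    rights="owned by the news organization",
    subjects={"house_topics": ["courts"], "second_taxonomy": ["crime and justice"]},
)
entry = record.unpublish(who="managing editor", why="privacy request approved under policy")
```

The design choice worth noting is that unpublishing flips a flag and writes a log entry rather than deleting anything, which matches the point made above about being able to go back later and see exactly what was done.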
And if you do that properly, if you preserve not only the content at its highest resolution and best quality but also the links that allow you to package it the way it appeared in each channel, in other words, if you focus on the core content, then you don't necessarily need to preserve every web page. Because web pages, frankly, are changing all the time: the structure, the appearance, the organization. Websites change constantly, and mobile apps change constantly. So what you want is that original content along with, we think, the planning information, because that's a really good source of descriptive metadata. And if you move toward this kind of improved architecture of systems, we think that would allow news organizations to use their channel-specific systems, say the web CMS, to optimize for the thing it's most needed for, which is speed of delivery, and not worry about any preservation elements, because what's in the web CMS would always be only a copy of the content in that repository. So you have that new duality, where content is preserved at its highest and best quality with all linkages, and channel-specific systems can focus on channel-specific needs. And then in terms of recommendations, the third leg is these industry-wide, bigger ideas: long-term thinking that may involve policy changes and partnerships between institutions. So certainly collaboration across the industry. Neil, there's no reason why that couldn't happen. Yeah, we see the opportunity for news organizations to be much more open in terms of sharing and collaborating on preservation processes, possibly even pooling resources to preserve certain kinds or segments of news. Organizations can work together to share common preservation resources. Well, and we're seeing newsrooms collaborate. There's all kinds of new stuff going on in terms of collaboration among newsrooms.
So this would seem to tap into that trend. It does fit into that trend. You're seeing news collaborations all over the country. In terms of preservation technology, what we found is that many of the technology providers are actually very interested in this area, but they haven't heard as much from their customers about preservation functionality, systems and capabilities as they thought they would. So we think it's important to advocate for that, so that the technology is improved and does a better job of what preservation needs: preserving and maintaining for the long term the digital news content that we're creating. And then lastly, we do see an opportunity here for partnerships between the news industry and a number of institutions. In particular, we think there's a natural partnership with libraries. Libraries are an institution that exists in almost every community, right there side by side with news organizations, with local radio stations, newspapers, perhaps the digital startups that are arising in communities around the country. And there's a natural alliance of interest there that we think is important, because libraries care about this. These have been the institutions that have preserved much of the public record for centuries. But they're struggling too in terms of resources. Working together, I think there's a real future for those kinds of partnerships with news organizations. Just as an example, I think what happened at the Denver Public Library was fantastic. That came about because of the relationship between the Denver Public Library and the Rocky Mountain News. The Rocky Mountain News in Denver closed, I think, in 2009.
And because of that relationship between the Denver Public Library and Scripps, the owner of the newspaper, there was a full transfer of the ownership of that content. It has been preserved for the record, in the Denver Public Library archives online, for posterity. And that's a great success story. Great example, Edward. So we've touched on a good bit of what's in the report, but there's a lot more you can learn about digital news preservation from the report itself. If you go to rjionline.org slash preserve news, you can get the full report. There's contact information for me, and I can put you in touch with Neil. I'm not sure, Neil, I think you're on the RJI website as a fellow. So anyone who wants to get hold of us, go ahead. Go to the rjionline.org site, and you can find the report and our names. And we thank you for your attention. Appreciate you joining us. Thanks, everybody. Appreciate your interest.