My name is Arkadiy Kukarkin, and welcome to my very humbly named talk about decentralization at the Internet Archive. A couple of quick caveats. First, this is not strictly an official view of the Archive; it's sort of my own exegesis. Second, this was submitted as a 30-minute talk, turned into a lightning talk, and is now somehow 25 minutes, so it's going to be a little weird, but that's okay. And third, it does not currently contain Ethereum. That may change, hopefully as an outcome of this session; we'll see.

Okay, let's get started. Why decentralize the Archive, from a spiritual or ideological perspective? Since you're here, you probably already know what the Archive is, but as a quick recap: its best-known feature is the Wayback Machine, which has been archiving the known web since 1996. It has trillions of captures; you can punch in a given URL and travel back in time quite far. Of course, that's not all. We have thousands and thousands of other collections, everything from books to film to wax cylinders and vinyl. We digitize material ourselves, we partner with other institutions to preserve their collections, and so on. That's a lot of stuff.

We are also perhaps known for our beautiful headquarters, seen here in its previous incarnation as a literal church. Inside is our own terracotta army, which, by the way, is I think the best employee perk ever: after three years, you get a half-size statue of yourself made, and it's placed somewhat creepily on the side of the great hall. Within the very same room are some of our servers. We also have more conventional data centers, but for the purposes of this metaphor, this is a very good image. These are PetaBoxes; at one point they held about a petabyte of data each, and now it's much more. So we've got four walls, we've got a roof, and we've got our hardware serving you your Grateful Dead live set from 1972 or the White House homepage from the '90s.
And that is how we accomplish our mission of universal access to all knowledge, right? Well, for a long time this has been true, and it might be somewhat true today. But before I try to answer that question, I want to play this clip of our founder from the very, very beginning. You may have seen him earlier in the week; here's a much younger version. There's no audio, but there are subtitles, and he's talking about his thinking in starting the Archive, which at that time was just a web repository.

I want you to take two things away from this clip. First, this is really a web-native organization. We do have all this other content and we value it very much, but the web has largely eaten the world at this point, so it is in our DNA to essentially be the missing memory layer of the web. Up to this point, the best way to accomplish that was to put a bunch of stuff on servers in a box. Second, underlying our top-line mission of universal access to all knowledge is a technological imperative: to periodically assess the tools that are newly available, ask whether they can help us further that mission, and apply them accordingly. I think for the first time since our founding, we have something that's not just better servers, bigger hard drives, or better scanning equipment. We have the possibility of fixing this problem in a way that is universal and not just embodied in our service.

To zoom back in a little: why decentralize the Archive from a practical perspective? One reason is physical location risk. This is a map of seismic risk zones in the Bay Area, and the little logos correspond to our facilities. That's not great. We're actually building a new data center in British Columbia currently; unfortunately, that's also on a fault zone. And you might say, well, why don't you stop building your data centers in seismically active regions?
For reasons that are outside the scope of this talk, that is actually surprisingly difficult. Second, political location risk. We're addressing this a little with the Canadian expansion, but as you can see on this map, which is essentially a weighted index of the democratic quality of various countries, we're not doing great, and it's really trending downward. This is a real problem for us, because we have a lot of stuff that people want silenced. And third, we have pretty significant network bottlenecks. This chart shows user signups over time; you can see there were 16 in 1996, which is very cute. During the pandemic it just went crazy and has never really let up. In practice, this means our bandwidth is just totally cooked. We are putting fiber in as fast as the city authorities will allow us, and it's really just not enough.

So you might ask: why not just put everything on S3, use CloudFront, and not worry about any of this? There are multiple reasons. Some are ideological, but there are also practical ones like cost. Our modeled (and actually fairly real) cost is around $650 per terabyte to store forever, which in the model means 100-plus years. On S3, that's about $160 per terabyte per year, and that multiplied by forever is not a very nice number. With Filecoin, which I'm sort of presaging here but will get into a little later, the costs are vanishingly low, and might in fact be negative for us, because there is a network subsidy for culturally valuable data; how sustainable that is, we'll find out. Storj is another decentralized storage network, with a bit more here-and-now thinking: they're competing with S3, and they have a slightly better cost. There's also Arweave, which you might be familiar with as the "forever storage" solution, and it is expensive, quite pricey compared to our model.
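To make the cost comparison concrete, here is the quick arithmetic. The per-terabyte figures are the approximate numbers from the talk, and the 100-year horizon is the model's stand-in for "forever":

```python
# Back-of-the-envelope comparison of the two storage models: a one-time
# ~$650/TB "endowment" (modeled as 100+ years of storage) versus a
# recurring ~$160/TB/year on S3. Figures are the talk's round numbers.

ENDOWMENT_PER_TB = 650    # one-time cost, USD per terabyte
S3_PER_TB_YEAR = 160      # recurring cost, USD per terabyte per year
HORIZON_YEARS = 100       # the model's definition of "forever"

s3_total = S3_PER_TB_YEAR * HORIZON_YEARS
print(f"S3 over {HORIZON_YEARS} years: ${s3_total:,} per TB")  # $16,000
print(f"Endowment model:         ${ENDOWMENT_PER_TB:,} per TB")

# Years until cumulative S3 spend overtakes the one-time endowment:
breakeven = ENDOWMENT_PER_TB / S3_PER_TB_YEAR
print(f"S3 passes the endowment cost after ~{breakeven:.1f} years")
```

At these rates the recurring option overtakes the endowment in roughly four years, which is why "multiplied by forever" is not a nice number.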
I'm also a little unconvinced by Arweave's data availability model for material that's unimportant until suddenly it isn't: things that sit idle for a very long time and then, because of a political moment or something like that, suddenly become relevant. If you have other thoughts, please talk to me and convince me; I'd love that. And there are others I haven't mentioned here. I'd love to talk to the Swarm folks, for example; I don't know what their pricing model is, so if you're with that team, please find me later.

Aside from dealing with the negatives, there are also positive new opportunities with decentralized storage, and decentralized networks in general, like content addressing, which a lot of them support and which can allow for transparent link preservation. Right now, when a link breaks, we may have a backup copy. If it's on Wikipedia, we have a bot that will go in and replace the broken links in citations with our backups; or, if you're an enthusiast, you might be using our browser extension. But for the most part, when an HTTP link breaks, it just breaks. And of course there is Web3 forward compatibility as a nice bonus: when this material is in a decentralized network and is content-addressable, you can reference it using the standard IPFS libraries that I'm sure you're familiar with, and so on.

So I'm going to quickly jump into a couple of things that we have working today as steps in this direction, and then maybe come back to musings about what an actual decentralized Internet Archive might look like. The first demo is the simpler one, and it mostly addresses the bandwidth concerns: streaming media from Storj. If you'd like to try this yourself, you can. We don't currently have a real front end, so you'll need to grab this little bookmarklet from the presentation and add it to your toolbar, and you will be able to follow my example. All right, here we have a video from NASA.
We have a whole collection of their videos, and in the metadata here you see some internal identifiers, this archival identifier (an ARK), and then this Storj string. Any Archive item that carries this string is actually mirrored on Storj, so you can just hit the bookmarklet. Hopefully we do not get a demo fail here due to network issues... no, it's working great. Here is a pretty big video, full resolution, a 4 GB MP4, and it actually loaded; that's fantastic. So here we have some wacky astronaut things, and as you see on the right, it's being served from almost 3,000 nodes worldwide. Storj works sort of like an incentivized BitTorrent swarm; it's not quite the same, but the basic principle applies. They use erasure coding and basically pick 88 of the fastest nodes to give you a nice, robust stream without relying on something like CloudFront or expensive edge caching. So let's close that.

What do we need for this to actually be useful to y'all? One thing is obviously an actual front end, and that's on me. We also need browser support, because in this case I was cheating a little: I was using a TLS-terminating gateway, where my browser was talking to a Storj service that was then talking to all these thousands of nodes. The connections to the nodes themselves are HTTPS with self-signed certificates, so all you need to make this work in the browser in a truly peer-to-peer fashion is self-signed certificate support in a particular context. If you are Brave or Opera or another browser that wants to make a play in this space, please talk to me. And lastly, as with many decentralized systems today, there is actually a centralized point of failure: the metadata that tells you where to find all the pieces of a file. We're currently using a Storj-hosted metadata service (what they call a satellite), and we'll probably be running one of our own soon.
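The "fastest nodes" behavior described above can be sketched as a toy model: with k-of-n erasure coding, any k pieces reconstruct a segment, so if the client races all piece holders at once, the effective fetch time is the k-th fastest piece rather than the slowest. The node count and latencies below are made up for illustration, not Storj's real parameters:

```python
# Toy model of erasure-coded retrieval: request all n pieces in parallel
# and stop as soon as any k have arrived. The segment is ready at the
# k-th fastest piece's latency, insulating the stream from slow nodes.
import random

def segment_fetch_time(piece_latencies_ms, k):
    """Time until k pieces have arrived, assuming all are raced at once."""
    return sorted(piece_latencies_ms)[k - 1]

random.seed(7)
n, k = 110, 29                                  # hypothetical k-of-n scheme
latencies = [random.uniform(20, 500) for _ in range(n)]

t_race = segment_fetch_time(latencies, k)       # keep the k fastest pieces
t_worst = max(latencies)                        # cost of waiting on a slow node
print(f"segment ready after {t_race:.0f} ms instead of up to {t_worst:.0f} ms")
```

The design point is that no single slow or offline node can stall the stream, which is what substitutes for an edge cache here.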
But the real solution, of course, is not to have that single point of failure at all, so that is on the to-do list.

OK, the second demo is a little more complicated, and for it you might need a crash course in web archiving. The canonical format for web archives is called WARC. It's an ISO standard that the Archive developed back in the 2000s, and it's basically just a dump of the HTTP traffic between your browser and a given server. The actual file structure is basically a tarball: concatenated dumps of these different crawls. This is how the entirety of the Wayback Machine works: it's just a bunch of these giant files of HTTP dumps, plus some index data that tells you what offset in which file to read, and out comes the web page.

Let me go back to the demo. Here's an example of such a WARC file. It doesn't look like much here, because this is for internal use; if you opened it up in a different front end, you'd actually see the web pages, but this is how the sausage is made internally, just these files. Again, in the metadata fields we have these identifiers, identifier-cid and CommP, which are identifiers within the Filecoin network. We've been storing a lot of these web crawls; this particular data set is our pre-inauguration crawl. Every US presidential election, we capture the entirety of the .gov domain and associated sites, before and after the administration change, to see how the politics are reflected in the reality of the government web. We're using those data sets as a test bed because they are in the public interest and not copyright-encumbered, as US government data generally is. So we can grab this identifier here and go to a Filecoin network indexer. To avoid a demo fail, I've pre-filled this so it doesn't fall over. Here's that content ID.
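The WARC structure just described, concatenated HTTP-dump records plus an offset index, can be sketched in a few lines. This is a simplified stand-in, not the real format: actual WARC records carry ISO-standardized headers, and this is not any real library's API.

```python
# Minimal sketch of the WARC idea: records are appended to one big file,
# and a side index (CDX-style) maps each URL to the byte offset where
# its record starts, so replay is a seek + read, never a full scan.
import io

def append_record(archive: io.BytesIO, index: dict, url: str, body: bytes):
    """Append one record and remember its offset in the index."""
    index[url] = archive.tell()
    header = f"URL: {url}\r\nContent-Length: {len(body)}\r\n\r\n".encode()
    archive.write(header + body)

def replay(archive: io.BytesIO, index: dict, url: str) -> bytes:
    """Seek straight to the record, parse its tiny header, read the payload."""
    archive.seek(index[url])
    length = 0
    while True:
        line = archive.readline().rstrip(b"\r\n")
        if not line:                      # blank line ends the header block
            break
        key, _, value = line.partition(b": ")
        if key == b"Content-Length":
            length = int(value)
    return archive.read(length)

archive, index = io.BytesIO(), {}
append_record(archive, index, "http://example.gov/", b"<html>hello</html>")
append_record(archive, index, "http://example.gov/press", b"<html>news</html>")
print(replay(archive, index, "http://example.gov/"))  # b'<html>hello</html>'
```

This is the whole trick behind the Wayback Machine's storage layer: the giant files stay append-only, and all the smarts live in the index.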
The indexer has found a couple of peers where we stored it. We can map this peer ID to a miner ID here, and then grab it from that provider, by the identifier. For demo purposes I'm just going to discard the output... all right, fantastic. So here we are retrieving the bulk package from the Filecoin network. At the moment we're treating this as just a dumb blob: it's stored on Filecoin as essentially another copy. We have a couple of internal copies, and this is a third, sort of cold-storage, copy.

But things can get a lot more interesting. We can take a look at some other tools in this space. It looks a little weird at this resolution, but here we have a capture of the devcon.org site. This tool is not an Internet Archive tool, but it's sort of affiliated. We've made a capture of the site, and we can store it on IPFS and Filecoin through web3.storage. And now this bundle is a static, self-hosted application that provides a view into this web archive the same way you would get on the Wayback Machine, but there's no server, right? This is just loaded from IPFS. And it has all the nice archiving features built right in: you have the timestamp, you can travel back in time, and if I had earlier copies, the links would reference them, and so on. The next steps in the Filecoin work for us are to make our captures structured and compatible with something like this, so you don't actually need our servers to interact with the data.

Beyond that step, there are a few things we need to make this work for real. One is encryption and ACLs. You might think of public-good information sets, such as our collections, as being essentially open, and that's generally true.
But because we capture from the open web, and some of our collections come from sources where so much material comes in that it's very hard to do diligence at ingestion time, there are cases where things do require a certain degree of access control, for instance in case of legal action. There's also a broad concern about data mining of these sets: we seek to primarily support users and good-faith researchers, but there is a spectrum of use cases that goes from white to gray to black.

The two most important things, though, are indexing and metadata, and scale. First, indexing and metadata. Right now we're able to store this bulk data fairly well. We've spent about a year on it, and while it seems like a simple thing, at the end of the day it turns out not to be. But the thing we don't have a solution for right now is the index. As I mentioned, we have trillions of captures in the Wayback Machine and billions of other objects. You can't put that on chain, and it has to be discoverable somehow, so I'm very open to suggestions from the audience after my talk on how to attack this. Second, scale. We have hundreds of petabytes in our collections at the moment, and that's a lot of data. Most folks here who work on chain things are probably dealing with at most a few megabytes; by the standards of the blockchain world, what we deal with is enormous, and we've got tons and tons and tons of it. That's something the Filecoin team has been very supportive of, but we are just getting started.

And I think that's it. The question I leave you with, which goes back to the beginning, is: can the web have a memory? Not just Web3, which is already somewhat set up for that, but all web content. Because we value culture that is not just this narrow domain we inhabit. We value all culture, right?
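As an aside on the index problem: the way the Wayback Machine makes trillions of captures discoverable today is a flat, sorted, CDX-style index keyed by canonicalized URL, searched with binary search rather than a database. A minimal sketch, with simplified records and urlkey format standing in for the real thing:

```python
# CDX-style lookup: one enormous *sorted* list of
# (urlkey, timestamp, location) entries. Sorted order is what lets
# binary search find any URL's captures without scanning or a database.
import bisect

cdx = sorted([
    ("gov,example)/", "19961101000000", "crawl-a.warc@0"),
    ("gov,example)/", "20200115120000", "crawl-b.warc@4096"),
    ("gov,example)/press", "20200115120005", "crawl-b.warc@8192"),
])

def captures_of(urlkey: str):
    """Binary-search to the first entry for this URL, then scan forward
    while the key still matches."""
    i = bisect.bisect_left(cdx, (urlkey, "", ""))
    hits = []
    while i < len(cdx) and cdx[i][0] == urlkey:
        hits.append(cdx[i])
        i += 1
    return hits

print(len(captures_of("gov,example)/")))  # 2 captures of the homepage
```

The open question from the talk is exactly this structure's decentralized equivalent: something with the same scan-free lookup property, but not hosted on our servers and not naively put on chain.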
So how do we preserve that forever? Thank you. I have a couple of minutes if people have questions. Are there any questions? Right here?

Q: When you're archiving the web, web servers very often differentiate the content they serve depending on the IP address from which you ask. It seems to me there's an opportunity for decentralization here as well, so that your crawlers pick the region from which they download the web. How are you dealing with this problem now?

A: That's a great point. We definitely run into these issues, whether being straight-up blocked or getting a page that looks meaningfully different depending on the requesting region. Right now we have a few semi-decentralized tools for this. There is an organization called Archive Team, basically a volunteer group that runs crawlers which receive these tasks. There is also academic research in which different archiving organizations enter into a consortium and synchronize their crawls from different regions. But I agree those are not super-scalable solutions, so I think my short answer is yes.

Q: Thanks for the talk. How do you deal with takedown requests, like DMCA takedowns?

A: Unfortunately, we are required to comply with those to continue existing as an organization. We have a legal team that processes them, and if they have merit and are made in good faith, we will generally block the item from being served.

Moderator: Unfortunately, time is up. Thank you so much, Arkadiy. And I'm pretty sure they can find you afterwards, right? OK, fantastic. Thank you very much.