My name is Anand and this is Noufal. Both of us work for the Internet Archive. This talk is an introduction to the Internet Archive and its architecture. So let's get started.

That's our office in San Francisco. Noufal and I work from India, and the office is in San Francisco, United States. Before I introduce the Internet Archive, I want to ask you a question: what is the average lifetime of a URL? Once a URL is created, how many days will it stay live and accessible? Any ideas? The estimates say the average lifetime of a URL is 44 days. Some say 75 days or 100 days, but that's the ballpark. What does that mean? It means 44% of the sites available on the Internet in 1998 had vanished within a year. So what happens to all that information? There could be something very important that is simply not available afterwards.

Let's look at a URL and see what it gives. I have a cached copy of it, and that's what you get: a scientific article. This isn't some small site, it's a very popular one, and yet you can't find that URL anymore. So in 1996 the Internet Archive started a project to create an archive of the Internet. This is the Scientific American article we were looking at, served from the Wayback Machine, the Internet Archive project that archives the web. And that's the Internet Archive website from 1997, also from the Wayback Machine. The Wayback Machine is a project of the Internet Archive that takes snapshots of the Internet from time to time, so it has maintained snapshots of the web since 1996. You might wonder why that is useful. It's very useful because a lot of information is now available only in digital form; if the lifetime of a page is that short, how are you going to access it later? Let me show you something more interesting.
This is a blog post from the CEO of Sun about Android; it says Sun is very happy about Android. How would somebody find information like this? Say there's a court case between Google and Oracle. How would they find out what happened in the past, when that record existed only in digital form? Without something like the Internet Archive's Wayback Machine, you wouldn't be able to see this information at all. This page is available from the Wayback Machine; that's a capture from October 23, 2010.

So let's see how the Wayback Machine works. I'm going to take apple.com as an example and see how the website has changed over time. If you go to the Wayback Machine and look it up, it shows you the snapshots that were captured. The bar on top shows when it was captured, and the circles show the capture dates. This is how the Apple website looked in 1996. Let's look at the next year: that's '97, then '98, '99, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, and that's the website in 2009 and 2010. So you can go to the Wayback Machine and watch a website change over time, or find out how it looked a couple of years back.

The Wayback Machine is one of the most popular projects of the Internet Archive. In fact, if you look at Wikipedia citations, the Wayback Machine is the second most cited resource on Wikipedia. That's because websites keep vanishing over time, and the only way to access them is through the Wayback Machine. Archive-It is another project of the Internet Archive: it allows partners to select websites and have them archived, and a lot of libraries use this service. Besides the Wayback Machine and web archiving, the Internet Archive has a couple more projects. One of them is book scanning: the Internet Archive scans about a thousand books a day, and most of them you can read on its website.
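Snapshots like the October 23, 2010 capture above are addressed by a 14-digit timestamp embedded in the Wayback Machine URL (YYYYMMDDhhmmss). As a small illustration, here is a sketch that pulls that timestamp out of a snapshot URL; the example URL is illustrative, not a specific real capture.

```python
from datetime import datetime

def capture_time(snapshot_url):
    # Wayback snapshot URLs look like:
    #   http://web.archive.org/web/<14-digit timestamp>/<original URL>
    # Take the path segment right after "/web/" and parse it as a date.
    stamp = snapshot_url.split("/web/")[1].split("/")[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

print(capture_time("http://web.archive.org/web/20101023120000/http://blogs.sun.com/"))
# → 2010-10-23 12:00:00
```

This is why old captures stay linkable: the capture date is part of the URL itself.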
This is the scanning center in San Francisco, and what you see on the left and right are the book scanners. Those are the scanning centers around the world. And this is Open Library, a project started to create a catalog of every book in the world. It has about 25 million books cataloged, and it gives access to about a million free books which can be read online from the Internet Archive; it works like a front end to the Internet Archive's books. Public-domain books you can read right there, and some books you can borrow.

This is the Internet Archive's book reader. The scanned books can be read using this software: you just open your browser and flip through the pages. It lets you search through the book, jump to pages, and use the table of contents, and it even lets you embed the book reader inside a web page. One interesting thing about this particular book: if you look at the notes written at the top and flip through it, it has a lot of handwritten notes like that. It's a book on optics by Isaac Newton himself, and those notes were written by Isaac Newton, with his signature. Now you can hold that book, not in your hands but in your computer, and read through his notes.

And this is the Internet Archive website, showing all its projects, including the text and audio collections. If you look at the scale of the Internet Archive, it has about 3.5 million books and more than 170 billion web pages archived so far. My colleague Noufal will now talk about how the Internet Archive handles this massive scale.

So this is a big data conference, and it's fitting to start with the size of the Archive's data: about 3.5 million books, and more than 170 billion web pages, which is the Wayback Machine.
Plus there's a lot of other stuff, and in total it comes to around six petabytes of data. So I'm first going to talk about how those six petabytes are managed. How do you manage large amounts of data? You impose some kind of structure on it so that you can talk in terms of larger abstractions.

The Archive has things called items. Items are the fundamental unit of storage on the Archive, and each item is basically just a directory; the Archive contains around ten million of them. An item contains the original files that were uploaded. It also contains what are called derivatives: if somebody uploads, say, an MP3 of a song, the Archive automatically creates Oggs and other audio formats so that it becomes more accessible. Those are not original data, but equivalent data generated by the Archive's software. And there are metadata files containing things like when the item was uploaded, who the artist was, which album it was from, and so on, depending on the type of item.

Each item is stored on two servers, a primary and a secondary. They're usually on separate racks, because a whole rack can get fried and you don't want to lose your only copy of the data.

This is an example of an item page on the Archive; the URL is archive.org/details/ followed by the identifier, for Hollywood Without Makeup. At the top there's a classification: it's part of the moving image archive, movies, short format films. You have the original format that was uploaded and the derivative formats over here, and you can download all the files. And since it's a video file, there's an embedded player, so you can actually watch it.
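The item structure described above can be sketched in a few lines. This is a toy illustration, not the Archive's real on-disk layout: the filenames and the metadata format here are invented, but the shape matches what was described, one plain directory holding the original upload, its derivatives, and flat metadata files.

```python
import os
import tempfile

def make_item(root, identifier):
    """Create a toy 'item': a directory with originals, derivatives, metadata."""
    item = os.path.join(root, identifier)
    os.makedirs(item)
    # The original file as uploaded by the user.
    open(os.path.join(item, "song.flac"), "w").close()
    # Derivatives: equivalent data generated automatically from the original.
    for deriv in ("song.mp3", "song.ogg"):
        open(os.path.join(item, deriv), "w").close()
    # Flat metadata file kept next to the data, not in an external database.
    with open(os.path.join(item, identifier + "_meta.xml"), "w") as f:
        f.write("<metadata><title>Example</title></metadata>")
    return item

root = tempfile.mkdtemp()
item = make_item(root, "example-item")
print(sorted(os.listdir(item)))
# → ['example-item_meta.xml', 'song.flac', 'song.mp3', 'song.ogg']
```

Because everything about an item lives inside its directory, copying the directory to a second machine (the secondary) is all that replication requires.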
Incidentally, the Archive made videos like this available in a streaming format before YouTube existed; a bit of tech history. This is the bottom part of the page. You can see all the files here: the MPEG-4 version, the Ogg version. We generate an animated GIF from the video so that you get a thumbnail of sorts. These are the metadata files, and there's a comment board, so enthusiasts can comment on the different videos.

Items are the lowest level of data. Above them there's a higher-level categorization called collections. Collections are basically groups of items. As you saw on the earlier slide, the item was part of the moving image archive; that's a collection of videos. We have different kinds of collections depending on what they're for. For example, there's a 9/11 collection: news footage from when the September 11 catastrophe occurred. These terms come from the library domain.

This is an example of a collection page. The collection is called Netlabels, and it's a sub-collection of the audio archive. Once you have collections, you can ask which items are in a collection, which are the most commonly downloaded, which are popular, and you can find similar ones. It's the way you look at things in a library.

Now, since we have to maintain this amount of data, you should understand that the Archive started in '96. At that time there was no cloud; you couldn't just put your data onto the cloud and leave it there. So we had to build custom hardware. The PetaBox, the Internet Archive's storage machine, was designed by the Archive staff themselves.
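The item/collection relationship above is simple enough to sketch directly. All names and numbers here are invented: the point is only that a collection is a named group of item identifiers, so library-style questions ("what's in this collection?", "which item is most downloaded?") reduce to plain lookups.

```python
# A collection is just a named group of item identifiers (toy data).
collections = {
    "netlabels": ["album-a", "album-b", "album-c"],
}

# Per-item download counts (invented numbers).
downloads = {"album-a": 120, "album-b": 4500, "album-c": 37}

def most_downloaded(collection_name):
    """Return the most-downloaded item in a collection."""
    return max(collections[collection_name], key=downloads.__getitem__)

print(most_downloaded("netlabels"))
# → album-b
```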
It was originally designed so that one machine holds one petabyte of data, and one petabyte was a lot back in the day; in '96, one petabyte was quite a bit. It was designed to be low power. These specs are from 2010: six kilowatts per rack, and high density, so one rack can hold up to 650 terabytes of data. They run without air conditioning, which is kind of cool: the heat from the racks is actually used to heat the building in the main office. You saw the picture of the office; there's no central heating, the heat from the servers heats the office, and consequently the servers don't need air conditioning. There are four data centers in total, with 1,300 nodes and 11,000 spinning disks. We use commodity hard disks, ordinary consumer hard disks. Cheap.

So that was a bit of hardware porn. This is the original PetaBox, the one which was designed first, and this is the current generation, what it looks like right now. This is what it looks like inside the Archive's great hall. The main building is a former church, and it has a great hall that looks like this, with the servers at the back of the hall. You can see that arched opening at the top; that's where the heat goes up.

Then the question: how big is the Internet? Since we're in the business of the Wayback Machine, archiving the entire web, how big is it? And the answer is: one shipping container. Twenty feet by eight feet by eight feet. The entire Internet, as represented by the Wayback Machine, is stored inside that one shipping container.

Okay, now let's talk about the services behind the Archive website. There are three basic things, and they form the majority of the infrastructure.
These three services form most of the Archive's infrastructure. The first is the locator service. The locator service helps you find where a certain item is stored on the cluster. You have a large number of machines, a large number of nodes, and when you go to archive.org/details/something and ask to download a file, it has to actually find the server where the data resides. It's a very simple program: it sends out a UDP packet to the whole cluster, the machine that actually has the item responds, and the request is redirected to that machine.

The second thing is the catalog. The catalog is something like a task queue. Since the Archive tries to store its information in a simple format, throughput is limited, so we process things in an offline queue. The catalog holds tasks which execute as and when resources become available; it's an old-fashioned message queue. If you look at the Archive's catalog, it contains a list of all the tasks that were executed, with tombstones for the ones that completed; you can still find tombstones for really, really old tasks in there.

The third thing is the deriver. The deriver is a piece of software that runs when new items are uploaded or modified, or on demand. It creates new derivative formats from what was originally uploaded. For example, FLAC files are usually very big, so if you upload one, the deriver automatically converts it into MP3s, Oggs, and other formats so that it becomes playable and more usable; it increases the usability of the data. While doing this kind of thing, we also augment the metadata from external sources like MusicBrainz.
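The locator's "ask everyone, the holder answers" pattern can be sketched with plain UDP sockets. This is a toy on loopback with one node and an invented message format, not the Archive's actual protocol: a real deployment would broadcast to the cluster and redirect the HTTP request to whichever node replies.

```python
import socket
import threading

# Toy state: the set of items this one node holds, plus rendezvous data.
ITEMS_ON_THIS_NODE = {"hollywood-without-makeup"}
ports = []
ready = threading.Event()

def node():
    """One storage node: answer a UDP query only if we hold the item."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))            # OS picks a free port
    ports.append(sock.getsockname()[1])
    ready.set()
    data, addr = sock.recvfrom(1024)       # wait for one locate query
    identifier = data.decode()
    if identifier in ITEMS_ON_THIS_NODE:   # stay silent otherwise
        reply = "node:%d/%s" % (sock.getsockname()[1], identifier)
        sock.sendto(reply.encode(), addr)
    sock.close()

def locate(identifier):
    """Client side: query every node; first reply wins."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5)
    for port in ports:                     # stand-in for a real UDP broadcast
        sock.sendto(identifier.encode(), ("127.0.0.1", port))
    reply, _ = sock.recvfrom(1024)         # the node that has the item answers
    sock.close()
    return reply.decode()

t = threading.Thread(target=node)
t.start()
ready.wait()
location = locate("hollywood-without-makeup")
print(location)
t.join()
```

The appeal of this design is that there is no central index to keep consistent: the nodes themselves are the source of truth about what they store.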
Another example of what the deriver does: if you upload a PDF of a book, it will automatically OCR it and build a full-text index so that you can actually search through it. The new files are created, the metadata is updated, and the item is synced back to the primary. Derivation is done by a catalog task; I mentioned earlier what the catalog is.

Here's an overview of our software stack. The front end is basically PHP; you should remember that this started in '96. So: PHP, nginx, Solr for all the searching and to render many of the pages, MySQL to store some of the metadata, and Redis, used primarily as a cache. Python and Java in places: the Wayback Machine is mostly in Java, and we're trying to slowly move it to Python. KVM for virtualization, and a bunch of the typical tools, Nagios, Graphite, MRTG, Cacti, et cetera, for monitoring.

All right, now, big data. What makes the Archive different when it comes to big data? Our considerations are different. For most organizations, big data is for mining, analytics, visualization, things like that. At the Internet Archive, what we do with big data is preserve it. Because of this, the considerations are different: our idea is to make sure this data is available forever, for the long term, and we try to keep it that way. That gives us a certain kind of vision.

Let me give you an example. Back in August 2000, if you had to save your valuable data, this was what was available: you'd get up to 100 MB of free space, you could upload your stuff there, and it would preserve it for you. It was what Dropbox is today: no matter where you go, everything on your X:drive is just a click away. But X:drive shut down. Many of these things are not really long term. Another example is this.
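The deriver examples above (FLAC to MP3/Ogg, PDF to OCR text) amount to a dispatch table from upload format to derivative formats. Here is a minimal sketch of that dispatch; the rules and filename suffixes are illustrative assumptions, and a real deriver would shell out to transcoders and OCR engines rather than just compute target names.

```python
import os

# Hypothetical mapping from upload extension to derivative suffixes.
DERIVATIVE_RULES = {
    ".flac": [".mp3", ".ogg"],   # big lossless audio -> playable formats
    ".pdf":  ["_djvu.txt"],      # book scan -> OCR text for full-text search
    ".avi":  [".mp4", ".gif"],   # video -> playable format + animated thumbnail
}

def plan_derivatives(filename):
    """Return the derivative filenames to generate for an uploaded file."""
    stem, ext = os.path.splitext(filename)
    return [stem + suffix for suffix in DERIVATIVE_RULES.get(ext, [])]

print(plan_derivatives("concert.flac"))
# → ['concert.mp3', 'concert.ogg']
print(plan_derivatives("book.pdf"))
# → ['book_djvu.txt']
```

Running derivation as a queued catalog task, as described above, means an expensive transcode never blocks the upload itself.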
This was a disk released by Iomega that was supposed to hold far more than floppies, and "your life is safe here." I think if you go to the Iomega site right now, all you get is cables for the Zip drive, if you already have an old one. So when we engineer things, we have to engineer them for the long term. That's the way we try to design things.

Our basic concern is long-term preservation. We try to keep things simple. One example of what I mean by simple: we store the metadata along with the actual items, in flat files, rather than putting it externally in a database or something, and we just use Unix permissions rather than keeping permissions separately in a database. We also try to be independent: we don't outsource our storage and things like that; we keep it within ourselves so that we have control over what's going on. And low maintenance, because we're a nonprofit: the staff is limited, the resources are limited, so we have to keep it low maintenance. And low cost, where that doesn't conflict with the points above. The basic thing is that we're not really concerned with the code; we're concerned with the data. The code will die after a while; it will become obsolete and go away. The data should remain forever; it should always be available. That's the consideration.

So that's it. Thank you. Any questions, comments?

Q: I'm Sandil Nagam. I have a couple of questions. First question: do you get DMCA takedown notices?

A: We do get takedown notices for some things, as far as I know.

Q: Okay, second part: are you exposing any APIs where we can access these data sets? Because these are excellent data sets.

A: Yeah, all this stuff is directly downloadable.

Q: No, no, over the web, via APIs. We don't have raw access to it.
A: So what kind of API are you looking for?

Q: Okay, I used to have a couple of sites which I no longer own or manage. The data is still there, and I would like to know what my thoughts were back then. I would like to have access to my data, not through the website.

A: So you're talking about the Wayback Machine? I don't think the Wayback Machine has any APIs as of now.

Q: Are you doing anything with news? Can I check the news from any newspaper or any website over the last couple of years?

A: I don't think we do that specifically. You can look at the news websites and then go through the Internet Archive Wayback Machine. But we do have a TV archive: you can go and actually look at how the news was reported on TV channels over time.

Q: My question is about how you handle copyright. All the data that you're archiving must be subject to copyright. What are the policies that you follow, or do proprietary organizations not come after you?

A: The Archive's crawler respects the robots standard. So if you have a website which you don't want crawled, for whatever reason, you can specify the Archive's user agent, ia_archiver, in your robots.txt and it won't be crawled. You can also retroactively request to have material removed, and that's done. The specifics of copyright depend on the actual media in question: if it's books, it's one thing; if it's audio, it's something else. It's a legal question, and I'm not really qualified to answer it. Anything else?

Q: Do you actually do deduplication or anything like that on the back end? Week after week, not all the data changes for a website.

A: Especially with some kinds of media, like music for example, there are attempts to actually do deduplication. If you're archiving a certain CD, we have some things in place.
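The robots.txt opt-out described in that answer can be demonstrated with Python's standard-library robots parser. The rules below are a made-up example of a site opting out of archiving; `ia_archiver` is the crawler's user agent mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt for a site that wants to opt out of archiving
# while remaining crawlable by everyone else.
robots_txt = """\
User-agent: ia_archiver
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("ia_archiver", "http://example.com/page"))  # → False
print(rp.can_fetch("googlebot", "http://example.com/page"))    # → True
```

A crawler that honors this check simply never fetches disallowed URLs, which is what makes robots.txt an effective opt-out mechanism.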
In fact, I'm working on something like that right now, which tries to tell us that a thing has already been archived so we don't have to do it again. And it's not very straightforward, because things get released in one country and then re-released in another, and so on; there's work at the track level, at the album level, and things like that. So there's some work being done there, but as far as I know, user-uploaded data is just uploaded and kept as-is.

All right. Thank you.