Cool. So I'm going to tell you about compressing the Library of Babel. "The Library of Babel" is a short story by Borges, but we're not going to talk about compressing the short story. We're going to talk about compressing the thing that is discussed in the short story, which is a library that contains all possible books.

Well, I'm going to say all possible books with an asterisk here. The asterisk is that it doesn't contain books with illustrations, like Breakfast of Champions, which has the illustration above. It does not contain The Instructions, which has text going at funny angles for diagrams. It doesn't contain House of Leaves, because the word "house" is in blue. And it doesn't contain The Last Samurai, which contains some text in Ancient Greek. But what it does have is all possible books of the following specifications. You'll notice that that's 22 letters. I guess it's in Hebrew. It doesn't really say. So this is on the order of zillions of books, way more books than I have space for in my library.

And so let's compress them. The first thing we might try is gzip, which typically has a 2 to 1 compression ratio. This is not going to cut it, and it's not going to cut it for an even more fundamental reason: in this case it actually has a 1 to 1 compression ratio. To see why, let's first see how gzip works. We're going to show you a video by Julia Evans, who's one of the founders of Bang Bang Con. This is gzipping Poe's poem, The Raven. We're not going to have time to watch the whole thing, but what it does is, when it sees some text that it's seen before, it says go back like 17 characters and then just copy over three characters. That sometimes works really well, and this is a particular case where it works really well. You can see that it's starting to copy a lot more text, because it's a poem that tends to be pretty repetitive. But the Library of Babel has all of the books.
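The "go back 17 characters and copy three" idea can be sketched as a toy LZ77-style tokenizer. This is a minimal illustration, not gzip itself: real gzip uses DEFLATE, which also Huffman-codes the tokens and uses a much larger window. The function names, the window size and the minimum match length are all assumptions made for the example.

```python
def lz77_tokens(text, window=255, min_match=3):
    """Greedy toy LZ77: emit literal characters or (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Look for the longest match starting anywhere in the recent window.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i and overlap itself.
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))  # "go back dist, copy len"
            i += best_len
        else:
            tokens.append(text[i])  # nothing useful seen before: literal
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text by replaying literals and back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copy one character at a time
        else:
            out.append(tok)
    return "".join(out)


print(lz77_tokens("aaaaaaaa"))  # ['a', (1, 7)]
line = "nevermore, nevermore, nevermore"
print(lz77_tokens(line))
assert lz77_decode(lz77_tokens(line)) == line
```

On repetitive input almost everything after the first occurrence becomes a back-reference, which is why a poem with a refrain does so well.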
So it has the one that's all A's, right? And here gzip is going to do a compression ratio of like 1000 to 1. But it also has the ones that are all gibberish, and in fact it has a lot more that are all gibberish. And here it's not quite a compression ratio, it's more of an expansion ratio.

So why is that? The answer is the pigeonhole principle. The pigeonhole principle says that if you have 101 pigeons and you have 100 holes, then two pigeons will have to share a hole. In this case our pigeons and our pigeonholes are the decompressed and the compressed texts. If you manage to compress some of your books, well, you've used up that space in the set of possible compressed texts, so some of your other ones are going to end up having to get a little bit longer. Naively you think, well, I'm going to work around this, right? I can put a one in front of the ones that I can compress and a zero in front of all the other ones, and then it'll totally work. But you have to account for the space taken up by the ones and the zeros. So it turns out that this unfortunately is not a way that we can save space when we have all possible books.

But we don't need to compress it, in some sense, because I told you what it is, right? I described it to you. So maybe what I can do is just write a little program. If I give you this program and you run it and you wait until long past the heat death of the universe, you will end up with your own complete copy of the library. So the measure of complexity here is the size of your compressed data together with the size of your decompressor. In this case we don't have any compressed data, we just have the size of our decompressor, written in some programming language.
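Two of the claims above can be checked at small scale. The sketch below first runs Python's `zlib` (the same DEFLATE format gzip uses) on an all-A's text and on uniformly random bytes, and then shows what a "little program" for the library might look like. The generator is only an illustration: the 25-symbol alphabet (22 letters plus space, comma and period) follows the story rather than the talk, the letters chosen are stand-ins, and `chars_per_book=3` is a toy size, not the story's book length.

```python
import random
import zlib
from itertools import islice, product

# Part 1: the two extremes. Random bytes stand in for "gibberish" so that
# every byte is already fully used and there is nothing left to squeeze out.
n = 100_000
all_as = b"a" * n
rng = random.Random(0)  # fixed seed so the run is repeatable
gibberish = bytes(rng.getrandbits(8) for _ in range(n))

packed_as = zlib.compress(all_as, 9)
packed_gibberish = zlib.compress(gibberish, 9)
print(len(all_as), "->", len(packed_as))            # shrinks enormously
print(len(gibberish), "->", len(packed_gibberish))  # comes out slightly longer

# Part 2: a tiny "decompressor" with no compressed data at all.
ALPHABET = "abcdefghijklmnopqrstuv ,."  # 25 stand-in symbols (assumption)


def library(alphabet=ALPHABET, chars_per_book=3):
    """Yield every possible book of the given length, in dictionary order."""
    for symbols in product(alphabet, repeat=chars_per_book):
        yield "".join(symbols)


print(list(islice(library(), 5)))  # ['aaa', 'aab', 'aac', 'aad', 'aae']
```

The generator is a few lines long no matter how large `chars_per_book` is; only the running time grows.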
It turns out that if you have it in some programming language and you wish you had it in a different programming language, you could just put an interpreter for the first programming language on the front. So at most you have a constant amount of expansion based on the programming language you choose, and it doesn't really matter which one you choose.

So the short story discusses these hexagons. In between the hexagons, it says, there's one room for sleeping and one room for taking care of the bodily necessities, that is, the bathroom. But it doesn't discuss toilet paper. So the librarians have been using some of the pages from these gibberish books as toilet paper, and now we have to figure out how we're going to compress that, that is, how we're going to represent this missing data.

The pigeonhole principle is going to bite us here too. When we have a relatively sparse amount of missing data, we actually have to say which book is missing the data. We'd like to say, well, it's book 17. But the number of books we have is so vast that to represent a book's number is to represent its complete contents. Then we have to say what page number is missing. So it's actually worse to be missing data in this case, because now we have to store a whole book plus the page number of the missing page.

But as the librarians live there across the generations, the missing data becomes a little bit denser. At that point we can say, well, we'll use a page bitmap: one bit for every page, with a one if it's missing and a zero if it's present. During the early or late phases of the process, we could actually compress this bitmap. We say, well, we have a run of a bunch of zeros and then we have a one, then a run of a bunch of zeros and then a one.
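The run-based encoding of the page bitmap can be sketched as simple run-length encoding. This is a minimal sketch under assumptions: the bitmap is a Python string of '0' and '1' characters, the function names are made up for the example, and the 100-page bitmaps are toy sizes.

```python
import random
from itertools import groupby


def rle_encode(bitmap):
    """Collapse a string of '0'/'1' flags into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bitmap)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bitmap."""
    return "".join(bit * count for bit, count in runs)


# Early phase: only two of 100 pages are missing.
sparse = "0" * 40 + "1" + "0" * 58 + "1"
print(rle_encode(sparse))  # [('0', 40), ('1', 1), ('0', 58), ('1', 1)]

# Middle phase: each page is missing with probability one half.
rng = random.Random(0)  # fixed seed so the run is repeatable
halfway = "".join(rng.choice("01") for _ in range(100))
print(len(rle_encode(sparse)), "runs versus", len(rle_encode(halfway)), "runs")

assert rle_decode(rle_encode(halfway)) == halfway
```

The sparse bitmap collapses to a handful of runs, while the half-missing one needs roughly one run for every two pages, so writing down the runs saves nothing.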
It's in the middle of the process that we have a problem, because assuming that they're removing pages from books at random, the ones and zeros are random data. And again, because of the pigeonhole principle, you're going to have a bad time compressing that. Finally, as we get down to the last few volumes, we can just list the volumes that remain, and that's going to be the most efficient representation of the data.

And it turns out that if you do this in something like the real world, that is, you have a library that contains just the useful stuff, then you have Wikipedia. Well, okay, maybe. So now we're going to compress Wikipedia. Here are some compression ratios that we got from the Large Text Compression Benchmark. This is not all of Wikipedia. This is one gigabyte of one particular version of Wikipedia from way back that the benchmark's maintainer has chosen to use. We see that gzip gets about three to one and bzip2 gets about four to one. And then the best thing he's got is this phda9, which is eight and a half to one. That's pretty good. The interesting thing is that if you compare the ratios on the 100 megabyte chunk of Wikipedia to the ratios on the gigabyte chunk, the ratios are actually better on the gigabyte chunk. So what I wish I had, but we don't, is the 10 gigabyte chunk and the 100 gigabyte chunk. I think 100 gigabytes is probably about all of Wikipedia now.

And you'd like to do better than this. One reason you'd like to do better than this is because of this guy, Marcus Hutter. His theory starts from the premise that Wikipedia makes sense, which I think it mostly does, right? When you're compressing, you're making predictions about what comes next. gzip, for example, says, well, what comes next is probably a lot like what we've already seen. And this prediction requires some level of intelligence.
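The link between prediction and compression can be made concrete: a symbol that a model predicts with probability p can be coded in about -log2(p) bits, which is what an arithmetic coder approaches. The sketch below sums that cost for a very simple adaptive model that just counts the characters it has seen so far. The model, the function name and the smoothing over an assumed 256-symbol alphabet are illustrative choices, far cruder than the compressors on the benchmark.

```python
import math
from collections import Counter


def adaptive_code_length(text, alphabet_size=256):
    """Total ideal code length in bits under an adaptive character-count model.

    Assumes the text uses at most `alphabet_size` distinct characters.
    """
    counts = Counter()
    seen = 0
    bits = 0.0
    for ch in text:
        # Predict before looking: add-one smoothing over the whole alphabet.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        bits += -math.log2(p)  # better predictions cost fewer bits
        counts[ch] += 1        # then learn from what actually came next
        seen += 1
    return bits


text = "nevermore " * 100
bits = adaptive_code_length(text)
print(f"{bits / len(text):.2f} bits per character, versus 8 uncompressed")
```

The first few characters cost about 8 bits each because the model knows nothing yet; the cost per character drops as its predictions improve.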
And gzip is sort of the not very bright level of intelligence, because it's using an algorithm from 1977 or '78, I can't remember. But as you get into the more advanced compressors, they actually end up mixing thousands of predictive models for what text is going to come next. So Marcus Hutter, who's apparently a respected AI researcher, says that compression is intelligence. And he's put his money where his mouth is. If you can compress 100 megabytes of Wikipedia better than the current best compressed version, you can earn a fraction of the 50,000 euros that he's put up. And in the process, you can make compression smarter. Or at least that's what he thinks.

So Borges actually has sort of a funny answer to this, and since I have an extra couple of minutes, I can show you this objection. He says, actually, the library has all the stuff that we think is nonsense, but it's not nonsense. It has some sort of meaning. You just have to take the right interpretation of it. And so to say, well, if we predict this, that's intelligence? Maybe not. Maybe the real intelligence is taking a string of apparently random letters and figuring out what it means, or what you could make it mean.

I highly recommend that you read both this short story and all four or five of the works that I listed 17 slides ago. And thank you.