We're going to start simple, Jake, because I didn't know anything about text compression or, I mean, it's not even text compression. We're just going straight in, by the way. I just realized I started, but no, no clap. No, we should just get in the zone. OK, three, two, one. So, Jake, we have spent a considerable amount of time in the last years getting our heads into image compression and how images compress. And what is interesting about compressing images is that the result of the decompression is looked at with our eyes and interpreted. What? Hang on, that sentence was just a load of words. What? What was that? Can we do that again? The result of the decompression is just looked at with our eyes and interpreted by our brains. Oh! It did make sense. Yeah. That's fine. I think it was a job. You know, sometimes you parse a sentence incorrectly at the start and it just messes you up. It just doesn't make sense, doesn't compute. Yeah, no, I get it. It was unnecessarily floofy. What I'm saying is, when we look at images, our brain does a lot of inference in trying to help us, but interference nonetheless, which is why image compression is so interesting, because codecs can actually change the image to make it more compressible in a non-recoverable way and get away with it, because our brains do the work for us. When it comes to text... Lossy compression, right? So your JPEGs, your AVIFs, stuff like that. With text, it's different. You can't really, you know, lossy compress text. It will most likely end up not being intelligible anymore. Yeah, it'd be fun to try. Like those optical illusions where they've rearranged the middle of all the words, but it's still legible. That would be interesting. Let's do lossy text compression. I mean, I'm talking about text compression, but really what I'm talking about is just lossless compression in general, because the same goes for all our HTML, CSS, and JavaScript that goes over the wire on the web. We want that exact same code to appear in the user's browser so that the actual right program executes. We don't want that to be lossy. And for reasons that are beyond the scope of this episode, I wanted to compress a piece of user input with Gzip. And yes, Chrome has CompressionStream, but nobody else does. And so I had to do it in userland. And I looked up libraries, and they were all big, and I couldn't read them, and I got frustrated. And so I was like, you know what? I'm going to write my own. But to write my own, I had to understand how Gzip works. And I've never, not in university or anywhere else, ever looked at how these work. And I thought they were black magic, and it's going to be really complicated, but I'm going to use this opportunity to get a basic understanding. And it turns out it's both easy and complicated. And I thought I would walk you through it. Yes, please do it. Because I have no idea. It's one of those things where, with a couple of libraries that I maintain, I try and get the file size as small as possible because it looks good on your npm page. I'm like, this is only 800 bytes or whatever. And the amount of times that I go into the code, do a bit of code golf, make it smaller, and then look at the Gzip result, and it's now bigger. And I don't know why because it's magic, but hopefully now you're going to tell me. Yeah, I think at least maybe you will have a bit more of an understanding of why certain changes make Gzip perform worse. So we are going to start from nothing.
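For context, the CompressionStream approach mentioned above looks roughly like this. It's a minimal sketch, assuming a browser that supports the API; gzipString is just a hypothetical helper name.

```js
// Minimal sketch: gzip a string in the browser with the CompressionStream API.
// At the time of this episode it was Chrome-only; gzipString is a made-up name.
async function gzipString(text) {
  const compressed = new Blob([text])                // wrap the text in a Blob...
    .stream()                                        // ...read it as a ReadableStream...
    .pipeThrough(new CompressionStream('gzip'));     // ...and pipe it through gzip
  return new Response(compressed).arrayBuffer();     // collect the compressed bytes
}

// Usage: const bytes = await gzipString('self.addEventListener("install", ...)');
```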
And the reason I'm looking at Gzip is because on the web, it is one of the most prevalent, widely supported formats. Whenever you send a request to a server, you usually, not you, but your browser, sends along which compression formats it understands. It's not called compression, it's called encoding, I think because technically other encodings existed where the primary goal is not necessarily compression. But I think most of the time nowadays, these three are what a browser will send. And I will not talk about Brotli because I have not looked up how Brotli works, but we will talk about deflate and we will talk about Gzip. So after starting my research, I actually realized that pretty much all modern text compression goes back to two basic compression algorithms from the 70s, one of them being LZ77 and the other one being LZ78. And the rest is just smart combinations of pre-encoding and post-encoding transforms to make things even smaller. Yeah, I remember these formats coming up when I was compressing things on my Amiga 600. So yes, it's been around a long time. So Gzip and deflate both depend on these. As I said, there are these two basic encodings, LZ77 and LZ78. They're not formats as such, just algorithms, principles. And they have a whole lot of offspring. All the ones one row down from LZ77 and LZ78 start with LZ because they're based on the same approach, but they add Huffman codings or use other optimizations or use Markov chains. And basically everything you have at the bottom, the formats, they rely on those or variants of those algorithms. So for example, GIF uses LZW, PNG uses LZMA, if I'm not mistaken, no, PNG also uses deflate, which Gzip also depends on. It's super interesting. 7zip is LZMA, now I remember. So yeah, it's really interesting. And so looking at LZ77 and LZ78 is pretty good bang for the buck because they are the foundation of pretty much all modern lossless compression that's happening. But we are only gonna look at LZ77 today because it is actually reasonably easy to understand. So this is a piece of service worker code that I ripped out of Squoosh. You know, we have an install handler and an activate handler. And the basic idea behind LZ77 is to say, you know what? There is a piece of text that we already had before. And by the time we would send it a second time, why do we actually send it a second time? Instead we could say, don't send those bytes, but instead send a tuple saying, go 421 bytes backwards and copy 23 bytes from there on. And if you paste them here, you will end up with the exact same result as if we just sent the entire file byte by byte. And this is really good because compared to some other compression formats that have like a dictionary at the front or a tree at the front, like with Huffman, this format streams, because straight away you can get the text, and all of the bits of compression are just back references. So it doesn't damage streaming at all. Exactly. Now, if you're a close observer, you might be asking yourself, well, to a computer, all of these things are just bytes and numbers. How do we distinguish whether the numbers we're getting are one of these back references or if it's, you know, just a literal byte? And you know, there are multiple ways to solve this. And what LZ77 did sounds a bit stupid, but that's what it does, because they say, you know what? We are not gonna have literals at all.
We're just gonna make everything, not a tuple, but a three-tuple, where we have the distance, the length of the back reference, and the next character, like this. And so if you don't actually want to have a back reference, you just put zero for the first two numbers. And so an entire file ends up looking like this. You know, there's (0, 0, s), (0, 0, e), (0, 0, l), and you spell out self.add. And the second d can actually become a back reference because we just had a d one byte earlier. Now that's... That sounds inefficient to me. It sounds slightly, you know, it is. Because obviously what used to be one byte for a literal is now, I mean, it actually depends on how much it is, because you can't just have arbitrary number lengths. You limit how far back you can reference and how long a back reference can be. So for example, you could do 12 bits of back reference distance and four bits of length. So you can at most go four kilobytes backwards and you can at most have 16 bytes of copy paste. Most commonly you have 16 bits of back reference distance, so 65K that you can go backwards, and up to 256 characters that you can copy paste over. So that means what used to be one byte for just a character is now four bytes, which is actually making your file a lot bigger. And yet, from a certain size onwards, when at least it's human readable text and not, you know, already compressed binary files, this works, this works quite well actually. So if we take a full JavaScript file, that is, you know, let's say one kilobyte and above as a random number, you will actually end up with something smaller on the other side. And I thought that was quite impressive. I didn't expect that. That is quite impressive. I didn't know that. I wouldn't expect that. But obviously for smaller files, if we think about, you know, a small 20 liner, you will often end up with something quite a lot bigger. But, you know, it took a whole, not seven, five years it seems, I mean, I'm sure there were other intermediate formats, but this is the next one I'm looking at. Five years later, LZSS was introduced, which made a couple of observations about things they wanted to change. They said, you know what, instead of making everything a tuple as a back reference, let's do something else. Let's just emit a single bit, where a bit of zero means the next byte is a literal and a bit of one means the next three bytes in this example are one of those back reference tuples that we want to actually copy paste from what we've already decoded earlier. So where it used to look like this in its binary representation, you would now have these individual bits in between the bytes, so to speak, to tell you how to interpret the bytes that come afterwards. A downside is of course that, you know, now you're not byte aligned anymore. You have to work at the bit level and shift around, and it works, but it gets kinda messy. And, you know, it also gets slower, because reading bits is slower than reading byte-aligned bytes and everything around. And so they did something quite clever, I thought, which for some reason I didn't realize when I looked at it, but somebody mentioned it in a blog post that I found: they just put eight of those flag bits together at the start, and now you're back to being byte aligned. So now you look at the first byte and say, okay, I have, say, seven literals coming up and one tuple.
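To make that flag-byte idea concrete, here is a toy decoder sketch. The exact layout, one flag byte followed by eight items where a 0 bit means a literal byte and a 1 bit means a three-byte back reference (two bytes of distance, one byte of length), is only an illustration of the scheme being described, not the real LZSS wire format.

```js
// Toy decoder for the LZSS-style layout described above (illustration only).
// Each flag byte covers the next eight items: a 0 bit means "one literal byte",
// a 1 bit means "a three-byte back reference" (2 bytes distance, 1 byte length).
function decodeToyLZSS(input) {
  const out = [];
  let i = 0;
  while (i < input.length) {
    const flags = input[i++];                            // flag byte for the next 8 items
    for (let bit = 0; bit < 8 && i < input.length; bit++) {
      if (flags & (1 << bit)) {
        const distance = (input[i] << 8) | input[i + 1]; // how far to go back
        const length = input[i + 2];                     // how many bytes to copy
        i += 3;
        for (let k = 0; k < length; k++) {
          out.push(out[out.length - distance]);          // copy byte by byte, so overlaps work
        }
      } else {
        out.push(input[i++]);                            // literal byte, copied as-is
      }
    }
  }
  return Uint8Array.from(out);
}
```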
So you read the next seven bytes, and then afterwards you have three bytes to read, because two of them are the back reference distance and one of them is the back reference length. And then you start over. The next byte will have all your flags, and so on and so forth. And the other optimization that LZSS did is that they said, you know what, a back reference of length one turns what is one byte into three bytes. That's not good. Let's say we only start doing that for any back reference that has length three or above. So now we're at a point where for every eight values, we add one extra byte for flags, but we also add the capability for copy pasting back references. So that obviously seems a lot more efficient, and we have gotten a lot closer to a good compression ratio, even on smaller files. So when did this compression come into it? Is this the thing like PNG? Does that use that format rather than the...? PNG uses deflate, and deflate is based on LZSS, which is why I explained it, because that's what I'm gonna talk about next. Basically, LZSS, or all the LZ algorithms at the start, the 77 and 78, they kinda want to avoid repetition and either use a dictionary or this kind of back referencing to avoid sending what's already been sent. And that's only one piece of the puzzle, because if we look at our JavaScript file, you can see that not all characters are created equal. You can see, you know, the E appears quite a bit in our service worker file, while the X and the Z don't appear at all, or something like a Y only appears, I guess, once or twice, I don't know. The whole point is, why are we using eight bits to describe the letter E when it's so frequent, but also eight bits for something that's so infrequent, like the Y or the K? Now, the reason is that this is very easy to code. We know we always read eight bits at a time and that is our next character. If we were to, you know, give each individual character a different length, a different amount of bits, how do you know how many bits you have to read for the next character? And if you send the length of the next character beforehand, you're probably undoing all the savings you were just trying to make. So now you're going to talk about the thing I mentioned earlier, which is Huffman or arithmetic encoding, right? Those are the sort of things that deal with that entropy coding side of things. Exactly. These are entropy codings, which basically just means, entropy is a statistical term, and they incorporate the statistical frequency of the symbols, as they are called, that are used in the thing you want to send. So one really interesting thing about trees in general is that if you assign bits to the edges of the tree and letters to the leaves, you now have a way to describe each leaf in the tree with a unique code that is self-terminating, as it's called, because just by traversing the tree you know when you have reached a leaf, and you have just decoded a letter and you can go back to the root and start all over again. You don't need to encode the length of the symbol itself, of the code itself, because it is self-terminating. It tells you when you are done decoding a letter. And this principle is what Huffman trees, now we're finally talking about them, are based on. They were actually already a thing way before the actual redundancy compression algorithms. Wow, I didn't realize they were that old. Wow. Yeah. So the basic principle, I'm gonna give a quick algorithm of how to construct a Huffman tree. That's not how Scrabble works.
That's not how Scrabble, oh, it would be nice. As you said, like for entropy, you need to know the frequency of each letter that appears. And for that, you basically have to have the entire message. So for compression, you need to have everything in memory and count how often each individual letter, byte, symbol, whatever you wanna compress, appears. So in this example, I'm just limiting it to four letters, and we have a lot of Es, a couple of Ss, a couple of Ks and almost no Ns. Now, to create a Huffman tree, you sort your frequencies. So you sort the letters by frequency. So you put the lowest one at the right, the most frequent one at the left, and you group the lowest two into a common node and add up the frequencies. That's basically how you start constructing a little tree. So now you have an array of two literals and a tree. And now you start over, you keep doing this over and over. You sort and you group the two least frequent ones. You sort and you group. And then at some point you end up with only one single element left, which is a tree. And that is how we arrive at the tree. And what is interesting is that this is actually proven to be optimal. Huffman codes are an optimal encoding for the frequencies given. Now, optimal in the sense that you know the frequencies perfectly beforehand and that each frequency, or probability, as they call it, is independent. Now we know that in the English language, it is very unlikely for a J to come after the letter X. So the probabilities usually depend on the text that came before. So in real life, the Huffman code isn't necessarily optimal, but it is still very good. And as you mentioned, arithmetic coding is an alternative encoding. It's harder to implement, but can get you closer to optimal encoding in more of the cases. Yes, and Huffman is what powers JPEG. The lossless part of JPEG encoding is Huffman. Huffman or arithmetic? It's Huffman, although arithmetic is in there as well. But at the time that browsers came to implement JPEG, or while everyone was implementing JPEG, there were patent issues around arithmetic coding. You know, although JPEG does support arithmetic encoding, which would be smaller, it would produce smaller sizes, no browser supports it, historically for that reason. And really, we've got newer formats now, so there's no point in adding it. Interesting, I didn't know that. So now we can basically take a file, count which byte values appear more frequently and less frequently, build a Huffman tree and encode it. And now we just emit these bit sequences. And again, this by itself can already do quite a lot of damage. For example, I think I applied it to the service worker file and did only Huffman encoding, and it compressed the data down to 70% of its original size, which is pretty good. However, that is without sending the tree. And you obviously need the tree to decode the bit sequence, because otherwise it's just meaningless bits. It actually looks like gibberish. Now the problem with that is that these dictionaries are quite extensive. In this case, I only have 26 for the alphabet, but if you imagine we have a bit sequence for every possible byte value, that means we have 256 bit sequences of different lengths, and that will take up quite a bit of space to encode and can again undo a lot of the compression. So only on bigger files you can probably break even or even get an advantage from it.
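A sketch of that construction, merge the two least frequent nodes until one tree remains, then read the codes off the edges, could look something like this. The frequencies are the toy example from above; the function name is made up.

```js
// Sketch of Huffman tree construction: repeatedly group the two least frequent
// nodes, then walk the tree, labelling edges 0 and 1, to get each symbol's code.
function buildHuffmanCodes(freqs) {                 // e.g. { e: 9, s: 4, k: 3, n: 1 }
  let nodes = Object.entries(freqs).map(([symbol, freq]) => ({ symbol, freq }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);          // least frequent first
    const [a, b] = nodes.splice(0, 2);              // take the two rarest...
    nodes.push({ freq: a.freq + b.freq, left: a, right: b }); // ...and group them
  }
  const codes = {};
  (function walk(node, prefix) {
    if (node.symbol !== undefined) { codes[node.symbol] = prefix; return; }
    walk(node.left, prefix + '0');                  // each edge adds a bit, so every
    walk(node.right, prefix + '1');                 // leaf gets a self-terminating code
  })(nodes[0], '');
  return codes;                                     // frequent symbols get shorter codes
}
```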
And this is where I learned something that I also didn't know before, namely that in the deflate spec itself, they give you an algorithm where the only thing that you need to send is the length of the bit sequence for each symbol. That is enough to actually reconstruct the entire Huffman tree and, as a result, the entire dictionary. And that is obviously a lot easier to put into a little file than an explicit mapping of symbol, bit sequence, and bit sequence length. That's very smart, I like that. It's an RFC and it looks intimidating, but trust me, it is actually fairly approachable. The algorithm is on the bottom half here, with three steps, and has some pseudo C code. I think for most people, it will be quite easy to read. All right, and now to the big reveal, deflate. Deflate is pretty much the combination of LZSS and then running Huffman on it. So basically what we do, we do our LZSS. We try to find byte sequences that are recurring and replace them with tuples, or with back references. But here's what's interesting, because we are now doing the Huffman encoding as a second step. In our previous Huffman example, every value we decoded was just a literal. So if we decoded value 255, we knew, okay, we have to emit byte 255. If we decoded value two, we knew we had to emit the literal for the value two. But we're not bound to the limit between zero and 255 anymore, because we're gonna Huffman compress it anyway. So what they're doing, and I think it's really clever in deflate, is that they're using the values above 255 as well, basically as instructions for the decoder. So for example, 256 signals that we have reached the end of the stream, or the end of a block. We're not gonna talk about blocks, but it's the same thing effectively. And the even higher values stand for the back references. So if you decode a value of 257, that means you just found a back reference of length three, and so on and so forth. Now obviously, as I said, our back references can have lengths up to 256 themselves. So they're not gonna add another 256 symbols. Instead, they realized that the bigger back reference lengths are commonly very unlikely. So they started to create groups, where they're saying, okay, for example, 265 means you have found a back reference of either length 11 or length 12, and we're gonna put one extra bit in the bit sequence to let you know which one of the two it is. Later on, they make the groups even bigger, and so on and so forth. But this I thought was a really interesting revelation, in that the Huffman encoding allows you to use more values than just eight bits per instruction, so to speak. That's very smart. I like that. I didn't know that's how it worked. I figured it would be like one byte to say, I'm now going to give you multiple bytes worth of data. Like it's similar to Unicode, I guess. But yeah, this is a much smaller, neater way of doing it. I like it. Yeah, so this table with the exact mappings can again be found in the spec, in case you are interested in that. So this is where we've gotten so far. We have a raw byte data stream from our file. We run it through LZSS. And then that resulting stream of literals and back reference tuples is Huffman encoded, and we create a Huffman tree specifically for the frequencies as they appear in the file. So if a file contains a lot of zero literals or a lot of back references, those will get a smaller bit sequence than something that barely ever appears.
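Here's a sketch of how a decoder might interpret that extended symbol range. Only the values actually mentioned in the episode are spelled out; the full base-length and extra-bits table is in the deflate spec, RFC 1951, and readExtraBits is a hypothetical helper that pulls more bits off the stream.

```js
// Sketch: interpreting a decoded literal/length symbol in deflate's scheme.
// 0-255 are literal bytes, 256 is end-of-block, 257+ encode back-reference lengths,
// with bigger lengths grouped and disambiguated by extra bits read from the stream.
function interpretLengthSymbol(symbol, readExtraBits) {
  if (symbol < 256) return { type: 'literal', byte: symbol };
  if (symbol === 256) return { type: 'endOfBlock' };
  if (symbol === 257) return { type: 'length', length: 3 };      // no extra bits needed
  if (symbol === 265) {
    // Group: length 11 or 12; one extra bit from the stream says which one.
    return { type: 'length', length: 11 + readExtraBits(1) };
  }
  // ...the remaining symbols up to 285 follow the same base + extra-bits pattern,
  // with the exact values listed in the table in RFC 1951.
  throw new Error('symbol not covered in this sketch');
}
```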
So that's why it's a compression that is very adaptive to the individual contents. As I said though, the back references are not only a length, but also a distance. And for that, they do the exact same thing. They create a separate Huffman encoding just for the distances, because they have a very different distribution, so they found it's worth creating its own Huffman tree. And again, they do the same grouping scheme as with the lengths. And so in the end, what you end up with is, you do your LZSS, you do separate Huffman encodings, you get two different Huffman trees, and you kind of interleave the codes from one tree and the other tree on the bitstream. That's cool. So the Huffman trees will, I guess, presumably be at the start of the file, or is it? That's what we're gonna talk about next, because obviously, as I said, you need to send them to be actually able to decode these bitstreams. And so this is what you could do. You could just put the Huffman tree for the literals and lengths at the start. And we know just sending the lengths of the bit codes is enough to reconstruct the entire Huffman tree. Then you put the tree for the distances next, and then you put the bitstream. And again, that's what I said, and you're gonna tell me why that's not good enough. Well, it turns out the Huffman tree for literals and lengths has 286 possible values. The Huffman tree for distances has 30 possible values. That's roughly 300 bytes just to declare how to decode before the content even starts. And that is quite big. And especially in the age where this compression was specified, I think 300 bytes was quite a considerable amount of space to, well, not waste, but use. And so what they're doing is, they're Huffman encoding the Huffman trees. Of course they are, of course they are. So you have these sequences of numbers that just define the code lengths of the literals, which is enough to create the tree. But these numbers are now a lot smaller. There's very little variance between them. There's a length four, length five, length four, length zero, length one. And even this Huffman tree for the Huffman trees has some instructions. It has a special code for "repeat the last code length three times" or "fill the rest with zeros". So those instructions are also in there. And again, that's all in the spec. But this actually does reduce everything quite a bit. And that is how deflate works. Deflate has a bit more capabilities. It has a version where a predefined Huffman tree pair is used, so you don't have to encode any tree at all. It will often not work as well, because it is not adapted to your specific content. But yeah, that is basically how deflate works. Does deflate give you the option of providing your own tree, or is it always just a tree it has built into the format? No, so it has three modes. It has no compression at all, just like concatenate a couple of streams in there, basically. It has use the predefined Huffman trees as they are in the spec. Or, and I think that's the default mode, build bespoke Huffman trees for the content that I'm giving you. Gotcha, yeah, okay, that makes sense. But what I'm showing here is this bespoke version, because I think that's the interesting one. That's the one that really squeezes as many bytes as possible out of your content without actually introducing any loss or leaving many bits on the ground. So yeah, we made it halfway, Jake. You now know deflate, let's talk about Gzip. Gzip is deflate, we're done. Excellent, brilliant. Yes, no, I think I knew this bit.
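That lengths-only trick is worth seeing in code. This is a sketch of the three-step algorithm from the deflate RFC: given just the code length for each symbol, encoder and decoder can both rebuild identical Huffman codes. The function name is made up.

```js
// Sketch of the RFC 1951 algorithm: rebuild the Huffman codes from code lengths alone.
function codesFromLengths(lengths) {          // lengths[i] = bit length of symbol i's code
  const maxLen = Math.max(...lengths);
  // Step 1: count how many codes there are of each length.
  const blCount = new Array(maxLen + 1).fill(0);
  for (const len of lengths) if (len > 0) blCount[len]++;
  // Step 2: work out the smallest code value for each length.
  const nextCode = new Array(maxLen + 1).fill(0);
  let code = 0;
  for (let bits = 1; bits <= maxLen; bits++) {
    code = (code + blCount[bits - 1]) << 1;
    nextCode[bits] = code;
  }
  // Step 3: hand out consecutive codes to symbols, in symbol order.
  return lengths.map((len) =>
    len === 0 ? null : (nextCode[len]++).toString(2).padStart(len, '0')
  );
}

// e.g. codesFromLengths([3, 3, 3, 3, 3, 2, 4, 4]) reproduces the worked example in the RFC.
```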
It's just a wrapper format around it, right? It pretty much is. They add a bit of smarts, as far as I know, a block scheme so that you can actually change compression for different parts of the stream. They have a couple more headers, and they have a checksum at the end for integrity checks. But in the end, it is deflate, which is interesting, because the fact that browsers advertise support for both basically means they understand the output of two different libraries. Deflate stands for the zlib library format, while Gzip stands for the actual Gzip-specified format. Now, Jake, I hope you have a slightly better understanding of why there are sometimes cases with JavaScript, but most importantly with CSS, I see this happening a lot, where using a minifier can make your file bigger after Gzip than without the minifier. Oh, yes, because the minifiers on CSS don't do a lot. And I guess that's just because it's removing whitespace, right? Is that the... Yeah, I think so. Maybe sometimes some grouping or stuff like that. But in the end, that sometimes removes certain repetition that was there before and makes it less compressible, which I think is really interesting. And I think that's what was happening in cases where I've tried to manually compress JavaScript, right? I mean, even though I know these compression formats work on repetition, I'll feel like, well, I've definitely made that smaller, or I've turned a function into an arrow function and that seems smaller, but yes, it hits these paths where it's not able to do the repetition as much, or it means that function for whatever reason is appearing less frequently, and that changes the whole makeup of the Huffman tree and changes how the rest of the file compresses in some way that I didn't expect. And now I know. Yeah, and if you wanna know more about compression, our colleague Colt McAnlis made a whole series called Compressor Head back in 2013, 2015, where he explains all of this in more depth. And if you think that sounds like such a long time ago that it's not gonna be relevant, remember some of these compression techniques are from the 50s, so it doesn't matter that much. And I think that's exactly it. It seems like most modern compression formats are using those inventions, recombining them in very clever ways and transforming your input before it goes through one of these old compression algorithms, and that's really how they make it better. And so it's definitely worth watching. So check that link out in the description. We do the clap and then professionalism descends upon us. Is that what you call our show? A show of professionalism. It's something, isn't it? It's not professionalism. It's not that, it's many things, but it's not that.
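As a small appendix to the "Gzip is deflate plus a wrapper" point above: per RFC 1952 the container is a short header, the raw deflate stream, and then a CRC-32 plus the original size for the integrity check. Here is a rough sketch, assuming a gzip file with none of the optional header fields set; peekGzip is a made-up name.

```js
// Rough sketch of the gzip wrapper (RFC 1952): header, deflate stream, CRC-32 + size.
// Assumes no optional header fields (FLG = 0); `bytes` is a Uint8Array of the whole file.
function peekGzip(bytes) {
  if (bytes[0] !== 0x1f || bytes[1] !== 0x8b) throw new Error('not a gzip stream');
  if (bytes[2] !== 8) throw new Error('unknown compression method'); // 8 means deflate
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    deflateStream: bytes.subarray(10, bytes.length - 8),  // the actual deflate payload
    crc32: view.getUint32(bytes.length - 8, true),         // checksum of the uncompressed data
    originalSize: view.getUint32(bytes.length - 4, true),  // uncompressed size mod 2^32
  };
}
```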