So welcome, everyone, to the "Let's Shrink the Package Archive" talk by Hideki Yamane. Here you go.

Thanks. Thanks for coming, everyone. This is my first talk at DebConf, so I'm a little bit scared. So, this is today's agenda: the size of the repository, how to solve the problem and improve the size, and the next steps.

Many of you know that Debian has many, many, so many packages, supporting many CPU architectures and several kernels: Linux, FreeBSD, and the Hurd. So, does anyone know the size of the whole Debian repository? Anyone? About 665 gigabytes. Yes, it's huge. And I think we can improve this size. We can shrink it. Some ways: drop supported architectures, or delete packages. "Yes, we can!" No, no, no. But most of you don't like such delete solutions: "don't drop my architecture", "don't delete my package". So we should search for and find another way to improve this situation.

Many of you know there is an ongoing discussion on the debian-devel mailing list about using the xz format. Now the default compression is gzip, and xz can reduce the size. For example, my Japanese font package, compressed with gzip at maximum compression, is about 40 megabytes. Using xz, it shrinks to 25 megabytes, megabytes, not gigabytes. So it's really effective. And on top of that, if you use xz at the maximum compression settings, can you see it? Almost 6 megabytes. And I already uploaded it to the Debian repository.

But there is a warning, from the debian-devel-announce mailing list: "Please only use xz if your package really profits from its usage. It may be a problem, an especially bad problem, on slower architectures." Yes, it's a warning. Slower architectures are a problem; it will eat a lot of CPU time.
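The kind of size gain described above can be sketched with Python's standard-library compressors. This is only an illustration with made-up sample data, not the actual font package from the talk:

```python
import gzip
import lzma

# Illustrative stand-in for highly compressible package data
# (repetitive text compresses well under both schemes).
data = b"The quick brown fox jumps over the lazy dog.\n" * 20000

gz = gzip.compress(data, compresslevel=9)                   # like gzip -9
xz = lzma.compress(data, preset=9)                          # like xz -9
xz_e = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)  # like xz -9e

print(f"original: {len(data)} bytes")
print(f"gzip -9 : {len(gz)} bytes")
print(f"xz -9   : {len(xz)} bytes")
print(f"xz -9e  : {len(xz_e)} bytes")
```

On data like this, the xz output comes out well below the gzip output, which is the whole motivation for the proposal; the exact ratios depend entirely on the data.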
So, for example, MIPS: if we apply xz on MIPS without thinking, we get a miserable result. The slower architectures are bad, but the faster architectures are fine. Many of you are using the powerful Intel or AMD architectures, so I think using xz on the Intel and AMD architectures is good. So I tried recompressing the whole Intel and AMD architectures, plus "architecture: all", with xz, to see how much it shrinks.

Before applying xz, this is the size of each archive: "architecture: all" is 57 gigabytes, amd64 is 55 gigabytes, and hurd and kFreeBSD are around the same size. And after applying xz, yes, we can shrink it. How much can we shrink it? About 100 gigabytes. The whole difference is 104 gigabytes, and the reduction rate is 35%.

And I got a log file from one of the Japanese mirrors. I used ftp.jp.debian.org; it uses a kind of CDN system, like cdn.debian.net, but most of the traffic goes to one host, named ftp.jaist.ac.jp. The log has many, many lines, and I analyzed which architectures were downloaded. The Intel and AMD architectures, plus "all" and source, are the biggest. The total download traffic is 83 terabytes. So, if we apply xz, we can cut 24 terabytes of download traffic. That's a benefit for mirror admins.

And there is the download speed issue, I think. Several countries have slow download speeds. I live in Japan, and the download speed there is about 1,364 kilobytes per second. And here in Nicaragua, 180 kilobytes per second. Some countries have a good download rate: if you live in Korea, so fast, we should go to Korea. Or the Eastern European countries: Romania, Bulgaria, and, sorry, Latvia have a good rate. The United States, almost 600. Germany as well. Japan is a little fast; my result was almost 6 megabytes per second. And Nicaragua, as I said, is not so fast. The world average is almost 600 kilobytes per second.
So, if you want to build a huge package in Nicaragua, you have to wait a long time for the download. xz can cut the download time; it's a benefit for Debian users, and of course for Debian developers, package maintainers and so on.

So, the archive size is almost 600 gigabytes, and if we use xz, we can shrink that; the numbers say it's effective. The problem on slower architectures can be avoided if we apply xz only to the Intel and AMD architectures plus "architecture: all". We can cut 100 gigabytes of archive and 24 terabytes of download traffic. It's a benefit for mirror admins, Debian users, and Debian developers.

But there is a trade-off: we should think about decompression. My test machine is an Intel Core i5 with a lot of memory; I usually use it as a desktop. The first test just extracts with tar: one tarball uses gzip and the second uses xz. The result: xz was faster than gzip. I retried it 100 times and the result was the same, xz faster than gzip. But that's a rare case. It's rare.

Then I tried the biggest package in the archive. openclipart is the biggest one: one package of about 600 megabytes. Yes, it's true. I wrote a simple shell script to extract the data, and with gzip it takes 10 minutes. And if we decompress with xz, it takes... can you see? Can you see? Wow. We should wait one hour. Almost 7 times longer than gzip.

That was an "architecture: all" package, so next I checked a package that is not "architecture: all": the Linux image package. This is the result: almost 2 times longer than gzip. xz is slow. And the Linux image package already uses xz; the other architectures, sorry, I didn't check.

But installing a package is not only extracting the data, so I tried really installing packages with dpkg. The good case, my font package again: it takes almost the same time. The normal case takes 2 times longer, but I think we can ignore that difference.
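The decompression trade-off the talk measures with tar and dpkg can be roughly sketched with Python's standard library. This is only a toy benchmark under assumed sample data, not the talk's measurements, and the timings will vary by machine:

```python
import gzip
import lzma
import os
import time

# A few megabytes of mixed data: some incompressible, some repetitive.
data = os.urandom(1 << 21) + b"clipart" * (1 << 18)

gz_blob = gzip.compress(data, compresslevel=9)
xz_blob = lzma.compress(data, preset=6)  # xz's default level

def timed(decompress, blob):
    """Return (output, seconds) for one decompression run."""
    start = time.perf_counter()
    out = decompress(blob)
    return out, time.perf_counter() - start

gz_out, gz_time = timed(gzip.decompress, gz_blob)
xz_out, xz_time = timed(lzma.decompress, xz_blob)

assert gz_out == data and xz_out == data  # both round-trip correctly
print(f"gzip: {len(gz_blob)} bytes, decompressed in {gz_time:.3f}s")
print(f"xz  : {len(xz_blob)} bytes, decompressed in {xz_time:.3f}s")
```

On most hardware the xz decompression run takes noticeably longer per byte than the gzip one, which is the cost the talk weighs against the smaller downloads.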
And now the worst case: the openclipart PNG files take 8 times longer. A normal package installs in 5 seconds, but with xz it takes 40 seconds. So in practice xz decompression is slower than the default gzip; it's rarely faster. It usually takes 2 to 8 or 9 times as long as gzip, depending on the data. The other architectures I have not checked.

And from the download log file: some packages are downloaded very many times. The Linux image package, or OpenOffice, now LibreOffice, are downloaded a lot. And eglibc, texlive, evolution and so on; their total download size is huge. And some packages, krb5, cups, mono, bind, avahi, are downloaded many times. So on ftp.jp.debian.org, the top 50 packages account for most of the traffic. So I think applying xz to the top 50 packages is the first step. Next, modify debhelper to apply xz on the powerful architectures by default. Then, maybe, binNMUs or a mass rebuild or so.

So, again, the archive size is 600 gigabytes. If we use xz on the Intel and AMD architectures, we can cut 100 gigabytes of archive, and it also cuts 24 terabytes of download traffic per year. So I want to apply xz if we can. That's the end of my presentation. Does anyone have a comment, thought, or question? And please speak slowly and clearly; I'm not good at English.

Okay, I've got a question. Can we have both gzip and xz? Both, so people can choose their preference when they download. People who care about speed can download the xz, and people who care about their CPU performance, like if they're repeatedly building things, bootstrapping or something, can download the gzip.

That just makes it worse. There's no point whatsoever in doing that. If you're trying to apply xz to make things smaller, providing it as well as gzip is ludicrous, surely. Yeah, the mirrors would be totally bloated.
We already eat, as you've seen, two-thirds of a terabyte of disk on every full mirror on the planet. We don't need to eat even more of people's disk space.

Okay, thanks. So, what do you do with "architecture: all" packages? When you compress them with gzip, you don't get the maximum advantage on fast architectures, and when you compress with xz, you pay too much on slow architectures.

Please repeat the question; I want to think about it. Okay: so what do you suggest doing with "architecture: all" packages, compress with gzip or compress with xz?

I think most "architecture: all" packages are built on machines with the Intel and AMD architectures, the powerful ones. So I suggest applying xz for "architecture: all".

But then the problem is that decompression will be slower on slow architectures.

But I think the big "architecture: all" packages are mostly things like fonts or clipart, for the desktop, I think. It's not a problem on slow architectures. If you want to use MIPS for the desktop, it's a problem. Yes.

I use ARM desktops. Is that a problem? But more seriously, as a maintainer of a slow architecture, I don't actually wildly care that xz is slower; this doesn't bug me. What we do need to watch out for is packages like openclipart. I highly recommend that dpkg gets a no-compress option, because that package shouldn't be compressed at all. The PNGs in it are already compressed about as far as they can go, and you can tell that from your comparison between gzip and xz. I'll bet you that gzip versus a raw tar is still almost the same. So that solves a lot of those problems, if you can identify packages like that and just not compress them at all.

So there is a problem with some packages, I think. If we find a problem with a package, we change its compression option. By default we use xz, and if we find a problem, we change it to gzip.

The thing you forgot to tell us was how we change the compression method.
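The point about openclipart's PNGs can be illustrated with Python's standard library: already-compressed data is statistically close to random, and a second compression pass gains nothing under either scheme. Random bytes stand in for PNG payloads here; this is an illustration, not the talk's measurement:

```python
import gzip
import lzma
import os

# Random bytes stand in for already-compressed PNG payloads.
png_like = os.urandom(200_000)

gz = gzip.compress(png_like, compresslevel=9)
xz = lzma.compress(png_like, preset=9)

print(f"raw : {len(png_like)} bytes")
print(f"gzip: {len(gz)} bytes")  # no gain; slightly larger than the input
print(f"xz  : {len(xz)} bytes")  # no gain either, and far more CPU spent
```

Both outputs come out at least as large as the input, so for such packages the only effect of xz is the extra CPU cost at unpack time, which is the argument for shipping them uncompressed or with plain gzip.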
I've discovered there's a post from Raphael with the answers in, but that's a useful piece of information: how do you change the compression option?

Just modify the debian/rules file; it needs only one line. Yes, like the... yes. Oh, sorry. Wait, wait a minute. Yes: there is the dh_builddeb option to use xz, and the extreme option, with compression level 9, the maximum. So for now we should specify xz as the compression option, but if we change the default to xz, we don't need to specify it. And for a package like openclipart, after we apply xz by default, we would change that xz back to gzip.

What's the impact on the buildd machines? How much more CPU are they likely to require?

Probably more, yes, but it's a trade-off: download time, download traffic, and archive size against CPU time.

And listen, if we apply it to every package, even those that don't really gain from xz, it doesn't make sense. On one of your slides, you said that basically by modifying just a few packages, you reduce 50% of the download for your end users. In a way, that actually answers the question.

Just thinking: rather than having a no-compress field, why can't the packages concerned use the dh_compress options that are already there? Why aren't packages like these already using the existing support? You've got control over dh_compress in the debhelper parts of the rules file; you can specify which files get excluded and which ones get compressed. Why aren't packages using it already, if it's such a problem? The openclipart one?

So at the moment, no: dpkg does not have, at least to my knowledge, support for an uncompressed data.tar. So you can tweak the options to dh_compress all you like, but it's not going to help.
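The one-line change discussed above would look something like this in a debhelper-style debian/rules file. This is a sketch, not taken from the talk's slide: the flags after `--` are passed through to dpkg-deb, where `-Z` selects the compressor and `-z` the level:

```make
# debian/rules (excerpt): pass compressor options through dh_builddeb
# to dpkg-deb; -Zxz selects xz, -z9 the maximum compression level.
override_dh_builddeb:
	dh_builddeb -- -Zxz -z9

# For a package like openclipart that gains nothing from xz, the same
# hook could select gzip instead:
#	dh_builddeb -- -Zgzip
```

If xz ever became the debhelper default, the override would only be needed in the opposite direction, for packages that should stay on gzip.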
The point is, as Adam said, and I was just about to say the same thing: for a package full of PNGs, we really should have an extra option to just include the data.tar uncompressed. Yeah, exactly. And the thing about dh_compress is that it compresses files inside the package, and you don't necessarily want the PNGs compressed if the things that use the PNGs cannot use them compressed.

Right, and as far as the buildd impact goes, a lot of that is solved by not passing -z9. The default xz compression level of 6 is actually fairly well behaved and won't cripple a buildd with half a gig of RAM. With -z9, you can sit there building a deb for about six hours. It's really entertaining.

That was for my font package, and I have a lot of RAM in my machine, so it's not a problem for anyone, I think. If you want to rebuild my package on a slower machine, it's a problem. But you don't need to rebuild the package I built on my machine, so it's not a problem for you.

I'm just reading the dpkg-deb man page, and -Z compress-type accepts the option "none", but maybe it's the policy and the archive which don't accept uncompressed debs. Okay, I'll just play around with it locally then.

Yeah, just two things, very quickly. First, on uncompressed tar files: tar files don't actually have a checksum, as far as I know, whereas anything that's compressed with gzip or bzip2 is checksummed. There was some mention of the clipart not having... no, is that not true? A deb file is not just a tar file, but for source packages, they do have to be compressed with something with a checksum. It can't just be...

But the other thing is on the issue of whether we have both compression mechanisms. I'm not thinking about this from the point of view of the person paying for the disks or the bandwidth; I'm thinking about it from the user's point of view, giving the user the choice of what they want to download.
I'm not saying that we should be biased towards giving the user that choice, but that's the reason I raised it. It's purely looking at it from the point of view of the user.

Along... yeah, the subject of gzip did come up in a conversation around the pool, which may have been slightly wrong. And someone suggested that it might be possible to provide a checksum for the gzip that you would have created, if you were creating one, such that someone could then apply some magic locally on their site to recode certain packages for their very slow machines. I've no idea if that's technically possible, but if you have a predictable way of running gzip, then you could put the checksum in ready for them, and they could do something to munge their local repository. But yeah, I don't think we're going to do it.

Yeah, Phil, that sounds utterly insane. Well done.

Going back to one of your end slides... yeah, that one. "Exclude priority: required". Why?

So... it's a little bit... it's from the past discussion: we should exclude priority: required when applying xz. But now maybe we can cut that line.

Yeah, I mean, there's been some discussion already, as I'm sure you've seen, in terms of trying to fit things on CDs, which is my area of Debian, and so we've already been talking about using xz a lot more. The thing that comes out of that is apparently debootstrap: people don't want to make it depend on xz support. The only argument I've seen for that is that we don't want to cut off people who are trying to run debootstrap on a non-Debian system. I'll be honest, I couldn't care less. I think we should just say xz by default, and people wanting to run debootstrap elsewhere will need an xz program.

The last idea about that I have heard was that we could patch debootstrap to download xz and use that if it's not available on the native system. So that would require that only xz, and the libraries it depends on, are compressed with gzip.
That would be, I think, three packages.

How does that work on non-Debian systems? That's the issue. Okay, for a random SPARC binary, or that kind of thing? No, okay, a random SPARC Solaris binary. You just have to make debootstrap look for xz, and if it isn't available on the system, then error out and tell the user; it's the only sane thing to do.

I was just covering the same point: you can only do that if the thing you download is for an architecture you can execute at that end. debootstrap is a shell script for a reason. If you're using it on an architecture that can't run, or isn't supported in, Debian, then you couldn't use that option with debootstrap. The issue with debootstrap is we also try to encourage people running on weird and wonderful, say, Red Hat systems and whatever, to run it too, and that's even more of an issue.

Well, first of all, you don't compress the xz package with xz. And second, why can't we provide a static xz for those Linux-like machines of the same architecture, such that they can run it and do their debootstrap? We can distribute xz, right? Because that will cover most of the use cases for debootstrap on foreign systems. And the one guy who cannot run a Linux-compiled static xz provided by Debian, well, you know, what are the chances that debootstrap succeeds, or he or she can just get xz themselves. Because we already provide loads of udebs, and the Debian installer, and other statically compiled things. So xz is one of those things that is needed to get Debian running.

But of course, the rest of that stuff is already within the Debian archive, and is checksummed and signed and everything. We don't necessarily want to be providing random static binaries of xz to people; we have no idea about the systems they're going to run them on. You're suggesting that we could just provide it as a deb that is compressed with gzip, and then people can grab it from wherever. The complexity involved, I think, is unwarranted.
We just say to people: you don't seem to have xz, go install xz and come back. To me, anyway, that's much saner and much simpler. That's all. Yeah, a static QEMU is way, way more complicated than we want to get into here.

Okay. priority: required seems to be 59 packages on my system. Is it worth... the required packages, it's just 59 packages on my system. Is it worth it?

Well, of course, but they're on every single CD we ever ship, on every single DVD. They're in every single installation of every Debian system ever. That sounds like a reasonable set of packages that we really want to compress well. They may not be large, but I think it would be sensible to go for the numbers. Yeah. I don't see any reason, to be honest, that we should be special-casing them.

Anybody else? Do we know how much we're saving, how much those 59 packages actually gain? Because presumably they're all the small stuff that we require; they're tiny packages, aren't they? So maybe it's not worth it compared with all the other stuff.

Maybe. We don't... I don't know off the top of my head. Anyway, what we do know is that Ansgar has already been looking into using xz for a smallish core of packages, and we can save a lot of space, to make CD 1 work for KDE and GNOME. I would love to be able to make Debian smaller so we can actually have a sensible set of CDs and DVDs. This looks like a cool way of going.

Maybe one other thing I just wanted to add is that the point of us doing a binary distribution is that we spend the time on the buildds. So if our compression is expensive, we still do it once, and not once per user. So we don't care.

udebs are compressed with xz by default now. So why not do priority: required as well? It's another step towards that end of the argument.

For udebs, we can ensure that there's always an xz decompressor available, which we cannot guarantee for debootstrap on non-Debian systems. So that means...
Yeah, that's fair enough. The other thing I wanted to come back to, again, about the compression and whatever, again as a porter of one of our slow architectures: really, if it takes longer to compress, we can find more buildds. Really, that's not going to hold us back.

Any other questions? Then I think, for a great discussion, we all should say to Yamane-san: arigatou gozaimasu. Thank you.

Oh yeah, thank you very much. You're welcome. So, it's time to close this BoF. If you have any comments, post to the debian-devel mailing list; if you have any blame for me, email me in private. Thanks for coming. Thanks.