I want to talk a bit about delta upgrades today. So why would you want to use delta upgrades? If you're an end user on a slow connection, obviously, with delta upgrades you have less to download, and it's faster. If your connection is capped, you have less to download, which means you hit your cap less often, or later in the month, or whatever the period of your cap is. For mirrors with many clients, it's obviously an advantage because they get less traffic from upgrades. And for the archive, I don't know. And for me, it's just fun to write the code.

So what about debdelta? I would say it's complicated. What I mean is, you can see that on the next slide. There we go. This is a debdelta; that's the existing format that has been around for a few years. It consists of an info file, which contains some metadata, a set of numbered patch files, a script, and an inline signature. And the problem with the debdelta format is basically the script, because the script could do anything. You can't check that the script is safe, and it's not a good idea to run a shell script just to regenerate a deb from a delta.

So we can improve on that. Let's start from scratch and say a delta deb is a deb of deltas, and what I mean by that is essentially what it sounds like. This is what my idea of a delta deb looks like. It's essentially a normal deb file, with the difference that in the data tarball we replace the actual files with deltas against the old files. For conffiles, we store a delta that contains the entire file, so applying the delta for a conffile behaves just like extracting the entire file does. And we can add that to dpkg easily by changing a single bit in the code that writes the new files to disk, which means we can just dpkg -i it, so it installs like a normal deb.

Each file delta is represented using a bsdiff variant. On the left side, you can see the normal bsdiff format, which consists of a header followed by control blocks. Each control block contains a number of bytes to read from the diff section, a number of bytes to read from the extra section, and a seek offset; the diff section and the extra section sit at the end of the file. The extra section is data from the new file that is not covered by the diff, and the seek offset is the number of bytes to seek in the old file. The data is bzip2-compressed, so the delta is relatively small, but bzip2 is relatively slow compared to faster compression algorithms, and it compresses worse than better compression algorithms, so it's not an optimal solution.

Instead, I reorganized the whole file a bit: I put the diff data and the extra data directly after the control block that refers to them. You can now read the file sequentially, which is a nice improvement, because the file gets sent to dpkg via a pipe, and with the old layout we would have to keep the entire per-file delta in memory to be able to seek in it. In the old format, we would have to seek around the bsdiff file because the diff data and the extra data are at the end: after reading a diff size, we would have to seek to the diff data section, and then seek again to the extra data section to read the extra data. With the new format, we just read the per-entry header, the (diff size, extra size, seek offset) triplet, then read diff-size bytes of diff data. And what we do with the diff data is add it to the old file's data.
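To make that concrete, here is a minimal sketch of the sequential apply loop in C. The struct layout, field widths, byte order, and function names are my own illustration, not the actual on-disk format; a real implementation would also read through dpkg's decompression layer rather than a raw stream.

```c
#include <stdint.h>
#include <stdio.h>

/* One control triplet, as described above: how many diff bytes follow,
 * how many extra bytes follow, and how far to seek in the old file
 * afterwards.  Field widths and byte order are illustrative. */
struct control {
    uint64_t diff_size;
    uint64_t extra_size;
    int64_t  seek_offset;
};

/* Apply one sequential delta: `patch` is read strictly front to back
 * (so it can be a pipe), `new_file` is written strictly front to back,
 * and only `old_file` needs random access. */
static int apply_delta(FILE *old_file, FILE *patch, FILE *new_file)
{
    struct control c;

    while (fread(&c, sizeof(c), 1, patch) == 1) {
        /* Diff section: new byte = old byte + diff byte (mod 256).
         * Unchanged regions produce diff bytes of 0, which is what
         * makes the uncompressed delta compress so well later on. */
        for (uint64_t i = 0; i < c.diff_size; i++) {
            int o = fgetc(old_file), d = fgetc(patch);
            if (o == EOF || d == EOF)
                return -1;
            if (fputc((o + d) & 0xFF, new_file) == EOF)
                return -1;
        }
        /* Extra section: data that exists only in the new file,
         * copied through verbatim. */
        for (uint64_t i = 0; i < c.extra_size; i++) {
            int d = fgetc(patch);
            if (d == EOF || fputc(d, new_file) == EOF)
                return -1;
        }
        /* Reposition in the old file for the next triplet. */
        if (fseeko(old_file, (off_t)c.seek_offset, SEEK_CUR) != 0)
            return -1;
    }
    return 0;
}
```

Note how the loop never buffers more than one byte at a time, which is where the constant-memory apply property comes from.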
As the sketch shows, it's just a simple sum of bytes, because the diff is essentially the new data minus the old data, which means data that is unchanged becomes zeros, and that compresses well with a general-purpose algorithm. In this format, we don't compress the delta itself, so the delta has roughly the same size as the new file, just with a lot of zeros in it for the parts that haven't changed. But we can apply the compression later, in the deb itself: we compress the entire tarball of deltas, which gives us better compression than compressing individual deltas, and we can use any compression algorithm dpkg supports, like xz or zstd or gzip, or bzip2 if you want to. I wouldn't know why, but you can. So we have a much nicer format.

We can compare a few requirements of bsdiff against ddelta. ddelta requires about half the memory of bsdiff when generating a diff; it's still linear in the size of the old file plus the new file. But we can apply a ddelta with constant memory usage, compared to linear memory usage, which seems to be a huge win. We require random access on both the old and the new file when generating the delta, because we have to seek in them. But applying is easier: as you have probably seen from the control data, there is only a seek field for the old file, so we only need random access on the old file, and we can read the patch sequentially from the pipe and write the new file out sequentially, which should be relatively efficient. Mostly on SSDs; on HDDs we might have some problems with caching and seeks on the disk itself.

There's another problem with deltas: because we install the deltas directly to the file system, we need to make sure that the files we have on the system are actually the files we expect to have. My idea here is to introduce a package ID. Basically, we hash the list of files and their hashes, so we get a hash of all the contents of the deb, and then we can check whether the files on the system match the hash we expect. If they do, we can fetch the delta; if not, we just fetch the full deb instead. Or we don't actually have to check that the files really match: we can just store the ID in the dpkg database and do the delta lookup based on the ID. Actually reading the files and verifying that they match the package and that you haven't modified them locally is optional, I think, but it might make sense if you like modifying files in your /usr partition.
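As a rough sketch of that package ID idea, here is what the hashing could look like in C, assuming OpenSSL's EVP interface; the record layout and the exact byte string being hashed are my own illustration, since the real construction isn't fixed yet.

```c
#include <openssl/evp.h>
#include <string.h>

/* Hypothetical per-file record: path plus the hash of its contents,
 * e.g. as recorded in the package's metadata. */
struct file_entry {
    const char *path;
    const char *sha256_hex;
};

/* Hash the (sorted) list of files and their content hashes into one
 * package ID.  If the files on disk still match their recorded hashes,
 * this ID identifies exactly the bytes a delta can be applied against. */
static int package_id(const struct file_entry *files, size_t n,
                      unsigned char id[EVP_MAX_MD_SIZE],
                      unsigned int *id_len)
{
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (!ctx)
        return -1;
    if (EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) != 1)
        goto fail;
    for (size_t i = 0; i < n; i++) {
        /* Feed "path\0hash\n" so entries cannot run into each other. */
        if (EVP_DigestUpdate(ctx, files[i].path,
                             strlen(files[i].path) + 1) != 1 ||
            EVP_DigestUpdate(ctx, files[i].sha256_hex,
                             strlen(files[i].sha256_hex)) != 1 ||
            EVP_DigestUpdate(ctx, "\n", 1) != 1)
            goto fail;
    }
    if (EVP_DigestFinal_ex(ctx, id, id_len) != 1)
        goto fail;
    EVP_MD_CTX_free(ctx);
    return 0;
fail:
    EVP_MD_CTX_free(ctx);
    return -1;
}
```

Storing this ID in the dpkg database at install time would make the later delta lookup a cheap comparison rather than a full re-hash of the file system.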
Now, if you want to find a delta, you could have an index of all deltas, like a Packages file, but it turned out the delta index would be too big: it would need the file name, the hash, and the size, and probably the package name and the version, too. The problem is that hashes don't compress well, so you end up with a file that, compressed, has basically the same size as the Packages files. So I think the better approach is to go the debdelta way and use a consistent naming scheme for the deltas, so we can just try to download the delta, and if it exists, that's nice: we have less to download. We just add a download item that says, hey, I'm going to download the full deb, so I expect the full deb's size, and when we see there's a delta available, we've suddenly progressed much further than we should have. So the progress just jumps up a bit, which is fine. It's not very accurate progress reporting, but it's a nice surprise, I think: if you see there are two hours to go for downloading the updates, and suddenly it's only two minutes, I think you're happy and don't care that it's unreliable.

Then, signing the deltas. There are various options for that: we could provide a detached signature, we could embed the signature, we could even clearsign the delta. But these all mean we have to run GPG for each delta, which might not be entirely efficient, and it also means we have to implement signature verification specifically just for deltas. So one thing I was thinking about is to have something like an InRelease file, which is essentially a signed list of hashes, sizes, and file names, but provide one per source package or per binary package, then download that, check in that file whether the delta is available, and then download the delta. That means we have to download a lot of small files, but that should not be a problem, because I think most mirrors support pipelining acceptably well; otherwise pdiffs wouldn't work either. So it should be plenty fast, and you don't really notice the overhead of doing a lot of small file requests. We just have to decide whether we want these files per source package, per binary package, or even per version.

So let's talk numbers. What do deltas bring us? First, a simple evaluation I did was upgrading stretch all the way through the point releases. On the right side, you see upgrading just with debs, and on the left side, we mix in deltas. We can see that the upgrade size shrinks from 1,165 megabytes to 240 megabytes; that's for the GNOME live image. So we save about 924 megabytes, or about 80% of the download size, which I think is really nice to have. Compared to debdelta, it performs about the same. For actual user installations, like the live image package sets for GNOME, KDE, and Xfce, our new implementation performs a bit better; on average over the whole archive, it performs slightly worse: we have 60% of the normal deb size to download, compared to 50% with debdelta. I haven't yet figured out precisely why, so I'll do some more investigation there. But it's very close, so it's not really a problem, and it's nice that the more common case, a user who has installed a desktop, works a bit better than, or about the same as, debdelta.

Finally, how much is the size increase? We're looking at about 11% of the size of each deb we want to generate a delta against. So if we provide a delta from the last point release to the current point release, we increase the archive size by 11% for every updated package. As for building the deltas, I did it on a 32-core Google Cloud Compute Engine instance, and it took about 20 minutes per architecture, which I think is OK from a time perspective. It's probably a bit slower on an 8-core machine; I'm not sure, a few hours maybe. It should be fine, and you don't have to build all the point release deltas just before the point release or something; you can build the deltas directly when you're building the package, or you could build them asynchronously. So that's stuff to talk about and figure out: how we want to build deltas, how we want to ship deltas. But I think it's a worthwhile approach to delta upgrades. And yeah, that's it from the talk side. Do you have any questions?
There's a mic; you can go there and ask questions.

Production ready, would you say, all of this?

OK, so it's mostly at prototype level at the moment. I have a working dpkg, actually, that can install deltas, but it's not really good code. On the apt side, I don't have anything yet, because we haven't figured out the entire archive format yet. And the dpkg side will, of course, also need acceptance from the dpkg maintainer. Our plan is to work on it so we can ship it, like, in April next year, in Ubuntu probably. Not sure about Debian, because of the dpkg maintainer. But yeah, that's my timeline, basically.

Could you go back to the slide with the charts, please? This one? The next one. So that last yellow bar is really close to both the delta deb and debdelta bars.

It's logarithmic. Oh, sorry, it's a logarithmic chart. I should have mentioned that; the problem is that the full size is just too big compared to the deltas, so I had to use a logarithmic scale.

OK, thank you. Can you go to the previous one? The first, well, actually, the second entry: it's quite similar on each side. So I'm wondering if that shows that some types of... OK, what was different about this point release that made it compress worse?

So the question is: are there some types of upgrades which don't perform as well with this algorithm, and do I know what those are? Well, if you have compressed data, it probably performs worse. If it's not reproducible, it generally performs worse. And if you have major version upgrades, like Firefox or Chrome, they might produce a big delta, like 60% of the full deb size, and we just drop the delta and use the full deb instead, in order to not blow up the archive size too much. But I don't know which package specifically caused this. Anyone else?

Is this only for stable?

I just tried it with stable here. You could probably also run it for unstable, but with unstable you have the problem that you might have four package updates in a week or something, and you don't know how many deltas to generate. So that's something to think about for unstable. For stable it's easier, and you have fewer deltas to generate, because only a subset of packages changes. So it's OK for the mirror archives, for stable at least.

But for unstable, it's too much for the mirror repositories, you say?

It might be. I don't know.

You don't know. OK, I take it. I use a sneakernet with this little disk that I put into all my machines. So is there going to be a version of aptitude that's all ready to go for this thing?

Well, aptitude would just work; it uses the normal apt stuff, so you could just download the deltas with that. But if you have different versions installed on different machines and you want to update them to the same version, you might need different deltas for these machines, so you might just want to disable deltas there and use full debs instead.

Just to follow up on what Hideki said: if something sensible could be come up with for unstable, that would be great, because one of the pain points of running unstable is how long you have to wait for stuff to download.

Yeah, I agree. I see that it's a harder problem.

It seems worth doing. I think if it's just a 15% size increase for the archive, it's probably worth it.

OK, final question now.
If someone suggested opportunistically generating deltas when a user requests them, and then caching them on the server for a period of time, or least-recently-used or something like that, would your reaction be "that's interesting" or total horror, or what?

I think it's more complicated, and the first user potentially has to wait very long, because the delta needs to be compressed and delta generation is relatively slow. What such systems usually do is: the first user requests the delta and gets back "no delta", and then the next user probably gets the delta. That might be fine, but it requires us to have a centralized web service, and it doesn't mirror; you can't use mirrors with that, which I think is bad, because we want this to be mirrorable.

OK, that's it. Thank you.