Hello, my name is Attila Szűcs. I am a software engineer at Collabora, and today I will speak about Zip64 support in LibreOffice.

First, what is Zip64 and why do we need it? Most of our documents are compressed with ZIP. The original ZIP file format was designed long ago, and like every old format it has limitations. For example, the file size was stored in 32 bits, which means the biggest file it can handle is 4 GB. That was totally fine at the time, because all hard drives were smaller than 4 GB, so all files were smaller than 4 GB too, and it would have been a waste of memory to store the file size in bigger variables. But as technology advanced, we reached some of these limitations. There are many files now that are bigger than 4 GB, and if we want to compress them with ZIP, the format had to be extended. Zip64 is the extension that extends the file size limit, and not just that; it extends many other things too, but the file size is probably the most important for us. Fortunately, the ZIP file format was designed from the very beginning to be easily extensible, so there are many extensions for ZIP.

One more note on why we need it: it's not only huge documents that come in Zip64 format. There are small files too that are saved in Zip64 format, and LibreOffice was not able to open those files just because they are in a newer format.

So what are these limitations that the extension lifts? The most important is the uncompressed file size, which had a maximum of 4 GB before the extension; now the limit is so huge that we can't foresee when we will reach it. Maybe in the future we will, but never mind. The uncompressed file size refers to the files we want to compress. The second is the compressed archive size. It's similar to the previous one, but it's harder to reach this limit, because when we compress the files they become smaller, maybe 100 times smaller, depending on how well the file compresses.
The compressed archive size is less important for us now, because even the uncompressed 4 GB limit is hard to reach with LibreOffice; hitting the compressed limit would mean the uncompressed content reaching something like 40 GB.

Zip64 extended the file count too. That's not really important for us either, because the old maximum was 65,535 files, and while someone inserting a huge number of small images, say 100,000, could run into it, that is very extreme. If someone manages to do that, they can file a ticket in Bugzilla, but I've never seen a file that came even close to that point.

Another extension is the disk count. That is really not important for us. It comes from the old times when we had floppy disks and had to copy files bigger than what fit on one floppy, so ZIP had, and still has, a feature to split the archive into multiple parts so the smaller parts fit on the disks. Our documents are always in one part, so the LibreOffice code doesn't even handle this variable; it simply skips it and assumes there is one part anyway. There are a few more limitations, but they are even less important for us. The one that really matters is the first one, the uncompressed file size. Even now it's hard to reach, but if someone builds a huge database in Calc, they can generate a content.xml that is bigger than 4 GB, even though the ZIP archive itself may be only around 100 MB. Such a file could not be opened in the old LibreOffice.

Where is this data stored in the ZIP archive? Before we can answer that, let's look at what a ZIP file looks like. A ZIP archive is built from smaller parts; the standard names them records. Some of these records start with a signature. If we open a file in a hex editor and see that it starts with the letters "PK", and maybe we see other places in the file with "PK" too, then there is a good chance it's a ZIP file. One interesting thing about the ZIP format is that we have to start reading it from the end.
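To make the "PK" signatures concrete, here is a minimal sketch in Python (the LibreOffice code itself is C++; the function name and the heuristic are my own, not from the talk). The signature byte values come from the ZIP specification:

```python
# Each ZIP record type starts with "PK" plus two bytes identifying
# the record (values per PKWARE's APPNOTE specification).
SIGNATURES = {
    b"PK\x03\x04": "local file header",
    b"PK\x01\x02": "central directory file header",
    b"PK\x05\x06": "end of central directory record",
    b"PK\x07\x08": "data descriptor (optional signature)",
    b"PK\x06\x06": "Zip64 end of central directory record",
    b"PK\x06\x07": "Zip64 end of central directory locator",
}

def looks_like_zip(data: bytes) -> bool:
    """Quick heuristic: a ZIP archive normally begins with a
    local file header signature, i.e. "PK\\x03\\x04"."""
    return data[:4] == b"PK\x03\x04"
```

This is only a first-glance check, like eyeballing the hex dump; a real reader must parse from the end, as described next.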
First we have to read the end of central directory record. It's a mostly fixed-size record, and it contains information about the size and position of the central directory. With that we can seek back to the beginning of the central directory and read it in full; it holds further information about the files. The file headers in the central directory tell us, among other things, where each file starts in the ZIP archive.

The example I drew here is a simple, basic ZIP archive. It doesn't contain every part; there are more kinds of records than this, I only drew what is important for us. The files are at the beginning of the ZIP archive, one after the other. These entries can be directories too, not just files. The local file header contains some information about the file, such as how big it is and what its name is; the file data record is the actual file data, which may or may not be compressed. The data descriptor is usually not needed; it exists mainly for streaming purposes. Some records may or may not be present in a ZIP archive, depending on other data we can read from the archive.

So for the Zip64 extension, let's check which records were extended and which records are new. The extended records are the local file header, the file header (this one is in the central directory), and the data descriptor. The new records are the Zip64 end of central directory record and the Zip64 end of central directory locator. I didn't implement these last two; I implemented only the first three. When I say implement, I mean I read the values from them. It does not mean I use all the values I read; as I mentioned with the disk count, we don't need that data, so we just skip it, but at least I read it. The local file header and the file header have a smaller part inside them: the extra field.
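The "read it from the end" step above can be sketched as follows. This is a simplified Python illustration, not the LibreOffice implementation: it handles only the non-Zip64, single-disk case, and the function name is mine:

```python
import struct

EOCD_SIG = b"PK\x05\x06"

def read_eocd(data: bytes):
    """Scan backwards for the end of central directory record and
    return (entry_count, cd_size, cd_offset). Sketch only: assumes
    a non-Zip64, single-disk archive."""
    # The EOCD sits at the very end, possibly followed by a comment,
    # so we search backwards for its signature.
    pos = data.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("not a ZIP archive (no EOCD found)")
    # Fixed part after the signature: disk numbers (2+2 bytes),
    # entry counts (2+2 bytes), CD size (4 bytes), CD offset (4 bytes).
    (_, _, _, total_entries,
     cd_size, cd_offset) = struct.unpack_from("<4H2I", data, pos + 4)
    return total_entries, cd_size, cd_offset
```

With `cd_offset` in hand, a reader seeks back to the central directory and walks the file headers there, exactly as the talk describes.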
The extra field is present in the original ZIP format too, but it can contain many things, and the Zip64 extension defines a new kind of entry that we can write into it. One important thing: Zip64 is not a property of the entire archive; every record can separately be in Zip64 mode or not. That can be senseless sometimes, but for example, if we have one file in a ZIP archive, its file size can be stored in three different places: in the local file header, in the file header, and, if we have a data descriptor, there too. The same data is stored in three places, yet we can save one of them in Zip64 mode and another in the original ZIP mode.

So let's check this extra field and its parent, the local file header. The local file header starts with a signature, then version, flags, compression method, time and date of last modification, checksum, compressed size, and uncompressed size. Those last two are each stored in four bytes, so they cannot be bigger than 4 GB. Then come the length of the file name and the length of the extra field, followed by the actual file name and the actual extra field. In many cases this extra field does not exist, that is, its length is zero. The file header in the central directory looks similar to this; it has the same data and a lot more, I only drew this one here because it's simpler.

The standard states that if the file size is bigger than 4 GB, then we have to write the maximum representable number into the uncompressed size field, and in the extra field we have to write a Zip64 extra field where the real uncompressed size will be present.
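The field layout just listed can be decoded in a few lines. A Python sketch under the same simplifying assumptions as before (single disk, no encryption; the helper name is hypothetical):

```python
import struct

def parse_local_file_header(data: bytes, offset: int = 0):
    """Decode the fixed 30-byte part of a local file header plus the
    variable-length file name and extra field."""
    (sig, version, flags, method, mod_time, mod_date,
     crc32, comp_size, uncomp_size,
     name_len, extra_len) = struct.unpack_from("<I5H3I2H", data, offset)
    if sig != 0x04034B50:          # "PK\x03\x04" as a little-endian int
        raise ValueError("not a local file header")
    name_start = offset + 30
    name = data[name_start:name_start + name_len]
    extra = data[name_start + name_len:name_start + name_len + extra_len]
    return {
        "flags": flags, "method": method, "crc32": crc32,
        # 0xFFFFFFFF here means: the real size is in the Zip64 extra field
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
        "name": name, "extra": extra,
    }
```

Note how the two 4-byte size fields are read as plain 32-bit integers; the Zip64 handling only kicks in when they hold the sentinel value 0xFFFFFFFF.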
And the standard also states the other direction: if the compressed size or the uncompressed size equals the maximum number that can fit in that field, then the real value is in the extra field and we have to read it from there. I wrote a question mark there because, funnily, the standard is not clear here. It just lists compressed size and uncompressed size under each other and then writes this statement; it doesn't say whether both of them have to be this special value, or whether it's enough if one of them is. But anyway, in that case both of them will be present in the extra field.

So let's check the extra field. The extra field is an array, not just one value; it can hold many extra field entries. In the general case, an extra field entry has a 2-byte ID, a 2-byte size, and data that is as long as the size says. In the Zip64 case the ID is 1 and the size should be 28, although sometimes it is smaller; the data holds the uncompressed size and the compressed size in 8 bytes each, plus two more fields that are not important for us now. This is again a strange thing in the standard: it doesn't mention that this entry can be smaller; it doesn't even say it has to be 28 bytes. It just lists what fields are there, and that's all, so you cannot be sure what to expect in real files. As I have seen, some files carry only 16 bytes of Zip64 data, just enough for the uncompressed size and the compressed size. It's strange that the standard neither forbids this nor mentions it.

The next thing is the data descriptor. It's rarely used; it's designed for streaming. Normally it has four fields, each of them 4 bytes, but here again is an unclear thing: whether the signature is there or not. The standard says it normally does not need a signature, but many applications write a signature here, so we have to be prepared for a signature that may or may not be there.
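The variable-length Zip64 extra field described above, including the "sometimes only 16 bytes" quirk, can be handled defensively like this. Again a Python sketch with a hypothetical function name, reading only the values that are actually present:

```python
import struct

ZIP64_EXTRA_ID = 0x0001

def parse_zip64_sizes(extra: bytes, comp_size: int, uncomp_size: int):
    """Walk the extra-field array; if a Zip64 extended-information
    entry (ID 0x0001) is present, replace any 0xFFFFFFFF size with
    the real 64-bit value. Tolerates entries shorter than the full
    28 bytes, as seen in real files."""
    pos = 0
    while pos + 4 <= len(extra):
        field_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4:pos + 4 + size]
        if field_id == ZIP64_EXTRA_ID:
            # The spec lists uncompressed size first, then compressed
            # size; each value appears only if the corresponding 32-bit
            # field held the 0xFFFFFFFF sentinel.
            off = 0
            if uncomp_size == 0xFFFFFFFF and off + 8 <= len(body):
                uncomp_size = struct.unpack_from("<Q", body, off)[0]
                off += 8
            if comp_size == 0xFFFFFFFF and off + 8 <= len(body):
                comp_size = struct.unpack_from("<Q", body, off)[0]
                off += 8
            break
        pos += 4 + size
    return comp_size, uncomp_size
```

The bounds checks (`off + 8 <= len(body)`) are exactly the defensive posture the talk argues for: since the standard does not pin the entry length down, the reader cannot assume all 28 bytes are there.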
Here the compressed size and the uncompressed size are the important fields for us. If we are in Zip64 mode, these fields are not 4 bytes but 8 bytes long. Again, it's not totally clear when we are in Zip64 mode. I decided to take this information from the file header in the central directory, because in the local file header the file size can be yet another special value in some cases, especially when a data descriptor is used.

So the standard is designed to be well extensible. It's a very good thing that there are many spots in a ZIP archive where we can insert our own special data if we want, as long as we do it the right way so that we can read it back. The standard doesn't really forbid much; it allows almost everything that is possible. I think it was designed to support anything we can imagine. It even allows senseless things, like storing a file's size half in Zip64 mode and half not in Zip64 mode, but well, if someone does that, it's their own fault.

This format is commonly used and there are many extensions, for example via the extra field I mentioned: there are many extension types, because many companies needed their own. And it's very complex; it supports many special cases, like encryption, compression, streaming, splitting, self-extraction, and even these can be very complex. Take encryption: we can choose to encrypt all of the files or just some of them; we can encrypt the central directory to hide the file names from other people; and we can even choose between encryption methods. And the standard is not exact, which can be a problem: when we follow it, we are not sure what to expect, what we have to be prepared to read. It was funny to read in the standard that originally there was no signature for the data descriptor, but many people wrote one anyway, so be prepared for it. And well, that's the hard part of it: it's hard to prepare for every use case. We have to try to imagine
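A data descriptor reader that tolerates both ambiguities mentioned above, the optional signature and the 4-byte versus 8-byte sizes, can be sketched like this (Python, hypothetical function name; the Zip64 decision is passed in, e.g. taken from the central directory as the talk suggests):

```python
import struct

DD_SIG = b"PK\x07\x08"

def parse_data_descriptor(data: bytes, offset: int, zip64: bool):
    """Read a data descriptor, tolerating both variants: with or
    without the optional "PK\\x07\\x08" signature, and with 4- or
    8-byte sizes depending on whether the entry is in Zip64 mode."""
    if data[offset:offset + 4] == DD_SIG:
        offset += 4                      # skip the optional signature
    fmt = "<I2Q" if zip64 else "<3I"     # crc32, comp. size, uncomp. size
    crc32, comp_size, uncomp_size = struct.unpack_from(fmt, data, offset)
    return crc32, comp_size, uncomp_size
```

Probing for the signature first and only then unpacking is the pragmatic reading of "normally it does not need a signature, but many applications write one".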
what other people might do, especially the special cases we would dismiss as a stupid example, because sometimes some applications do it anyway. I mean, there are applications that save ZIP files in Zip64 mode even for small files, and we have to be prepared for them.

About testing: I wrote a unit test with a small Zip64 file, just like the ones I mentioned that some applications produce; that worked well for us. But for the real test, making a ZIP archive with a file bigger than 4 GB inside it, even creating such a test file was a challenge, and I was able to test it only manually, because a unit test has to be fast and this is really slow. It took several minutes just to import such a file, and in a debug build it was more like 40 minutes, so it's not usable that way.

About future possibilities: I only implemented handling of the uncompressed size; we could implement the compressed size limit too if needed. It's partially implemented: I load this information and changed the variable to 64 bits, but that's not enough; there are other parts that store related information. We should load the new records, I mean the Zip64 end of central directory record and locator. The other hard point is that we should make sure everything this information flows through is 64-bit compatible: if there is even one variable smaller than that, just a function parameter or a return value, and we copy a big value into it, then it will be truncated and LibreOffice will probably crash. I had to fix several of these problems while I implemented the uncompressed size. If anyone wants to check, I listed some code pointers here: some functions which read or write the central directory, and which handle the uncompressed size.

I think that's all. Thank you.
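As an aside on test data: a small file stored in Zip64 mode, like the one used for the unit test mentioned above, is easy to produce. Python's standard `zipfile` module can force Zip64 records even for tiny entries via the `force_zip64` parameter of `ZipFile.open` (the file name `content.xml` is just an illustrative choice; this is not how the original test file was made):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # force_zip64 makes the writer emit the Zip64 extra field and
    # 0xFFFFFFFF placeholder sizes even for a few bytes of data.
    with zf.open("content.xml", mode="w", force_zip64=True) as f:
        f.write(b"<office:document/>")

data = buf.getvalue()
```

The resulting archive is only a few hundred bytes, yet a reader without Zip64 support, like the old LibreOffice, would reject or misread it, which makes it ideal for a fast unit test.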