This is the testing and fuzzing 2023 report. I work for Collabora now, and I'm very pleased to have the opportunity to continue working on LibreOffice in that capacity. This is the latest in a long series of repetitive talks. The three topics are the long-term campaigns we've been running for years now: code quality, security improvements, and general improvements to the stability of document conversion and the like. So let's cover our Coverity work, our crash testing, and then our fuzzing.

First, code quality: catching where things go wrong. We run Coverity locally, on my own hardware, and upload the output to Coverity's servers. They process the output of the compiler and flag all the potential issues they can statically determine. The results are public, at the URL there. In the past you had to be a member to see the results; now you just need a GitHub login. We have it configured to scan the C++ and Java code in LibreOffice, with the particular configuration shown at the bottom there. The only results in our public Coverity project are for our own code: even though we have a lot of external dependencies, we only scan our own stuff. That's one area of concern going forward; it would be great if other projects did the same on their code.

An example of a common warning, the kind of mistake people make: they add a new member to a class that has two constructors. They know full well it needs to be initialized, but they only initialize it in the first constructor they see and miss it in the second one. Coverity catches all these common kinds of errors, and it catches the typical issues that arise in very, very long functions. Somebody comes in and adds something to an existing long function; they want to be careful, so they null-check it. Coverity runs and reports back that while there's a new null check added there, the pointer wasn't checked in the other 400 places it was dereferenced in the same function, so there's no point adding that check in the new code. Or vice versa: you added something and did not null-check it, and Coverity points out that it has been null-checked previously, so there's something unusual or different about your code and you really need to review it.

Coverity is not perfect, though, and you can annotate code to get Coverity to ignore it for whatever reason. There are three patterns: Coverity is correct in what it reports but it doesn't matter in this case; Coverity is incorrect about it; and then you can either mark the report as false in the user interface or have Coverity ignore it completely in the code, and this is the markup for that.

Right now we use std::optional quite a lot, which seems to be unusual among the code bases Coverity sees, and Coverity doesn't really understand std::optional. So you regularly get uninitialized-payload-member warnings coming through in recent reports. You can just ignore those; we're dismissing them in the user interface for now, and we're hoping the next version of Coverity will get this right.
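As a concrete illustration, here is a minimal sketch of that constructor mistake together with the shape of an in-source suppression. The `// coverity[...]` annotation syntax is Coverity's, but the class, the event tag, and the helper function here are invented for the example, not taken from the LibreOffice tree.

```cpp
// The classic miss: a member added later gets initialized in one
// constructor but not the other.
class Example
{
    int m_nCount; // member added after both constructors existed
public:
    Example() : m_nCount(0) {}
    explicit Example(bool bVerbose) { (void)bVerbose; } // Coverity: uninit_member
};

// And the shape of an in-source suppression, for a report that has been
// reviewed and judged harmless or wrong; the comment names the Coverity
// event being dismissed on the following line:
int readValue(const int* pValue)
{
    // coverity[dereference] - reviewed: callers guarantee pValue is non-null
    return *pValue;
}
```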
So what does this look like over time? We've been at this quite a long time now, and we like to keep these numbers nice and low. It's easier to maintain something when the count is low: when new results come in, they're obviously unusual and they get fixed. When you have a high count and people have to scan through thousands of stale reports every time they look, nobody pays any attention. These long-term things only work if the numbers stay really low: if you go from zero to five, people pay attention; if you go from 995 to 1,000, people pay no attention. So it's important to keep on top of them.

This is the chart of defect density over time. The red line is what Coverity claims is the average density of a free software project, and the bottom line is our own, as we bump along at zero for many years now. The most recent snapshot of the analysis metrics, from the beginning of September, is back down to zero again, which is where we like to be. So that's Coverity and what we do there.

Next, crash testing. Crash testing is where we have a big pile of documents, now more than three-quarters of a million, and we regularly import them all. We find these documents in the first place with scripts: we parsed Bugzillas for a lot of them, and more recently, from Xisco's work, we pull down whole piles of documents from various internet forums; we have half a million Excel documents from one particular forum alone. We import them all; for most formats we then re-export them all; and we report the failed imports and exports, generally crashes and asserts. When something does crash or assert, we get backtraces, upload them all to a particular location, and report on the mailing list where that location is and how many crashes there are, as a kind of quick look each week. It takes about four days to process all these documents, so you generally get one report a week.

Where's the current status on that? The most recent numbers I looked at showed 13 remaining issues, broken down into these categories here: some asserts in parallel calculation in spreadsheets, and the rest are all asserts in Writer itself. None of them are outright crashes; they're all asserts. I think the Calc ones are probably indicative of a potential crash or potentially incorrect results; the rest are probably more semantic. I think Mike may have fixed some of these asserts most recently, and there was another email report saying one of the other issues is done as well. Against that, there is an additional issue with Armin's ItemSet work; he was here a moment ago, so hopefully we'll get that sorted soon and these numbers improved.

The next thing is new, beyond the usual conversion testing: what we've been doing recently at Collabora is Microsoft Office regression testing.
We take the particular documents I described, export them, and then import them with Microsoft Office to see how well our export is doing: are we creating documents that are parsable by Microsoft Office? The first time this was turned on and used for real, there in the little circle at the bottom of the slide, it quickly found that some of our recent theme export work had caused an issue where we were creating documents that we could re-import without any failure, but that caused Microsoft Office itself to fail when it imported them. A good catch there. We need to extend this and make its reporting more regular.

I don't want to go through this in great detail, but just to describe the process for the remaining bugs: what I found for the last set of Calc parallelization issues is this. For these parallel calculations we take a look at the formula, work out which cells it depends on, and if we decide we are able to calculate, we go ahead and do the calculation in parallel. If the parallel calculation then finds out that some of the data it depended on was not properly available, we get these asserts saying that the original logic that declared the calculation parallelizable was incorrect. Debugging into it further, I found a couple of quirks with our LOOKUP formula in Calc, and that was the reason it was asserting: Excel's LOOKUP calculation is much more constrained in what input it uses than Calc's. In various circumstances LibreOffice Calc extends the ranges of cells it was given, in different directions, so we end up using data that was not directly described in the formula as input to the result. That is the reason the assert was firing and the reason those 13 or 14 parallel Calc documents were failing. A very good catch: the original assumptions were incorrect, and the fix was to constrain that particular calculation so that only well-formed LOOKUP functions fitting the criteria are considered suitable for parallelization. There is one remaining Calc parallel issue to be solved; perhaps it's something of a similar nature.
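To make the shape of that fix concrete, here is a hypothetical sketch; it is not the actual Calc code, and the type names and the well-formedness test are invented stand-ins for the real dependency-analysis logic.

```cpp
#include <cstddef>

// Invented stand-ins for Calc's real formula-token types.
enum class OpCode { Lookup, Other };

struct LookupRanges
{
    std::size_t nSearchLen; // cells in the search vector
    std::size_t nResultLen; // cells in the result vector, 0 if omitted
};

// LOOKUP is only predictable for the dependency scan when an explicit
// result vector of matching length is given; otherwise Calc may extend
// the supplied ranges and read cells the scan never recorded.
static bool isWellFormedLookup(const LookupRanges& rRanges)
{
    return rRanges.nResultLen != 0 && rRanges.nResultLen == rRanges.nSearchLen;
}

// Called while deciding whether a formula group may take the parallel
// path; anything suspicious falls back to serial calculation.
bool isSafeForParallelCalc(OpCode eOp, const LookupRanges& rRanges)
{
    if (eOp == OpCode::Lookup)
        return isWellFormedLookup(rRanges);
    return true;
}
```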
The third category is fuzzing. For Coverity we build locally and send the results up to Coverity's servers, where they process it; for OSS-Fuzz it's the opposite scenario: it gets built on their side. We configure things to let Google know how to build our stuff, and they build it on their servers and run it on their side. This diagram here is a rough indication of which parts of their infrastructure are used. It's built twice a day, and we have about 50 different fuzzing targets covering 39 different file formats: basically Microsoft Word, Excel, our own file formats, and a lot of the very obscure formats, HWP, the Hangul word processor format, Lotus Word Pro; all of these have a fuzzer for import. Google's fuzzing then builds these with three or four different fuzzing engines and different sanitizer configurations: AddressSanitizer, MemorySanitizer, UndefinedBehaviorSanitizer. All told, our 50 fuzzers get turned into about 200 different fuzzing processes that run 24/7, every day of the year, and are rebuilt twice a day. The configuration for that is again in our own git, so you can examine what it is there.

Unlike our usual scenarios, it's statically built, so there are no dynamic libraries, which is what's recommended for OSS-Fuzz. That brings it closer to the iOS target we have than to the normal desktop target. We run it through a configuration layer and do a whole set of other things that are specific to fuzzing.

The practical difficulty with fuzzing turns out to be timeouts. Where with something like Coverity you get false positives from static analysis, with fuzzing you basically get false complaints because of timeouts. You get about 30 seconds before it just declares that your process has hung, so you have to keep everything under 30 seconds. Sometimes we have had plenty of issues where things were genuinely hanging, and that is an actual bug to be solved. But there's always some sort of document that will not get processed in 30 seconds for quite legitimate reasons. We have an awful lot of file formats, such as GIF and others with built-in compression, where a very, very small input can legitimately expand to an incredibly large output that cannot be processed in 30 seconds. You can limit the input size of the document to be processed, but some formats have practically infinite decompression, so we have to do a little bit of hackery to cap the maximum decompression ratio of a document, so we can be sure we're processing documents of a size that can be handled within the 30 seconds. It's a lot of workarounds to make the fuzzing effective. And you only get one timeout reported per fuzzing target, so if you solve a particular timeout, you can be fairly sure that within 24 hours a new timeout will replace it. It's been a very, very long process of taking out, one by one, all these places where we get stuck, so that we can effectively find the issues we're really interested in.

Similar to timeouts are out-of-memory reports. You have the same kind of issue: there are too many loopholes in too many of the file formats where a very, very small input will consume huge amounts of memory. Some of these are documented; libjpeg-turbo, for example, has a document on how the JPEG specification has a few loopholes where tiny inputs eat extraordinary amounts of memory. Following the recommendations there, you can limit JPEG memory use with an incantation, and similarly for spreadsheets, you can create tiny inputs that consume vast amounts of memory for matrices. So again, you have to go through a large number of steps to head off the corner cases so that you can fuzz effectively.
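The JPEG incantation referred to is libjpeg's public memory-manager limit. A minimal sketch, where the 100 MB figure is an arbitrary example rather than the limit LibreOffice actually uses:

```cpp
#include <cstdio>
#include <jpeglib.h>

// Decode a JPEG while asking libjpeg's memory manager to stay under a
// fixed budget, instead of letting a crafted file demand arbitrary memory.
void decodeWithMemoryCap(FILE* pInput)
{
    jpeg_decompress_struct cinfo;
    jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);

    // The cap: ~100 MB here, purely as an example figure.
    cinfo.mem->max_memory_to_use = 100 * 1024 * 1024;

    jpeg_stdio_src(&cinfo, pInput);
    jpeg_read_header(&cinfo, TRUE);
    // ... decompress as usual ...
    jpeg_destroy_decompress(&cinfo);
}
```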
That import fuzzing has been running for quite a while. So what's new this year? Those were all import filters: we've been focusing on reading documents in and seeing if they fail one way or the other. More recently I've been concerned about our PDF exporter, so I've been looking into improving our fuzzing to cover that. But to export something you have to import it first, and seeing as we're already importing in all these filters, we're not that interested in repeating the entire process of fuzzing the import. We want to create documents that explore the export.

So what we have so far is this first flat-ODT-to-PDF fuzzer: not interested in looking for import crashes, but in crashes that happen at layout or when exporting to PDF. At the top here is an example of an existing fuzzer target. Basically you're given this signature, a pointer to data of a given size, and you have to do something with it. Normally you just put it into a memory stream and call something that loads it as a document. That has a return code, and when you look at the documentation, if you return zero the fuzzer keeps the input, deciding it's a good document to put into the corpus it will reuse when it fuzzes in the future. So the first thing I tried: if a document doesn't load, there's no point holding onto it for future attempts at export fuzzing; it should be discarded and not reused. That's stage one.

Stage two is the custom mutator. The idea here is that you can customize how the fuzzer mutates data when it's trying to generate a document to feed in. The top part calls the base, ordinary mutator, and at that point we check: has it generated legitimate XML? If it has, then sure, go ahead and pass it to LibreOffice to load and see if you've created a document that crashes the export. But if you haven't generated genuine XML, there's no point continuing, because the only thing you're exercising is the import filter, and we're happy with the import filter. We want to get all the way past the import layer, right to the layout and right to the export, and see whether we can find issues we haven't found already. That took quite a while.
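Here is a condensed sketch of those two stages using the standard libFuzzer hooks. The entry-point signature, the return of -1 to keep an input out of the corpus, and LLVMFuzzerMutate are libFuzzer's documented interface; loadDocument, renderAndExportToPdf, and isWellFormedXml are invented stand-ins for the real LibreOffice helpers.

```cpp
#include <cstddef>
#include <cstdint>

// Provided by libFuzzer: the ordinary, built-in mutation strategy.
extern "C" size_t LLVMFuzzerMutate(uint8_t* Data, size_t Size, size_t MaxSize);

// Invented stand-ins for the real LibreOffice helpers.
bool loadDocument(const uint8_t* pData, size_t nSize);
void renderAndExportToPdf();
bool isWellFormedXml(const uint8_t* pData, size_t nSize);

// Stage one: discard inputs that don't even import, so the corpus only
// accumulates documents that reach layout and export.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* pData, size_t nSize)
{
    if (!loadDocument(pData, nSize))
        return -1; // -1 tells libFuzzer not to add this input to the corpus
    renderAndExportToPdf(); // the part we actually want to exercise
    return 0;
}

// Stage two: mutate as usual, but only hand back mutations that are
// well-formed XML; anything else would merely exercise the import parser.
extern "C" size_t LLVMFuzzerCustomMutator(uint8_t* pData, size_t nSize,
                                          size_t nMaxSize, unsigned /*nSeed*/)
{
    for (int nAttempt = 0; nAttempt < 8; ++nAttempt)
    {
        size_t nNewSize = LLVMFuzzerMutate(pData, nSize, nMaxSize);
        if (isWellFormedXml(pData, nNewSize))
            return nNewSize;
        nSize = nNewSize; // keep mutating from the rejected attempt
    }
    return 0; // give up this round rather than feed non-XML to the importer
}
```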
So what have we got in results so far from the export fuzzer? Well, it's been a little bit underwhelming; it hasn't found huge amounts. I sincerely believe there's a huge amount of layout hangs and crashes in the Writer layout; that's well known. So I had hoped we could generate documents that capture these in as small an amount of XML as possible, and I started off with a very, very small input size limit, as mentioned on the earlier slides. What I've got so far is one timeout, where the font size is just a huge point size, the document goes off to generate hundreds of pages of PDF, and the famous 30-second timeout kicks in. So yes, you can create tiny documents that produce very many pages. Again, you're back to the question of whether you care about that, and for this area I don't care so much, so I just limited the number of pages we export in the PDF export during fuzzing: we export the first 100 pages, say that's good enough, and return. That's the first workaround.
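That workaround might look something like this; it is purely a hypothetical illustration, with an invented FUZZING guard and stand-in types rather than the real LibreOffice code.

```cpp
#include <algorithm>

// Hypothetical stand-ins; not LibreOffice's real types or names.
struct Document { int nPageCount; };
void exportPage(const Document&, int /*nPage*/) { /* render one page */ }

void exportDocumentToPdf(const Document& rDoc)
{
    int nPages = rDoc.nPageCount;
#if defined(FUZZING)
    // A tiny input with a huge font size can legitimately balloon into
    // hundreds of pages; the first 100 exercise the export code just as well.
    nPages = std::min(nPages, 100);
#endif
    for (int i = 0; i < nPages; ++i)
        exportPage(rDoc, i);
}
```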
The second result was more productive: a direct leak inside Ruby text. If the Asian phonetic guide over the text does not fit on one line and we split it to fit on a second line, we had a memory leak, and that's been there for decades. So yes, it's working: it is exercising the layout and it is exercising the PDF export, but I'm not getting the kind of results I'd hoped for given the time I've put into it. So maybe the documents need to be longer before you can get real-world results; I'm bumping up the length limit every few days, and maybe we'll get to some point where we get better results. Or maybe there's something a little bit more fundamentally wrong with the approach I've taken, trying to skip the boring import part and get straight to the more interesting export part, and something alternative should be done. Or maybe we've solved all the problems I was trying to find. It's unlikely, but who knows.

We're not too far from the end now, so I'll just give the fuzzing statistics for the last couple of years — not to show what year we started, but as you can see, we had a huge number of issues in the early years while we were still working through the low-hanging fruit. In the following years things were under control. It looks like it increases, but what's happening there is that every time things are under control and quiet for a couple of weeks, I add another fuzzer, until you're at a stable level of results. 2022 was, I think, the last time I added a serious extra fuzzer and expected new results. This year was fairly stable: I added the new export fuzzer I've described, it hasn't generated a lot of results, so we're at a very, very low level of issues. A suspiciously low level, and I'm not entirely convinced that we're running as often now as we were in the past. Maybe we're rate-limited, maybe it's been slowed down in general, maybe some of our builds are failing more regularly than they used to, so we're not getting as much coverage as we got before. But it is working, and what I find really helpful is that when new things are happening and new changes are coming through — somebody adds a new feature like animated PNG support or other work like that — you can see that it is being tested the next morning, and the low-hanging new issues that have been introduced are found pretty quickly. So I'm pretty happy it's working quite well as an early-detection tool: new parsing errors and any security issues introduced in file format handling are picked up quickly, even if I'm not getting the results I was hoping for on those older Writer layout failures that I'm convinced exist in there. And that's it. Thank you. Any questions?

Yeah, I have a question. You do a huge job, but my question is about the typical problems you discussed at the start, the uninitialized members: could we catch those kinds of issues ourselves, rather than depending on Coverity?

Yeah, I've wondered about that before. There was a period of time when we couldn't use Coverity, because we changed our baseline to a version of C++ that Coverity didn't support yet. So I made a little mental list of the most useful parts of Coverity that we're using, and whether we could replicate them ourselves with plugins. The things that came to mind: initializing a member in only one constructor is a common pattern. And to be fair, with so many programmers coming from Java, where fields are zero-initialized by default, just not initializing a new variable at all is another common issue. We could do a lot of it with maybe four or five basic clang plugins, which would catch most of the day-to-day issues. There are other checks, like finding code that byte-swaps data and then seeing that data being used elsewhere for loop conditions; that's a really good Coverity feature that I think is a little bit difficult to implement as a plugin. Maybe it could be done, but it's not obvious to me. So yeah, it's just easier to have Coverity do it, but if it stops working again, I think those would be sensible things to do.

Now, is this Xisco saying he's found another 50 million documents? — I was wondering whether it would be interesting to also exercise the operations users actually perform on these documents, like selecting everything, copying, cutting, pasting, and testing those, rather than only the import and export paths.

Yeah, I think some of that makes a lot of sense. We have — well, they're probably quite old now and probably need to be regenerated — but at one stage I generated a subset of the documents we have by running them through a minimizer, so these are the documents that exercise the most code paths. We have this million documents; if we try to do something with every one of them, it's going to take forever. But if we have a subset, say these 25 or 100 documents that exercise 95% of the import code paths, then maybe we could take that subset and do additional things with them on every cycle, like what you describe. Or even make those 100 or 200 documents part of continuous integration, so that on every commit you do this import and export on documents known to exercise most code paths. Other projects have gone so far as to take advantage of OSS-Fuzz's ability to run fuzzing on every commit for five or ten minutes. We don't make use of those newer things that are semi-standard on GitHub; we're a bit big for a lot of them, but we should consider either doing that or replicating it ourselves. The idea of taking a small set of documents known to exercise most code paths sounds good to me, and the more automation, the more automatic stuff that happens, the better, I think. Okay, thank you.