Hello and welcome to my talk about the crash reporting service used in LibreOffice. It is a pretty short talk, and I will stay mostly at the top level. The talk is structured as you see here: a short introduction as always, then from the high-level view we go into a little more detail about how QA can help with the whole process, then some limitations and problems of the current setup, and then it is already time for credits, prior work and questions.
So let's dive in with who I am. I'm Christian Lohmaier, mostly known by the nick that I use everywhere, which is cloph. I work as Release Engineer and Infrastructure Administrator for The Document Foundation and have been with the project since its creation. Before that I was active in the OpenOffice.org project, so I have been around a long time, always on the infrastructure or buildbot side of things. I hang out on IRC, also with the nick cloph, and you see my email here: cloph at documentfoundation.org is another way to contact me.
So, to the actual topic of the talk, the LibreOffice crash reporting service. On the high level, the first question that comes up is: why use a crash reporting service? We already have a framework of many automated tests. For example, we use Gerrit for the review of patches, and each patch submitted to our Gerrit system is tested on all platforms, so basically all our built-in checks run on each patch by running the build-time tests. In addition to those tests we use different static analysis tools, for example the one provided by Coverity, and others such as the fuzzing tools by Google, and those already catch many problems in our code before they hit the end user.
We also have a pretty extensive set of documents that we run our import, export and re-import tests on, basically round-trip tests, which should cover most of the problems and crashes in our import and export filters. But of course, even with all the systems we have in place, we cannot prevent bugs from happening. In particular we cannot cover all hardware combinations, and we certainly don't have the workflow of each user covered; a different order of operations can lead to different results, for example. That's why it is important to have an additional source of information, especially for a disruptive bug like a crash that could potentially lead to data loss.
What we use to build our crash reporting service is Google Breakpad; everything hinges on the Breakpad project. It is the component that creates the minidump when LibreOffice crashes, and it is also used to unwind that information back on the server. The benefit of a tool like Breakpad is that it allows decoupling the debug information from the application that you ship.
Of course we also need a server to collect the reports, and for that we have a small Django web application. It collects the incoming reports and maps the stripped-down information created on the end user's side against the full set of debugging symbols we keep on the server, so that we get working backtraces that point us to the actual source code once everything is processed. We also have some integration between the crash reporting site and our Bugzilla. Both systems can cross-link to each other, so there are pointers to further information, to steps to reproduce and, hopefully, to attached sample documents where those are necessary. This is a big help in the process, as I will show later in this talk.
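To make the server-side mapping a bit more concrete, here is a minimal sketch of how a raw address from a stripped build can be resolved against a symbol table kept elsewhere. It is not the code of the actual Django application or of Breakpad; the symbol table, the function names in it and the helpers `resolve` and `symbolicate` are all made up for illustration.

```python
from bisect import bisect_right

# Hypothetical symbol table for one module: (start offset, function name),
# sorted by start offset. Real symbol files carry more detail (function
# sizes, source files, line records); this keeps only what the sketch needs.
SYMBOLS = [
    (0x1000, "load_document"),
    (0x1400, "save_document"),
    (0x2000, "paint_view"),
]


def resolve(offset, symbols=SYMBOLS):
    """Return the function whose start offset is the closest one at or below offset."""
    starts = [start for start, _ in symbols]
    index = bisect_right(starts, offset) - 1
    if index < 0:
        return None  # offset lies before the first known function
    return symbols[index][1]


def symbolicate(addresses, module_base, symbols=SYMBOLS):
    """Turn absolute addresses from a crash into function names.

    module_base is the address the module was loaded at on the user's
    machine; subtracting it gives the offset used in the symbol table.
    """
    return [resolve(address - module_base, symbols) for address in addresses]


if __name__ == "__main__":
    stack = [0x7F001010, 0x7F002ABC]
    print(symbolicate(stack, module_base=0x7F000000))
```

The point of the sketch is only the decoupling: the client needs nothing but addresses and the list of loaded modules, while the names live on the server.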
So, going a little further into detail: let's dive into what is actually involved in providing a build that can be used for collecting crash reports. The first step is to create a build that has debugging information in the first place. Without debugging symbols, even a minidump is more or less useless, because human brains cannot work with raw pointers; there needs to be a way to unmangle them and track them back to actual functions, files and lines in those files for the report to be useful. After the build with debug symbols is done, Breakpad's dump_syms tool extracts the symbols and converts them into a format that is common across platforms. This is then uploaded to the server and allows the server to resolve the addresses back into their human-readable form.
The debugging symbols are stripped from the LibreOffice that is shipped to the end user, so it can be relatively lean and isn't bloated just because of the debugging symbols. LibreOffice is still a rather large package, but if it were shipped with debugging symbols it would be multiple gigabytes in size.
Once LibreOffice crashes, Breakpad creates a snapshot, if you want, of the state of the running process. We don't want anything to interfere with the creation of this snapshot, which is why we don't try to do anything else but create the minidump. The interaction with the user, whether to report the crash or not, is deferred to the next launch, in order not to mess with the state of LibreOffice after it crashed. The prompt the user gets looks like this: a simple confirmation dialog that asks, for each crash, whether you really want to send this information.
You can disable the reporting globally in the options; then it will never ask and never send a report. But even if you have enabled crash reporting in the options, you still have the choice not to send a report on each occasion. The dialog also offers to start LibreOffice in safe mode, in case the problem is not tied to any particular action but happens every time you launch LibreOffice. So that is another feature we have in here.
When the user agrees to send the report, LibreOffice sends the relatively small minidump file to the server. The server cheats a little bit, in quotes, by not processing it immediately; it just assigns a unique ID to the crash report and reports that back to LibreOffice. The user can use that link to go to the crash report site, and from there he, she or they have the option to create a bug report providing more information about the crash. Hopefully all reporters actually do this, because having steps to reproduce a problem is the most important thing that helps QA in processing a crash and judging its importance. If you use the functionality to create a bug report, we automatically include the necessary information, so that the crash reporting site can add a link to the Bugzilla ticket and Bugzilla has a link back to the crash report. With that, QA hopefully has enough information, or is at least able to ask the reporter, the person having the problem, for more information to track the problem down further.
The dialog with the response looks like this and is very simple. It is just the link with the unique ID that the user can visit, and hopefully they do and provide more details.
To go back to what is actually included in the minidump: I glossed over it and said it is a snapshot of the state.
It includes the files that were loaded, that is, the libraries and the executables. It also contains all the threads, the state of the processor registers and the stack memory at the point of the crash, plus some additional meta information like the processor used to run the system, the operating system and its version and, specifically for LibreOffice, of course the version of LibreOffice itself. We are also interested in whether OpenGL was enabled, what the driver for the graphics card was and what version it was, so we include this as additional information in the minidump.
So next: users report the crashes, and then what? This is where the QA team comes into play. Completely oversimplified, QA's role is to monitor the crash reporting site and keep an eye on clusters of reports that are new, which indicate that a regression was introduced in the latest version. The next step is to create test cases, or at least reproducible step-by-step instructions, to make tracking down the source of the problem easier and, of course, to be able to verify a fix later in the process. This is easier said than done, but hopefully the information in the stack trace about the code that was affected, together with a bug report that has some background on what the user did to trigger the crash, will make this a little easier.
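As a rough illustration of what "keeping an eye on new clusters" means, the following sketch counts crash signatures that show up for one version only. It assumes reports are available as simple `(version, signature)` pairs; that data layout, the sample signatures and the function name `signatures_new_in` are invented for this example and do not reflect how the crash reporting site stores or queries its data.

```python
from collections import Counter


def signatures_new_in(reports, version):
    """Count signatures reported for `version` that no other version has.

    reports is an iterable of (version, signature) pairs. A signature that
    only appears for the given version is a candidate for a regression
    introduced there.
    """
    reports = list(reports)
    seen_elsewhere = {sig for ver, sig in reports if ver != version}
    counts = Counter(
        sig for ver, sig in reports if ver == version and sig not in seen_elsewhere
    )
    # Most frequent first, so the biggest clusters are looked at first.
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        ("7.1.5", "old_crash"),
        ("7.1.6", "old_crash"),
        ("7.1.6", "new_crash"),
        ("7.1.6", "new_crash"),
        ("7.1.6", "rare_crash"),
    ]
    for signature, count in signatures_new_in(sample, "7.1.6"):
        print(signature, count)
```

A signature that is new and frequent is only a hint; somebody still has to find steps to reproduce it.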
Still, if there is no corresponding bug report, or if the source code information isn't enough to track down the problem or doesn't give you an idea, and instead of reading hundreds of thousands of lines of code you just want the practical brute-force approach, you can use bibisecting. Bibisecting is short for binary bisecting. It just means that we have a git repository with binary builds of LibreOffice for all active branches, and you can use the git bisect command to very quickly track down the change that caused a change in behavior, without having to compile a whole version of LibreOffice at each step. The benefit is that you can track down a problem without any knowledge of the code itself. You just need a way to reproduce the problem, and then anyone with a little disk space to host the repository can test one revision after the other. Each step tests the commit in the middle of the remaining range and discards one half, then does the same with the half that is left. So it narrows down pretty quickly: even if the number of commits you start with is very large, the number of steps is probably 10, 12 or 15 at most.
Then you have either a single commit or a few commits that cause the problem. With that information you can look at who made the commit and what they were trying to fix. More often than not there is an old bug report assigned to it, with the commit intended to fix that bug, and having that information makes it easier to judge what other actions might trigger the regression. And after you have tracked down the problem with the bibisect repository, even without knowing what the problem itself is, you have someone who committed code touching that area. That is a hint about who to poke, who to nag about the problem or who to ask for more insight.
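The halving described above is an ordinary binary search. The sketch below shows the idea in plain Python rather than with git itself; the function `find_first_bad` and the `is_bad` callback (which stands for "start this build and try to reproduce the crash") are names made up for the illustration.

```python
def find_first_bad(commits, is_bad):
    """Find the first bad commit in an ordered list by repeated halving.

    Assumes commits[0] is known to be good and commits[-1] is known to be
    bad, which is also what you tell git bisect before it starts.
    Returns the first bad commit and the number of builds that were tested.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        steps += 1
        if is_bad(commits[middle]):
            bad = middle   # the regression is at or before the middle
        else:
            good = middle  # the regression is after the middle
    return commits[bad], steps


if __name__ == "__main__":
    history = list(range(4096))          # stand-in for 4096 binary builds
    culprit, tested = find_first_bad(history, lambda commit: commit >= 3000)
    print(culprit, tested)
```

With 4096 builds in the range, this needs at most 12 test runs, which matches the order of magnitude mentioned in the talk.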
This helps a great deal in getting a fix done for the particular problem. And QA, while not necessarily doing the actual fix, is pretty good at providing test cases that prevent the regression from happening again. This depends on the type of problem, of course, but if it is a problem with a file format, for example, it is easy to add a sample file to the code and check that the problem doesn't occur anymore.
Next, some problems or limitations of our current deployment. It is hard to say whether this is really a problem or not, but people don't update to the current version, at least not as frequently as we would like. So we have lots of reports coming in from people using old versions of LibreOffice, reporting problems that are already fixed in current versions. This skews the overview towards old crashes, because the user base of a new version starts out relatively small and only grows as time goes on, and the old versions are the ones that still contain problems that have since been fixed. So just looking at the raw numbers doesn't tell you what is really happening with the current version.
The LibreOffice crash reporting website has some filtering options. The start page gives you a quick overview of how many reports were submitted for each version of interest, basically the final release versions, along with the reports assigned to each of those versions. But if you look deeper, you get all reports from all versions. So the filtering capabilities of the Django web app could use some love to make it more usable: to exclude versions from the listings, for example, or to focus on one single version to check whether something was fixed in an RC and not just on the main code line.
So far we also haven't removed any old reports from the crash reporting database, so it grew quite large over the years. While I don't have any immediate plans to clear out the old entries, at some point I will probably have to, to keep the size manageable and the database query times at a reasonable level.
Another limitation is that it is only available for Windows and Linux. But Mac users are a relatively small user base compared to Windows, just as the Linux user base is small compared to Windows. So I don't think it matters too much not to have macOS in there. Windows and Linux crashes alone are enough to deal with; there is no lack of reports coming in.
A further limitation is that it only works when the corresponding debug information is available on the server side. This in turn means that it is only available for builds that are done by TDF, by me basically, and only for the alpha, beta and release candidate builds. None of the tinderbox or daily builds have this integration; they don't have crash reporting enabled.
That is the technical side of the limitations. Then there is the other aspect I already touched upon: not all users go through the process of filing additional information, especially if they don't already have a Bugzilla account. That is a big hurdle for them, so they either don't visit the site to file a bug report at all, or they stop there and don't proceed further. And for those who do file a report, it isn't clear whether they can provide enough information to give steps to reproduce the problem. So a lot of the burden is still on QA volunteers to track it down, to find a way to reproduce it and to find a developer who can have a look at what's wrong. And it is just as it goes with all problems.
People either complain loudly about it, or they stay silent and complain to their neighbors but don't report it. There is no really easy solution for all this, because either you make it so easy that every spam bot can abuse your system, or you have some login system in place that scares people away. So it is always a balance you have to strike. We have enough reports coming in that I think it is reasonable to act upon them even if not everyone can add a comment. And a comment system that allows everyone to comment also gets you into data regulation problems, because then you have no control over what is filed, how people access this information and how it is shared. So having a system with its own account management, like Bugzilla, makes it a little easier.
And that is basically it for the prerecorded portion of the talk. I'll have enough time for questions, I think. But I don't want to leave without giving thanks and credit where credit is due. First and foremost there is Markus Mohrhard, moggi. He created the system back for LibreOffice 5.2, I think it was 2016, so quite a couple of years back. He did the bulk of the integration and wrote the Django web app, taking some inspiration from Socorro, the tool that Mozilla was using for their crash reporting. Norbert Thiebaud also did a lot of work on the infra side of things; he was setting up Gerrit and other development-related stuff back in the day. Then there were Riccardo Magliocchetti and, sorry, I probably butcher the name, Albuquerque Estimier, contributing to the server side of things back in the day. And of course none of this would be possible without the Breakpad utility being available. So I have a link there to their main repository, and of course links to our crash reporting site and our Bugzilla. And now I'm ready to take questions and switch over to live mode, and maybe do some live demos if necessary.
And of course, thanks to all the sponsors making this conference possible. Thanks.
I saw a question in the room regarding backtraces not being resolved against the symbols. This was a problem with the symbol extraction step. The dump_syms process frequently segfaulted when processing the symbols for 64-bit Windows, and our tooling didn't bother to check for any error during this processing. So the problem was that the debugging information on the server, which is needed to resolve the symbols, wasn't complete. This has meanwhile been fixed, and symbols should be resolved again; for current versions it should all be good. Of course, the old, original reports are not resolved, but even reports for old versions should now be resolved again, since I also re-uploaded the debugging information for old versions.
To add to that: the problem still happens. The dump_syms tool still segfaults frequently while processing the symbols, but now it is simply retried until it succeeds for the given DLL or EXE file, and then proceeds to the next one. So it is run multiple times until it succeeds. I didn't really bother finding out why it crashes or how the crash can be prevented, as long as a successful run gives the same result every time. So I'm confident that the data is correct when it works, and I only have to rerun it when the process segfaults.
And maybe just to show the crash reporting site as it is now: on the landing page you see the number of crashes submitted per version. Each of the differently colored lines represents a single version of LibreOffice, and you see that around 400 to 600 reports are coming in per version. You also see the slowly rising lines at the bottom; those are the new releases that slowly gather a user base. Still, most of the reports are from old versions. If you are actually using it, you can hover with the mouse; I'm not sure whether you can see the hover in the screen share.
There you should get a breakdown of the versions and the number of crashes submitted on that specific date. If you go to the version selection, you can select, for example, 7.1.6.2, basically the 7.1.6 final release. It's a little bit slow for me right now. The demo effect, okay. Now you see a listing of the crash reports and the signatures for that specific version. For example, here you see a Skia-related one that was reported 74 times for Windows in the last seven days. The timeframe can be chosen at the top. If you click on that, you get to the individual submissions, and you can click on one to get the stack trace, where you see the function signature and the source code file where it happens. There is also the cross-link to Bugzilla, and clicking on that gives you the bug report, where you can see the history. It turns out it was an optimization that uses processor features, the AVX instructions, without checking for actual support on the hardware. It was fixed pretty quickly then, and it is fixed for the 7.1 and 7.2 branches. And just to show, at the top, the information that the crash reporting site adds: this bug was filed from a crash report, and there you have the ID of the crash. This can be used to go back to the crash reporting site to see it. And this is basically what we collect: the version of LibreOffice, the ID that is used to uniquely identify the crash, the processor architecture, the operating system and the operating system version. And yeah, this is basically all of it. And