So, hello. My name is Michael Stahl, I work for Red Hat, and this talk is about automated testing in the LibreOffice project.

First a quick overview: I will start with a short introduction to the topic, then the main part is about the various kinds of tests that developers can run every day, on every build, to check that they do not add new regression bugs, and at the end we will have a short look at some even more obscure tests.

So why automated testing? In LibreOffice every release contains something on the order of 10,000 commits, and we change something like a million lines of code per release. The obvious question is: if we make that many changes, how do we avoid introducing new bugs, new regressions? A substantial part of our overall strategy for that is automated testing. Basically our goal is that developers find any potential bug they introduce before they push their buggy change to the repository. Part of that is that the developers themselves have to be able and willing to write the tests, because we don't really have a separate test organization or anything like that.

The next question is what requirements we have for tests. For unit tests the first requirement is that we want to use standard unit testing libraries, libraries that developers may already be familiar with even if they haven't contributed to LibreOffice before. We also want to be able to run all of the tests as part of the standard build: no developer should have to learn an extra tool with a complicated UI or whatever just to run the tests. The developer has to be able to simply do a "make check" and then all of the tests run automatically; if the tests are all successful they don't produce any output, and if any test fails it should be very obvious what exactly has failed.

The next requirement is that tests must be reliable. It's not very useful to have tests that fail randomly 1% of the time, because that trains developers to avoid running the tests and to distrust their results, which is obviously a scenario we want to avoid. The tests also have to run quickly: we don't want developers to wait 10 hours for a test result, because then they won't run the tests every day. Next, we want good defect localization from a test failure: if a test fails, it should be relatively obvious what exactly has failed, where in the office code the failure was triggered, and where to look for the bug that was introduced. And the last requirement is that a test should be debuggable when it fails: it should be easy to run the test inside a debugger so that the developer can easily investigate and find out what problem was introduced.

Now let's go through the various kinds of tests that we currently have. The first category is the CppUnit tests. These are implemented in C++ and use the standard CppUnit library. CppUnit is currently maintained by Markus Mohrhard, who is a very prolific LibreOffice contributor, has written a large number of unit tests with CppUnit, and has also written several frameworks for writing more tests, so that is a huge achievement.
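To give a rough idea of what such a test looks like, here is a minimal sketch of a CppUnit test; the MyStack class being tested is made up purely for illustration:

    #include <cppunit/TestFixture.h>
    #include <cppunit/extensions/HelperMacros.h>

    #include <vector>

    // the class under test; made up purely for illustration
    class MyStack
    {
        std::vector<int> m_aData;
    public:
        void push(int n) { m_aData.push_back(n); }
        int pop() { int n = m_aData.back(); m_aData.pop_back(); return n; }
        bool empty() const { return m_aData.empty(); }
    };

    class MyStackTest : public CppUnit::TestFixture
    {
    public:
        void testPushPop()
        {
            MyStack aStack;
            CPPUNIT_ASSERT(aStack.empty());
            aStack.push(42);
            // each assertion checks one particular condition
            CPPUNIT_ASSERT_EQUAL(42, aStack.pop());
            CPPUNIT_ASSERT(aStack.empty());
        }

        // register the test function with CppUnit
        CPPUNIT_TEST_SUITE(MyStackTest);
        CPPUNIT_TEST(testPushPop);
        CPPUNIT_TEST_SUITE_END();
    };

    CPPUNIT_TEST_SUITE_REGISTRATION(MyStackTest);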
These tests run in process: a single process executes both the test code and the office code being tested, and that makes the tests easier to debug.

We have various kinds of CppUnit tests. At the top level of the categorization we have unit tests: the simplest are tests of a single C++ class, and then there are tests that exercise a single UNO component via its API. At a somewhat higher level we have integration tests, which test several components together, or maybe an entire application like Writer. There are many different kinds of these, more than I could possibly list, but some are particularly interesting. We have the filter crash tests, which are a bit notorious: they essentially load a bunch of test documents and check that we don't crash while loading them, and these are all documents that were created to demonstrate previous security issues. The problem with that is that various antivirus products will complain about these files, so we actually store them encrypted in the Git repository and only decrypt them for testing; that is also why you should disable antivirus products during a build, because they cause spurious build failures, or you can use a configure option to disable these tests. The filter tests I will talk about in a bit, and then at the highest level we have effectively a whole-system test, the smoke test, which is only nominally a CppUnit test, because the actual test code is loaded from a document and is written in Basic macros; this test does some very high-level things, it even installs an extension and whatnot.

So now let's have a look at the filter tests. We have lots and lots of these and they basically all work the same way: they import a test file and check that some property in the file was imported correctly, then they export the file, import it again, and check the same properties again to verify that they round-tripped correctly. With these tests we can also validate that the written files are valid ODF or valid OOXML files, if you install the extra validators. (Yes, we do have a tinderbox that runs the validation; yes, you will get a mail the next day, I think; it's not a fast tinderbox.)
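A typical round-trip test looks roughly like this sketch; the Document type and the loadDocument() and saveAndReload() helpers here are stand-ins for what the real test frameworks provide, not actual LibreOffice API:

    // a sketch of a filter round-trip test; Document, loadDocument() and
    // saveAndReload() are made-up stand-ins for the real test helpers
    class RoundTripTest : public CppUnit::TestFixture
    {
    public:
        void testBoldRunRoundTrip()
        {
            // 1. import the test file and check that a property of interest
            //    was imported correctly
            Document aDoc = loadDocument("bold-run.docx");
            CPPUNIT_ASSERT(aDoc.getFirstRun().isBold());

            // 2. export the document, import it again, and check the same
            //    property to verify that it round-trips correctly
            Document aReloaded = saveAndReload(aDoc, "MS Word 2007 XML");
            CPPUNIT_ASSERT(aReloaded.getFirstRun().isBold());
        }

        CPPUNIT_TEST_SUITE(RoundTripTest);
        CPPUNIT_TEST(testBoldRunRoundTrip);
        CPPUNIT_TEST_SUITE_END();
    };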
Now, this is a rather confusing graph, but the story here is that the CppUnit tests are a growth industry. The blue line in the chart, which uses the scale on the right, is the number of makefiles; you can invoke every one of these individually if you, for example, have to debug one of these tests, and we currently have more than 200 of them. The bars use the scale on the left: the smaller bars are the number of CppUnit test functions and the larger bars are the number of individual assertions, where each assertion checks one particular condition. As you can see, we already had some tests in 3.3, but we have added lots and lots of these tests over time.

The next category of tests are the JUnit tests. These are, not surprisingly, implemented in Java, and they use the standard Java testing library, which is JUnit. I found two different kinds of these: one kind is simply unit tests for the UNO Java language binding, and these actually run in process; and then we have the so-called complex tests, which actually run out of process. How does that work? Basically, to run the test we invoke Java with the JUnit test, and that launches a separate soffice process, a whole LibreOffice instance, and sets up a remote UNO protocol connection between the Java side and the office side; the test's function calls go over that remote connection and the result values come back. There are two subcategories of the complex tests: a lot of them test a single component, so they are essentially unit tests, and then there are a few that test an entire application, creating lots of different things and testing them together.

So what is the growth situation for the JUnit tests? What you see in the beginning, from 3.3 to 3.5, is actually a measurement error: as it turns out, we didn't write new tests there; the tests previously lived in some custom test framework and were converted to JUnit. Since then not very many of these tests have been added, so the number stays more or less constant.

The next category are the UNO API tests, which are very much unloved by developers. They live in the qadevOOo module; they are also implemented in Java, but they do not use JUnit, they use a custom testing framework, and they also run out of process via the remote UNO protocol. These tests are notorious for their very obscure test code: it is hard to figure out where the actual test code that is being executed lives, how the test environment has been set up for a particular test, and so on. Another problem they suffer from is that they are essentially black-box tests written generically against a UNO interface, so they are not very thorough and do not take the specifics of a particular implementation into account. All of these are essentially unit tests of a single UNO component. So how does the growth look?
Basically there is no growth; nobody is adding new tests here. The numbers in this graph are not really comparable to the other graphs, which is why I've used different colors: there are several mechanisms for disabling these tests, and some of them are disabled, so it's very hard to find out statically which of them actually run and which are disabled. So I actually grepped through the log file to generate this data, and what is plotted here is the actual UNO interfaces that are tested and the actual number of components that are tested. What you can see is that the number is essentially constant; the only change at the end is that somebody split up the makefiles, and that was only done to make the tests run faster.

The last kind of tests we have are the Python tests. These are implemented in Python and use the standard Python testing library, which is called unittest. These tests actually run in process. They are a very recent addition and mostly the work of David Ostrovsky. We have two different kinds: there are some unit tests for the PyUNO language binding, and then we have unit tests for various things in Writer that are essentially the same things the JUnit tests do, just in Python. We don't actually have many of them because they are relatively recent; the first one was added in 4.1, and I'm not actually sure about the things I found in earlier releases, I should investigate that, I guess. We currently have eight makefiles for these and fewer than 50 test functions, so really not all that much.

So now we have seen the four different kinds of tests, and this is just a quick overview so you can compare how many of each we have relative to each other (the UNO API interface count is a somewhat different measurement). What you can clearly see is that we have more CppUnit tests than everything else combined, but we still have relatively many of these UNO API interface tests.

Now a very interesting question, having seen these different kinds of tests, is how well or badly they meet the requirements from the beginning. First I should mention that the first row is the historic VCL TestTool, and as you can see from how badly it meets our requirements, we actually removed this test infrastructure several years ago, very early in the project; this is basically the explanation why.

Now for the requirements: except for the UNO API tests, every kind of test uses a standard testing library, and we can run everything with make check, so that's nice. The CppUnit tests that are actual unit tests are sort of the best case for the project: they are reliable because they test only a very small unit, they run very quickly, and if one of them fails you know very well where you need to investigate, because the unit being tested is very small. They are also debuggable: we have infrastructure in the build system to quickly run a single test inside gdb or within Visual Studio, so that works nicely. At the other end of the scale for the CppUnit tests we have the filter tests. They do have the advantage that they are also easily debuggable and fast, but there have been some issues with their reliability: what can happen is that you write a test and it runs successfully on Linux but fails for unknown reasons on Windows,
or it fails on Mac, or it runs successfully on one Windows tinderbox and fails on a different Windows tinderbox, or it just fails randomly sometimes; we have seen cases like all of these. The problem here, and also the reason for the poor defect localization of these tests, is that the unit they are testing is too big. When importing a file the data passes through various different layers: for an RTF file, for example, it goes through the RTF tokenizer, then through the domain mapper, then through Writer's API implementation into the Writer core, and if something goes wrong anywhere you don't know where exactly it went wrong, so debugging a test failure does take some time, and that's not really ideal.

The next category are the complex tests. These are also quite fast, and if they really are unit tests then they provide good defect localization when they fail, but they also have reliability problems, although of a different kind: these problems come from the out-of-process nature of the tests. Because they use remote UNO calls, they run into the problem that the implementation in LibreOffice itself has lots and lots of threading issues, so you can get rather non-deterministic results at times, and that may lead to spurious test failures. The out-of-process nature also makes them very difficult to debug, because you have two processes, so where do you attach your debugger? You need two debuggers, and it's complicated.

The next category are the UNO API tests, and in addition to the reliability problems of the complex tests these have some more problems: in particular, some of the tests for accessibility-related interfaces are notorious for spurious failures that nobody has been able to track down yet. We haven't deleted these tests, though, because for many things they are the only tests that we have, and we don't want to give up the coverage. These tests also used to be full of delays and wait statements, but I think somebody has recently converted all of those, with the new VCL main loop improvements, so that they wait for idle instead; they should be running faster nowadays, although I haven't measured it. When it comes to defect localization they are not good, because of the obscure test code: if you have a test failure it's not easy to find out where the actual test code that failed even lives, so that's another problem.

Then the Python tests: so far they have proved to be reliable, but since we have only very few of them I wouldn't trust that yet; we don't really know whether they would still be reliable if we had as many as in the other categories. In terms of performance they are good, and they provide good defect localization. When it comes to debuggability they are somewhere between the CppUnit tests and the JUnit tests: the good thing is that they run in the same process, the bad thing is that you have two different languages, the test written in Python and the implementation under test written in C++, and that creates a bit of a problem. GDB at least has a couple of interesting helpers that allow you to print the Python backtrace and that sort of thing while you are debugging, but I don't think an equivalent for Visual Studio exists, so it's not ideal. So much for the requirements.

Now a short aside: one thing that also helps with testing is simply using assertions in the product code, to check that the state you are in is actually valid, that input parameters are valid, and that sort of thing, because all of these assertions, if they fail, will call abort, and that will crash your test process, which is then effectively a test failure.
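For example, a product-code assertion might look roughly like this; the function and its arguments are made up for illustration:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // a product-code assertion in a made-up function: if a test exercises
    // this code with an out-of-range index, the assert() aborts the process,
    // so the test fails and points right at this spot
    int getEntry(const std::vector<int>& rEntries, std::size_t nIndex)
    {
        assert(nIndex < rEntries.size() && "caller passed an out-of-range index");
        return rEntries[nIndex];
    }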
The problem with that, of course, is that if the test hasn't been written specifically to exercise this part of the code, then you don't necessarily know in advance where the problem is going to show up. We basically started with assertions in 3.4 or 3.5, something like that, and we have added quite a lot of them over time, so that's good to see.

So now that we have all these tests, it would be interesting to know how good the coverage is: how much of the code is actually executed during the tests. To find out, we use special builds with GCC options that instrument the code for test coverage, and then we run the lcov tool, which creates some nice web pages out of the resulting data. I think this is run daily, or maybe weekly, I'm not sure; you can go to the lcov.libreoffice.org web page, look at the results, and see which areas are being tested and which are not. I think all of this is thanks to Martin Hirst, who did the scripting work and whatnot to get this up and running.

So what does that look like? Unfortunately I found only four historic data points for this, and the graph looks a bit weird: the points are not equally spaced because the dates when these results were archived are not equally spaced. (Oh, it's in Jenkins? I should have looked there.) Anyway, seriously: on the left scale you see the number of lines of code; the blue bars are the lines that were executed during the tests and the red bars are the total lines, which is a bit more than 2 million. That is less than other figures you may have seen reported, because it only counts lines that contain actual statements, not lines with just an empty brace or something like that. The blue line is the coverage in percent: we started out at about 38% and are now at something like 42-43%, so the general trend is in the right direction.

The next thing I want to mention quickly: all of that was the tests that are run by make check. In addition we have some more complicated tests that we do run regularly, but they take longer, so it's more of a manual process. One of them is the so-called crash testing, where we import and export a huge pile of tens of thousands of documents; there is a separate talk about that at this conference. Then we have performance tests, which use callgrind to profile certain scenarios; these can be run with the make perfcheck target, and they also run automatically somewhere, probably with Jenkins. They take three or four hours, and you can go to a website and look at nice graphs of the results to see whether we have become slower somewhere. That is thanks to Matúš and Laurent, who apparently did the work there.

So that was basically the talk, and the lesson is clear: you should all go out and write more tests, because we don't have enough coverage yet. In fact, instead of listening to my talk you should have written a test; that would have been vastly more useful. So yeah, thanks. Any questions?

[Audience] There is assert, and there are OSL_ASSERT and DBG_ASSERT; which kind of assertion should we use?
Well, it depends. The question was which sort of assertion we should use in the code, and yes, OSL_ASSERT and DBG_ASSERT are essentially deprecated. The best way to assert something is the standard C assert macro, which has the benefit that if the assertion fails it will actually abort the process, so developers who do their debug builds with assertions enabled will notice and hopefully investigate. Then there are various other legacy assert macros like OSL_ASSERT and DBG_ASSERT, and these are pretty much all deprecated: they basically just print something on standard error saying that an assertion failed, everybody ignores that, and it's not that helpful. But there is still a use case for that sort of thing: when you have some condition that you want to check which is often an error, but you can't be sure it is an error. For that there is a non-deprecated equivalent, the new SAL_WARN, which replaces OSL_ENSURE and whatnot.

[Audience] How about writing a plugin to convert all the DBG_ASSERTs to SAL_WARN?

The problem with an easy hack to get rid of them is that some of them should be true asserts while others should be SAL_WARNs, and no compiler plugin or other automatic tool can tell which should be which, so this is actually a hard hack.

[Audience, partly inaudible] A suggestion: run with the warnings enabled, collect all the places where these legacy assertions fire and where a crash or bad dereference follows, and use that list to decide which of them should really become real asserts.

[Audience] The concern is that some of them may be real asserts which today don't actually assert anything, so we refrain from getting rid of them and keep those things around forever; at some point we might decide that none of them are real asserts, and for the few that are we would put them back eventually.

Yeah, the issue is that if you see a SAL_WARN you assume that it is not supposed to be a real assert, whereas if you see an OSL_ENSURE you see that it's some legacy old stuff and you know that whoever wrote it didn't think about the question.

[Audience] So should we go through them and reintroduce the ones that make sense, or start by deciding which should be an assert and which a warning?

[Moderator] I think we are very late, so we should stop this discussion now.
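To make the distinction from this discussion concrete, here is a small sketch; the function, its parameter, and the "sw.core" log area are chosen purely for illustration:

    #include <cassert>
    #include <sal/log.hxx>

    // a made-up function illustrating the two kinds of checks
    void updateValue(int* pValue)
    {
        // a real invariant: callers must never pass a null pointer, so a
        // violation is a programming error and should abort (standard assert)
        assert(pValue != nullptr);

        // a condition that is usually, but not certainly, an error: warn on
        // standard error and carry on (SAL_WARN / SAL_WARN_IF), instead of
        // the legacy DBG_ASSERT / OSL_ENSURE macros
        SAL_WARN_IF(*pValue < 0, "sw.core", "unexpected negative value " << *pValue);

        ++*pValue;
    }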