So hi everybody. I'm a bit surprised that the room is pretty packed, so thank you for coming and thank you for having me here. Today I want to talk about mutation testing and the way we should take to leave the stone age.

A few things about myself. I am an iOS developer by day, and in my spare time I'm hacking on compilers; this is basically my hobby, and this project is a hobby as well. If you're curious, you can find me on Twitter, and I have two blogs where I write about low-level stuff, LLVM itself, and mutation testing. And if you are into software testing and the quality of software, I recommend you look at the last link, systemundertest.org. There friends of mine and I write about different software, how it's tested, and so on and so forth. So that was the shameless plug.

This is the outline for today's talk. I will start with the quality of software, then I will talk a bit about unit testing, then we will move towards mutation testing. Then I want to present the tool that I'm working on, and at the very end I will try to present a showcase, a real-life usage of this tool. I'm afraid I will not be able to give you some stellar examples, but it's going to be interesting, I hope.

I have been doing software for about seven years now, I think, and I enjoy it a lot. But one thing that concerns me a bit is the fact that software is just broken. Essentially everything is broken: browsers, compilers, operating systems, games, just anything. It's full of bugs, it doesn't work, and so on and so forth. So we as developers, as computer scientists and engineers, have been putting as much effort as possible, for decades, into improving software somehow, and there are some ways to do it. For instance, just today
we had a great talk about formal verification of programs. I don't know much about this topic, but as far as I know, we are not there yet. The tooling is great and it's improving, but it's quite hard to verify C or C++ because of undefined behavior and so on. Another great approach is fuzz testing. I think it has been around for a while, and now that we have better computational power we can apply it more efficiently. But I think there is no doubt that unit testing is the most widespread technique used by software developers, and it comes with a nice metric as well, which is code coverage. Today I want to focus just on this. I'm not going to talk about the flaws and advantages of fuzz testing or formal verification, but I definitely want to talk about unit testing and its problems.

So, how it works in general. We have some code, and this is an artificial example just to start. We have a function sum that takes two integers and returns the sum of them. For this function the developer writes a test like this: the assertion is that the sum of five and ten is greater than zero. That is absolutely correct, it's a valid statement, but obviously it's not enough, because five multiplied by ten is also greater than zero. If we run this test, we see that zero tests failed, one test passed, and the code coverage is 100 percent. So the code coverage metric might be misleading. I see that sometimes people strive for this number, but the test quality (not the code quality, the test quality) is not as good. And there is no way to measure it.
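The example from the slide can be reproduced in a few lines. This is only a sketch in Python rather than the C++ shown in the talk, and the names `add`, `multiply`, `weak_test`, and `strict_test` are made up for illustration. It shows that the assertion "greater than zero" executes every line of the function, yet cannot tell addition from multiplication.

```python
def add(a, b):
    """Function under test: returns the sum of two integers."""
    return a + b


def multiply(a, b):
    """A wrong implementation that the weak test cannot tell apart from add."""
    return a * b


def weak_test(implementation):
    """The assertion from the talk: the result for 5 and 10 is greater than zero."""
    return implementation(5, 10) > 0


def strict_test(implementation):
    """A tighter assertion that pins down the exact expected value."""
    return implementation(5, 10) == 15


if __name__ == "__main__":
    for impl in (add, multiply):
        print(impl.__name__, "weak:", weak_test(impl), "strict:", strict_test(impl))
```

Both implementations pass the weak test, so full line coverage says nothing about whether the assertion is strong enough; only the strict test separates them.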
Actually, there is just no way to assert how good our tests are. So here comes mutation testing. I will try to explain the algorithm behind it very briefly. If something is completely unclear or really confusing, feel free to interrupt me and I will elaborate.

What we normally do with mutation testing: we have a program and we have a test suite. We execute the test suite against the program, and if everything is good, if the tests are passing, then we do the following. We create a mutant out of the program, so we basically do some mutation on it, and it must be a semantic change: the program should be different. Then we take that mutated program and run the test suite against it again, and we check the result. If the result is a failure, we claim that this mutant was killed. This is essentially a good thing: if you run a test against a mutated program, the test should fail. If that's not the case, we claim that the mutant survived, and this is kind of bad. It means that something is wrong, either the code or the test.

So here again is our initial example with the test and the program. Out of this program we can generate many mutants, but I will show just a couple of them. In the first one we replace a plus b with a multiplied by b, and in the second one we replace a plus b with a minus b. If you run the test against the first mutant, it still passes, so the mutant survived, and that gives us a hint that something is wrong. The second mutant, however, was killed, and we are good to go here. Based on this example we can derive some numbers. The total number of mutants is two; one mutant was killed, one survived. Mutation testing also comes with a metric called the mutation score, calculated by the formula on the screen: the number of killed mutants divided by the total number of mutants and
multiplied by 100%. In this specific case the mutation score is 50%, and it's not good. Again, you should not strive for 100%, because you probably will not reach it anyway, but you can use it as a hint: a higher mutation score is better.

A few facts about mutation testing. Oh, spoilers. It was first proposed by Richard Lipton in 1971; I think it was called mutation analysis, so these are kind of the same terms. It was first implemented by Timothy Budd in the 1980s, I think for Fortran, but I'm not sure, so don't quote me on that. There are several quite recent studies showing that certain mutation testing systems, for certain programs, could help to find bugs, real faults, and quite a high number of them: like 70 and 90 percent, or 92 and 94 percent to be more precise. But despite this fact, mutation analysis was not widely adopted, because it has several problems as well, just like everything.

The first problem: it generates lots of data. For just 10 tests for a small program you can end up with 100 mutants, and if you execute those 10 tests against those hundred programs, you end up running the program 1,000 times in the worst case. The second one: it's time-consuming, because of that data. We have lots of data, we need to process it and execute it, and that takes quite a lot of time. Then languages: not all languages, I would say, are friendly to mutation testing. I think it's kind of easier for languages like Ruby or JavaScript, because you just have eval and, well, YOLO. But languages like C or C++ are not as friendly, because you essentially need to mutate the program, change the source code or maybe the AST, then compile it, link it together with everything, run it, and then assert. It's just quite slow, though I don't have real numbers. There is also the problem of
a human test oracle. Let's say we generated lots of mutants; we cannot just say whether they are good or not, so we need somebody to actually look at those results and make the decision. It also comes with another problem, but I will talk about it a bit later. And another problem, one I was surprised by: I heard almost exactly this phrase, "Excuse me, but I write good tests." I don't believe it. For me it sounds like "Excuse me, but my code is bug-free," which is, well, not the case. My code is bug-free, but not in general.

So, the tool is called Mull, not "mule". With this tool we are trying to solve some of those problems, but not all of them, and there are some things we do to improve the performance and make it better in general. We have a kind of smart mutant selection, which helps us decrease the number of mutants and the amount of data we need to process. Very similar is the second point: we provide a means to control the data generation. I will elaborate on it; it's a bit complicated as well. We also utilize runtime compilation, JIT in particular, and we use LLVM for that. It means that we don't need to link anything, we don't need to recompile things that much, and we completely eliminate the I/O bottleneck in this case. Because of that we are operating on the level of LLVM IR, which gives us some nice things but also comes with problems: sometimes you cannot map the IR back to the initial code. They might be different, especially with C++ with all its templates, inlining, and so on and so forth. And the great thing is that it is language agnostic, given that the language is built on top of LLVM.
So if something is built on top of LLVM, then it's likely that we can use it.

Time to elaborate a bit on how the system works. This is a typical program: we have several source files, bitcode files, modules, whatnot. They have some functions and instructions. If we just go and analyze the whole program as is, we will find many, many mutants, many places where we can introduce a mutation. But that's not very efficient, because maybe not most, but a big part of this code is just unreachable for whatever reason. So the better approach, and what we actually do: first we try to find the actual tests, and based on those tests we take the call sites, the called functions, and try to build a so-called call tree. Based on this call tree, as on this slide, you can see that some modules are not needed, so we can eliminate them completely. We don't need to compile them, we don't need to analyze them, we consume less memory, and so on and so forth.

This optimization is actually still experimental. It's a bit more tricky than I thought before and it has some problems, but it is definitely the way to go. For example, on this slide we took a subset of tests from LLVM itself, the target called IRTests, and we took about two hundred and fifty tests. Before, to process it with the whole system we needed to look at about 400 modules, and it took roughly 85 minutes. After this improvement, cutting out those unused modules, we got three times fewer modules and it works almost two times faster. But again, it is experimental. It's not in master yet, but we're working on it.
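The module-pruning idea described above can be sketched as a plain reachability walk over a call graph. This is not Mull's actual implementation, which works on LLVM bitcode; the function names, the module names, and the dictionary-based call graph here are all hypothetical and serve only to illustrate the idea of starting from the tests and keeping only what they can reach.

```python
from collections import deque


def reachable_functions(call_graph, tests):
    """Breadth-first walk from the test functions through the functions they call."""
    seen = set(tests)
    queue = deque(tests)
    while queue:
        function = queue.popleft()
        for callee in call_graph.get(function, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def needed_modules(call_graph, module_of, tests):
    """Modules that contain at least one function reachable from the tests."""
    return {module_of[fn] for fn in reachable_functions(call_graph, tests)}


# Hypothetical program: caller -> callees, and function -> module that defines it.
CALL_GRAPH = {
    "test_sum": ["sum"],
    "sum": ["checked_add"],
    "unused_helper": ["checked_add"],
}
MODULE_OF = {
    "test_sum": "tests.bc",
    "sum": "math.bc",
    "checked_add": "math.bc",
    "unused_helper": "legacy.bc",
}

if __name__ == "__main__":
    print(sorted(needed_modules(CALL_GRAPH, MODULE_OF, ["test_sum"])))
```

In this toy program `legacy.bc` is never reached from the test, so it would not need to be loaded, compiled, or searched for mutation points.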
So here is a more complex program, closer to reality. Here we can see that the functions on the left are tests, and they go some way into the program, to some distance. For the same subset of tests we took distance two; it produces about a thousand and a half mutants, and the real execution time that I measured on my machine was about one hour, which is kind of acceptable. On the other hand, if we took the whole program, the distance is 29 for this group of tests. As for the number of mutants, I think there is actually a bug here: it should be orders of magnitude more, maybe two hundred thousand mutants. The approximate execution time is roughly 11 days, but that's very pessimistic, so in reality it would take maybe six or seven days. So that still doesn't help.

And with this means to control the distance, even if it takes one hour, you probably cannot use it on your machine on a daily basis, because, well, it's kind of a waste of time. But you can still use it with nightly builds, for example.

There is another means of control: we can select some specific tests. Let's say you have 200 tests, but you are interested in one, or in some group of tests. So you just select one test and work with it, and you reduce the amount of work even further. I didn't put numbers on the slide, but I think the best case was 30 to 40 seconds for one test, and the worst case, for a group of maybe 15 or 20 tests, took 15 to 20 minutes, which is still reasonable, and it enables really fast iterations. So I think this is the best improvement that we did so far, the best kind of invention.

A few words about the system design, how it works in general.
So this is it in a nutshell. The program consumes a config file as input, which is just YAML; what exactly it is doesn't matter. It contains the list of bitcode files, which basically tells the tool where to get the modules to process, and it also has settings for the distance, some cache, and so on and so forth. What the program spits out is another config file and a SQLite file. That other config file is a reduced version of the first one. Let's say you initially provide a config with 400 modules and you find out that only a hundred of them were used; the next time you can run the program with that reduced config, and you will not need to process 400 modules but rather 100. So it also improves the iterations.

I mentioned the mutation score, but at this very moment the program does not provide any short report like "your mutation score is so many percent," because, again, as I said, I believe that we should not strive for numbers; we should just use them as a hint. What we provide instead is the SQLite file, which contains information about just everything, and that has its advantages. If I run something and get results, I can treat them in any way I want, and I don't have to restart the program and wait an hour or a week again. I just iterate faster. So this is the program from the outside, from the user's perspective.

Internally it consists of several modules. One is kind of the core. It has a driver that controls the basic system. It has a reporter, which will likely be extended or replaced in the future, so instead of SQLite you could report to some other output or whatnot. One essential part is the mutation operators. I didn't talk about them before, so I will briefly explain what they are. Every mutation testing tool has some mutation operators.
These are kind of rules that describe how you change your program. It could be: replace plus with minus, remove a void function, negate a condition, or, I don't know, skip a whole loop, or anything you can imagine. There are also some studies on mutation operators for Java, I think, and they go further and change the class hierarchies: they inject some classes, remove them, and so on, just to screw the system up even more.

The second part is also straightforward. It's the toolchain. Maybe that's not the best name, but it's just the JIT compiler from LLVM and the object cache. Once we have run the program, we usually don't recompile the biggest part of it on the next run, so it also improves the speed.

The most important piece, in my opinion, is the test framework. Basically, this is the last thing that is abstract, because it can be Google Test, for example: if you want to run the tool against C++, we have a driver and all the infrastructure for Google Test now. It could also be XCTest, for example, for Objective-C and potentially Swift. And just two or three days ago, I think, we merged the initial support for Rust. It's not production ready. I mean, the whole tool is not production ready, but with Rust you cannot yet just take it and apply it to any project. But we'll get there.

So, the showcase. Initially, a month ago, when I was planning the slides and so on, I wanted to run the tool against LLVM itself, gather some results, go through them, and probably find something interesting. But the problem I faced is that to actually make sense of those mutations I need to know the domain of the tool very well, which is not the case.
Unfortunately, I don't know most parts of LLVM, like APFloat and some other stuff. But the results are available online. They might be cumbersome, unclear, and maybe cryptic, because I am in this context and I don't understand everything there, and it's hard for me to know what people may need. So if you are into it and you are interested, you may take a look and give some feedback. I would really appreciate it.

So again, some numbers. For example, for IRTests you can see here that the mutation score is 43 percent, which is quite low, and I actually saw quite a few places that can be improved. But again, I cannot just do it easily and fast; I need to dive into this matter, which I didn't. The same goes for ADTTests: the mutation score is a bit higher, 66 percent, but it is still quite low, I would say.

So what I did instead: I took one part that is quite small, where the code is quite straightforward, so even I can understand it. This is TripleTest. I don't know if you're aware of what the triple is, but it doesn't really matter; it's just some group of tests. The mutation score is quite high, but I still wanted to see what's going on there.

The first thing I found is this kind of test. We have a triple, we change the architecture, and then we assert that some other architecture, derived from that initial architecture, is exactly what we need. What's happening under the hood, in methods like getLittleEndianArchVariant, is just a huge switch statement that maps things together. What we did is basically start removing switch cases one by one, and we found that many of them can just be removed and the tests are still passing.
So in this case, it just means that we need to add more tests.

The slides switched, I think, but okay. One could argue that this case can be found by code coverage, and this is absolutely true; it will be there, I'm pretty sure. But the nice thing about mutation testing in this case is that you get a report like this one, and you clearly see what was removed. Let's say you remove this case, and you see which tests were affected by it, so basically which tests you need to extend to cover this case. It took me just the whole flight from Berlin to Brussels, one hour maybe, to improve this a bit. I covered all the switch cases that were not covered, and I committed this, I think, last night. So it's in trunk now.

The second one is more interesting, in my opinion. The test is straightforward: we set some property of an object and then we assert that the property is exactly equal to what we just set. But this test will actually pass if we just comment out this line, just because ELF is the default value. So what the mutation did in this case: in setObjectFormat the body of the method was removed completely, and the tests were still passing. What I learned from this very specific example: if you have situations like this, it may make sense as a developer to cover at least two cases, and then it's very likely that you're good to go and you are not asserting a default value. So again, I added two lines and also committed it.

I don't check for default, initialized to non-default, in every test. Can you repeat, sorry? Now that you had a test case which tests the triple. Yes. And now you have... you have claimed this is true.
Yeah. And then in all the other test cases you initialized something which is not... Yeah, this is how it should work, but this is not how it works. Well, in this case you should actually have put MachO first and ELF second, because you still don't need the first setObjectFormat. So I think it's a bit trickier. Okay, I cheated a bit. Okay.

So ELF is not completely default. It's default for Linux. Yeah, it defaults on Linux, but it is different on OS X, and even when I run it on an OS X machine, the triple is initialized to the default value: when you give it the empty string, it thinks that it's a Linux machine. No idea why, but that's how it works. I didn't have this goal at all.

So, another example. It's not the code exactly; I think I found it in IRTests, but I might be wrong. The idea is that you have something like: if something, then we use the slow version of an algorithm, otherwise we use the fast version of the algorithm. We have some tests, and if we introduce a mutation and flip those branches, the test still passes. Again, one could argue that this is fine, but I would say no, because the only notion that something is slow or fast is basically the name of a function, and that is not always the case; there might be just some comment. And it might have been true when the developer wrote this code in the first place, but next year somebody fixes a bug in the slow version and it becomes not that slow, and somebody else fixes a bug in the fast version and it becomes slow. So in this case there should be some tests to measure performance, how slow and how fast they are. Maybe they are the same in performance and there is no need to write this code at all.

I think that's pretty much it about the showcase
I wanted to show.

As I said, the tool is still in progress, and we have some open questions. The first one is the most important, I think, and the most cumbersome: integration. Since the tool works on the level of LLVM bitcode, of IR, we need to get it from somewhere. To get the bitcode for the tests from LLVM, I needed to use CMake, then Ninja, then bloody shell scripts with sed, grep, and so on. So it's just not straightforward. If you want to use it on your project, plug-and-play, we're not there yet. This question is still open, and we are trying to find the best way to improve this.

Another problem is UX. It's still not clear how to do this properly. What we did initially is this SQLite file, and I wrote a bunch of Ruby code to generate pretty and nice HTML. It's really nice and it works for me, but I don't know how it works for other people. Again, the question is still open. We are trying to find a way to make the system usable for people.

Another one, and I think it's not a question anymore since we merged support for Rust: we were talking about which language we should try next. Right now the system works fine with C++, but I am afraid that the current system design can be biased by C++ itself. So we wanted to take another language to make the system flexible and, well, good.

And there are obviously many, many unknowns. We don't use this tool in production yet, and we don't have any users so far. So if you think that you're interested in it and you want to try, I urge you to contact me, and I would be more than happy to help, for free and so on, just for the sake of the tool.

The project is available on GitHub. It's open source.
We're developing it in an open way. So if you have some questions or you want to give it a try, I will gladly provide support for this. Just drop me a line at alex@lowlevelbits.org. For updates, you can either follow the project on GitHub or follow me on Twitter, where I am posting updates on the progress.

Questions, please.

Yeah, I think yes. I found one test that is quite lengthy and exercises another quite lengthy function, and there are many, like ten, assertions in this test, and they all check whether the method returns true. So it basically checks just one case with different inputs, but one case. Does it answer the question? Okay, thanks.

Yeah, please.

I think, well, it is compilable, because we control this stuff, so we try to make it compilable. All the tests and all the mutants run in a child process, and we control that process. So if the program doesn't work well and crashes, we know that it crashed, and if it happens to have some infinite loop, we have a timeout and we catch this as well.

There are some debates on this. In my opinion, it is good: the program is broken then, so we broke the program. I think it's a good thing, but opinions vary.

Yeah, I don't know. So we barely control this aspect. We just, again, make a change and, like, YOLO. That's it, that's our approach.

Yes?

Okay, so the question is whether we introduce some mutations on the IR level that are not possible on the source code level. Is that correct?
Yeah, more or less. So we are trying not to do this. Obviously we don't have any guarantees on this, but we are trying to avoid some mutations; we have some filters and some heuristics for that. For example, in the case of C++, lots of code comes from the standard library: it just was inlined and happened to be in the client code, basically. But if we mutate it, that doesn't make any sense, because it's unlikely that you have a problem in, well, in the standard library. That's not what I wanted to say, but you get it.

There was another question, I think. Yep, please.

So I got the impression that for some projects, if you want to get a better mutation score, maybe you start adding an order of magnitude more tests. Do you have any idea how much pushback you would get? Then every time you want to make a change, for every line of code you change you have to change ten times more lines of tests, and it slows down your project because of that. I know the mutation score is good, it means good tests. So again, it's good, but then those tests that he added should never be changed, because they should be right. And then if they're not, it's his fault.
Yeah, so I'm just wondering: any experience by anyone in practice?

So I don't have any experience in practice that can contradict or confirm your statement, but it can be, yeah, it can be so.

The execution, or the selection of mutants: is this schedule kind of static, generated up front, or do you do this as the testing progresses? So you could, for instance, look at whether we are reaching a threshold, in which case there's no point in running much longer.

Well, actually, you know, that's a good idea, but no. So the phases: the first phase is analysis. We gather all the information about the whole system, all the mutation points, mutants, and so on, and then we start generating them and executing. So that might actually be a good idea.

Yeah, but there is always the trade-off. You can miss something, and you should take a look and make a decision then.

Exactly, yes. That's the reason why I didn't manage to give a nice analysis based on LLVM itself: because I don't know many parts of it, I could not understand what's wrong, the test or the code.

Sorry, yeah. So, not yet. Well, we do have plans, obviously, but they are long-term, not something that we will do soon.

So thanks for the questions.