If you follow modern practices, test coverage analysis is a lie, plain and simple. The tools are deficient, what they report is a false positive, and that leaves you with a false sense of security, vulnerable to regression, and unaware that this is even the case. Let's look at the how and the why, and what we can do about it.

But first, a bit about me. I hate these slides and talks, but honestly, I need new clients. I've been coding professionally for 25 years, 16 of which have been in Ruby. I'm the founder of Seattle.rb, the first and oldest Ruby brigade. I'm the author of minitest, flog, flay, debride, ruby_parser, and about 95 others. I just checked RubyGems, and they're reporting 10.6 billion downloads; I'm at 114 million, which makes me one of the 1%. Oddly enough, one of my newest gems, called GitHub Score, one word, has 298 downloads. So if you could download that and help me become one of the 2%, I'd appreciate it. I'm a developer's developer. I love building tools for other developers, and I run Seattle.rb Consulting, where I do a lot of that. And this, I promise, is the most text on any slide I have, so it gets better from here on out.

So, setting expectations, something I like to do up front. This is a conceptual or idea talk, a proposal. It is meant for beginners on up; anyone should be able to understand this talk. I don't necessarily intend for beginners to take what I'm talking about back to their companies and bring it into their systems, but anyone should be able to follow it. It's one gross of slides, so I'm shooting for 30 minutes at an easy pace. So let's get started.

As a consultant, I get called in to help teams that are struggling with huge and messy implementations: big-ball-of-mud designs or, worse, design patternitis, the over-eager use of patterns before needing them. And of course, these are often coupled with too few tests, if any at all. That makes it dangerous, if not impossible, for me to work on the code. Either I have to pair with a domain expert 100% of the time, which isn't realistic for me or my client, or I work alone, and then my pull requests sit in limbo for months on end. And it's incredibly frustrating to be a fixer and not be able to enact change.

Being a tool builder, I have a load of tools to help me do my work. Flog points out where the complexity lies and lets me home in on the bad code quickly. Flay points out refactoring opportunities and lets me clean up the code quickly. Debride points out whole methods that might not be used anymore and helps me delete entire fields of code. And of course, minitest lets me write fast tests cleanly and provides a lot of extra plugins for functionality. One of those plugins is minitest-bisect, which makes it incredibly fast and easy to find and fix unstable tests.

But what if there are no tests, or too few tests? What do I do then? Well, I'm not getting very far without resorting to full-time pairing or improving the tests. But I don't know the system, I don't necessarily even know the domain, and I don't know what they haven't tested. I need a tool to help guide me. And that's done with code coverage analysis.

So what is code coverage analysis? It was introduced by Miller and Maloney in the Communications of the ACM in 1963, because everything that is good in computing is old, like me. In the simplest terms, perhaps too simple, it is a measurement of how much code your tests actually touch. But maybe a picture will help.
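Something like this minimal sketch, say; the Calculator class here is a stand-in for the slide's code, not the actual example:

```ruby
require "minitest/autorun"

# an overtly simple implementation: no branches, no extra complexity
class Calculator
  def initialize
    @total = 0
  end

  def add n
    @total += n
  end
end

class CalculatorTest < Minitest::Test
  def test_add
    calc = Calculator.new       # the test hits initialize...
    assert_equal 3, calc.add(3) # ...and add, so every line runs
  end
end
```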
Given an overtly simple implementation and the tests for it, you can see that the tests hit both initialize and add. And because there are no branches, no extra complexity in this code, because it's too simple, that means there's 100% coverage. Everything's been run.

There are a lot of different ways to measure coverage. I don't like these terms, but they're pretty prevalent in the industry, so let's do a quick overview; I'll go into them in more detail later. C0 is called statement or line coverage, C1 is branch coverage, and C2 is condition coverage. Function coverage is simply what percentage of functions were called at all. It's not terribly useful in my opinion, but it does have some utility; for example, debride could be considered a form of function coverage. Also, Chris gave a talk earlier today on deletion-driven development, which covered a very similar thing.

So why doesn't this work? C0, or statement coverage, measures which statements are executed, where, for some reason, a statement equals a line of code. In other words, what percentage of the lines of code were actually executed? This is actually a fuzzed example of my client's code.

Branch coverage: were all sides of all branching expressions exercised? Branching expressions are things like if or unless, case, rescue, anything where you might do a jump in the code execution. And, perhaps overkill, condition coverage: was every sub-expression of a condition exercised independently, meaning, can we ensure that an or/and expression behaves correctly with that precedence? There are many ways you can go about it, from the overly simplistic decision coverage (did you do both sides of an if?), to the exhaustive and exhausting condition coverage, with four times more tests that you're going to have to run, and a happy medium, which looks at the fact that you're dealing with two booleans and comes up with four cases to make sure you exercise it enough.

There's also parameter edge-case coverage: given a method argument with a known type, have you tested all the interesting kinds of data that type can hold? For strings, these might be nil, the empty string, whitespace, a valid format (for example, a date), an invalid format (something that doesn't parse right), single-byte strings, and multi-byte strings. Path coverage: if you're looking at the code in terms of the paths through a given method, have you walked all those paths? Entry/exit coverage is like function-call coverage for entry, but it also wants to exercise all explicit and implicit exits. State coverage is like parameter edge-case coverage: are the types of the states of your objects covered? This, because of the combinatorial complexity of data, gets out of hand very quickly.

So those are the types of coverage you could do. I want to give you a warning, though: there are metrics ahead, bad metrics. I didn't intend to fluff up my talk here with a Dilbert cartoon, but it turns out that Scott Adams spews borderline racist and sexist drivel on his blog, so this is not an image about gaming metrics in the workplace, thank you. Okay, so people get stuck, and by people I mean both engineers and managers, on the gamification of anything that involves numbers: doing whatever it takes to push up the score, even if it means having worse code. And to me, code coverage sounds like something that's really good and safe to have. It is a good thing, but it gives a false sense of security.
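To pin down why that line orientation is too coarse, here's an example of my own (not from the slides): a single line containing a branch, where one test yields 100% line coverage but only half the branches.

```ruby
require "minitest/autorun"

# one line, two paths through it
def discount price, member
  member ? price * 0.9 : price
end

class DiscountTest < Minitest::Test
  def test_member_discount
    assert_in_delta 9.0, discount(10, true)
    # C0 now reports the ternary line as covered, yet the false
    # side (full price) never ran: 100% lines, 50% branches.
  end
end
```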
Whereas having low coverage means that your tests are certainly insufficient, high coverage means nothing about the quality of your tests. You can still have bug-riddled code; coverage and quality are orthogonal to each other. But it seems to be a common misconception that high code coverage implies good testing and therefore good code. This is an actual quote given to an actual engineer in this room who was reporting a bug, an actual bug.

So, a simple proof: given the previous example and the associated test, if you take that assertion and simply remove it, you still have 100% coverage, but no verification that you're doing the right thing anymore. And that's where TDD can come to the rescue. By intentionally writing a failing test, then writing only what it takes to make that pass, and then refactoring where possible, you've ensured that you got the coverage you needed to make the test pass, but you've avoided gaming the numbers. This is a natural fix via a very simple process that has many other benefits, so you should do it.

So where is the state of the art for Ruby? Coverage is a standard tool that almost nobody knows about, and that is because it ships with Ruby; it's not a gem, so we don't care. It's fairly easy to use: you require it and tell it to start as early as you can in the run, then you load and run your code, and you grab the result at the end. When you ask for the result, you receive a hash mapping the path of the code in question to an array of nils and ints, where nil marks a non-code line, like comments, blanks, and end lines, and zero marks a line you haven't covered. But the problem with Coverage is that it's really not meant for use by users; it's meant to be used by other tools.

But let's take a moment to see how it works. Coverage has hooks into the Ruby VM itself. When you call Coverage.start, you set some internal state in the Ruby VM. Then any code that you load or evaluate after that gets laced with extra bytecode instructions to record the coverage everywhere, and as your code runs, each line that runs increments a number in that hash. Then you can call Coverage.result, which returns a copy of the data, turns the whole thing off, and clears it out. And that's problematic for what I'm trying to solve, as I'll show later on. There's also Coverage.peek_result, which returns a copy of the data and lets you continue, but the data stays the same, which is still problematic, as I will hopefully show.

So there's a tool called SimpleCov. How many people know SimpleCov? That's great. You're all doomed. Its usage is entirely equivalent: you require it, you tell it to start, you require your code and your tests, you let them run, and you're done. It uses Coverage internally, but it improves the output drastically. You get a nice overview with sortable columns, and each class has a detail page coloring the coverage so you can see the specifics. It doesn't seem to do much else, but that's enough to make Coverage usable, so it makes it a viable tool. Unfortunately, it has all the flaws that Coverage has.
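In code, the raw Coverage dance looks roughly like this (the file name is mine, and the exact counts will vary):

```ruby
require "coverage"

Coverage.start                # flips the instrumentation state on in the VM

require_relative "calculator" # code loaded *after* start gets instrumented

Calculator.new.add 3          # run something so lines get counted

p Coverage.result             # returns the data, then shuts coverage off
# => {".../calculator.rb" => [1, 1, 1, 1, nil, ...]}
#    nil = non-code line (comment, blank, end), 0 = never run, n = ran n times
# Coverage.peek_result would return the same data without stopping.
```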
So what are those flaws? I want to describe them, but first I need a tangent for a minute: we need to talk about the types of errors that exist. For obvious reasons, statistics is very concerned with errors, so they've classified the types of errors there are. The type I error, or "error of the first kind" as they so creatively named it, is also known as the false positive, which means you've detected something that you should not have: test X calls into class X, which calls into class Y, but you haven't verified Y's results at all, resulting in an erroneously high percentage. The type II error, or "error of the second kind" (again with the creativity), is also known as the false negative. It means you've not detected something that you should have: a similar scenario, but where you can't map a test class to its implementation, resulting in an erroneously low percentage.

An epiphany I had while working on this talk was that both type I and type II errors are errors in the numerator. This assumes that you've sampled everything you need to, and statistics, for some reason unknown to me as a layman, seems to assume that you always will, so there isn't agreed-upon nomenclature for when you haven't. I'm calling this a type III error, or an error of omission: a similar scenario again, but you haven't even loaded Z, resulting in an erroneously high percentage because you haven't factored Z's zero-out-of-whatever lines in. It's an error in the denominator.

Okay, so we've covered the types of errors. Before I get into how I think Ruby tooling sucks, I think it's only fair to say, in its defense, that I don't think this is necessarily particular to Ruby.

Let's get into specifics. How does Coverage suck, and by extension all tools that currently rely on it? I believe there are two kinds of type I errors, macro and micro; that there are no type II errors currently, though I intend on fixing that; and then there are type III errors. Unfortunately, in all of these cases I don't have any data to say what percentage is what or where they fall.

The type I error at the macro level: tests hit the implementation and do their thing, but you get coverage on something that's completely untested, and this is a huge source of erroneously high coverage. I see this on every project I go to. And because all of these tools are line-oriented, C0 is really insufficient in a lot of ways: any execution on any line marks the whole thing as covered, even if there are multiple paths through that line. For type II, simply because they deal in lines and files, Coverage and SimpleCov don't seem to have this type of error. But as I hinted, I'm going to show that they can exist, and that I intend on increasing them. For type III, it's entirely dependent on your sampling. You need to ensure that all implementation is loaded and therefore known to the coverage analysis. If you don't write, run, or load it, then you're going to have high numbers.

So what can we do to improve this? I just created a new gem called minitest-coverage, and I released it 25 minutes ago. It also uses Coverage, because I can't instrument the bytecode directly yet, which means it can suffer from the same problems. It does extend the C API, hopefully optionally, and it adds a setter to Coverage. I haven't been able to prove that this is actually needed yet, because, ironically, it's really hard to write tests for minitest-coverage.

But perhaps more importantly, it suggests a new strategy for even doing coverage analysis. The first strategy change is to record a baseline of all of your implementation not under test. This addresses type III errors. But what is a baseline? What does that even mean? Well, it's our minimum starting point, and it ensures that all of your implementation is known about. And that means ensuring that everything is loaded. We do that by loading all of the implementation but running no tests, and that's done easily with a simple glob and require.
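Something like this minimal sketch, assuming a conventional app/ and lib/ layout; the gem wraps this into a command-line tool, so the file names here are hypothetical:

```ruby
require "coverage"
require "json"

Coverage.start

# load every implementation file but run no tests, so every known
# line lands in the denominator with a count of zero
Dir["{app,lib}/**/*.rb"].sort.each do |path|
  require File.expand_path path
end

# prune anything that doesn't fall under the current project path
baseline = Coverage.result.select { |path, _| path.start_with? Dir.pwd }

File.write "coverage_baseline.json", JSON.dump(baseline)
```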
The coverage pruner that comes with the gem serializes out the result while pruning out all non-project code, anything that doesn't fall under the current path. This has been wrapped up into a command-line tool to make it easy.

The second strategy change is that it only records the coverage of the class under test, or the CUT, which means it simply ignores any coverage from calls outside that class. It does this by trying to map the test name back to the implementation name and modifying the hash based on path. But this is hard, because there's nothing enforcing that mapping, and people deviate all the time. Classes don't necessarily map to a file cleanly in Ruby; this is all by convention, and mistakes are made. Likewise, test names don't necessarily map to their implementations. So the code has to be pretty smart in figuring it out, or pretty damn hacky, which is where I'm currently leaning. Just look at this: this is a real example (client code, names fuzzed). How smart could a regular expression possibly be to figure this out and get it right? I'm still trying to come up with a smarter way of detecting this mapping. So that means this tool is biased towards false negatives, or increased type II errors. This can be easily addressed by cleaning up your naming.

So my client was looking at a Code Climate page, exclaiming, "But it says we have 83% coverage, and that seems pretty good." But with tests that don't load everything (because Rails), they were missing a lot of the denominator, and starting with a full baseline dropped that number down to 51%. But as I showed before, there were a lot of naming problems, so fixing some of those type II naming errors brought that up to around 62%. Still a stark difference from what they thought they had.

Okay, next. I'm not sure if this is useful yet; I would love feedback on it. minitest-coverage changes the runner to show each test class and report progress on each one. This also makes type II errors much more obvious, because it'll say that it can't map, and you won't see the percentage change as it goes.

And finally, my really basic report tool. Wow, I am going fast; I hope there are a lot of questions. It doesn't sort on percentage the way a lot of tools do. A lot of tools will tell you that you have seven files at 100% or whatever. It sorts by the number of uncovered lines per file, which pushes up, not the percentage, but effectively the size of the file times the uncovered percentage. This puts the emphasis on the amount of untested code you have, and where you can put your work most effectively to increase your testing.

So, what's left? I mean, I just released it, so why not do another release today? I want to be able to hook into SimpleCov's nifty HTML reporter, because it's pretty and people like it, and possibly their file format, so I can hook into other tools like Code Climate's reporting. I want to increase the ability to see coverage live in your editor; I'm essentially done with Emacs (that is a view of my editor), but I could use help with Vim and anything else. And I want better error handling for type II errors: either more smarts to avoid them altogether, or better reporting and suggestions on how to fix them, so that a beginner can quickly go and rename their files or their class names so that they map properly.
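That report sorting is simple enough to sketch, assuming Coverage-style data (a path mapped to an array of nils and ints); the real report tool surely differs in detail:

```ruby
# rank files by absolute count of uncovered lines, not by percentage,
# so the biggest piles of untested code float to the top
def report coverage
  coverage.sort_by { |_path, lines| -lines.count(0) }.each do |path, lines|
    uncovered  = lines.count 0
    executable = lines.compact.size # nils aren't executable lines
    next if executable.zero?
    pct = 100.0 * (executable - uncovered) / executable
    printf "%5d uncovered %6.1f%% %s\n", uncovered, pct, path
  end
end
```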
And I had some nifty ideas that I might want for the future. If you reset to a baseline and record coverage on a per-test basis, then you can map lines covered back to each test, and then you can use that information to show heavy overlaps and where they came from. And if you do that without the CUT filtering, recording everything you've touched, and you couple it with some sort of nifty visualization, then you can readily identify places to isolate, mock, or stub to have the greatest effect on isolating your tests.

So that's where I'm at. The code is available at the URL below. I'd love some help with it, and any feedback on it is more than welcome. And if you're in need of an experienced tool builder, code analyzer, trainer, or troubleshooter, please get in contact with me, because I'm available. Thank you.

That was way under time. I could have fit a whole other talk in there; I can improv one if you'd like. Are there any questions? All right, I'm going to repeat this, not so the audience can hear, but so that I can prove that I understood your question.

Are there any tools you can use to look at the diff on your current branch and figure out whether it maps onto untested code? None that I know of. I have heard of some tools that know how to work with streams of diffs, but I don't know if any of them have been applied to coverage analysis. That'd be interesting.

Does dead code affect the calculation when you're using a baseline? Absolutely. If you're using a baseline, you're going to know about the entire implementation, and therefore you're going to know about any dead code, and whether or not that dead code is tested is really going to affect the numbers. But given that, for a Rails app in production mode, that code's going to be loaded anyway, it should either be under test or be deleted. Especially given that there are so many entry points into any Rails method that you may or may not know about, it's better for you to make sure it's either tested or deleted.

Do you wind up with any load-order issues while creating the baseline? Yeah, sometimes. The code sample that I showed was setting Rails into production and then loading config/environment, which should go and load all the models and the controllers. And then I go through and load all the stuff under app and under lib anyway, in case something isn't known about. I don't know the actual loading mechanism in Rails, whether they just do it by path or by mapping through routes, so I went for overkill. The only place I've been bitten is when you hit a concern multiple times, because those will raise errors. I think what I wound up doing was some path filtering to avoid those, assuming they were going to come in via the include and const_missing mechanisms regardless.

Yeah, in the back right: have I considered the effects of running in parallel on CI? No, I have not. One thing this does do, which I probably should have illustrated (but it'd be more slides and more time), is that it's really happy doing multiple runs and then combining the results at the end. You already have to combine against the baseline: you create the baseline JSON file, then you do a run of your units, then a run of your functionals and whatever else you have, and then you report, listing all of those on the command line, and it does a hash merge over the whole thing.
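That merge can be sketched as a per-file, per-line sum; this is my reconstruction of the idea, not the gem's actual code:

```ruby
# merge several Coverage-style hashes: sum counts line by line,
# preserving nil for non-code lines
def merge_runs baseline, *runs
  runs.reduce baseline do |acc, run|
    acc.merge(run) do |_path, a, b|
      a.zip(b).map { |x, y| x || y ? (x || 0) + (y || 0) : nil }
    end
  end
end

# usage: combined = merge_runs baseline, unit_run, functional_run
```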
As far as actual parallelism goes, if you're running your tests in parallel or whatever, I have no idea if Coverage itself is thread-safe. I suspect it is, because the actual internal data structure is internal to the VM itself, and so I suspect there's only one access to it at a time anyway. So it shouldn't matter too much, and at that point it should be completely safe. But that's making the assumption that Coverage is thread-safe, so my problems are Ruby's problems. Any other questions? Cool. Well, hopefully I'll get some bug reports and feedback from you all soon.