So, this is Stop Writing Tests, in which I will be somewhat provocative, but I'm also kind of serious about this: I think that while many projects are under-tested, we should nonetheless spend less time writing tests by hand than we do at the moment. But before we really jump into it, I want to start with an Australian tradition called an acknowledgement of country. This image is my hometown of Canberra; I live just off to the left, and work kind of in the central foreground at the Australian National University. But before the town of Canberra was here, and before the Australian National University was here, this was the land of the Ngunnawal and Ngambri peoples for tens of thousands of years. And I want to pay my respects to their elders past and present and their emerging leaders, and acknowledge that their land and waters were never ceded. The main body of my talk, though, is about testing, and so it behooves me to quickly define testing. It's that thing where you run your code and check whether or not it did the right thing. Usually we're either looking to find new bugs or checking for regressions. These are often kind of separate activities; even if we use the same tools for each of them, the workflows are often quite different. But fundamentally the checklist goes: choose inputs, run the thing we want to test, check that it did the right thing (or that it didn't do the wrong thing), and then repeat as necessary. So let's use an example. In deference to my friend David, who probably doesn't want to drink quite so much as this tweet would indicate, instead of talking about sorting and reversing a list, we're just going to look at sorting a list as our example. So here are some simple tests we might write for the sorted built-in, you know, if for some reason we've lost all trust in the Python core developers but still use Python. We can see here that if we sort the list 1, 2, 3, we get the list 1, 2, 3.
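A minimal sketch of those hand-written checks might look like this (the test names are mine, not from the talk):

```python
# Example-based tests for the sorted() built-in, written by hand:
# each case pairs one input with its expected output.

def test_sort_ascending_ints():
    assert sorted([1, 2, 3]) == [1, 2, 3]

def test_sort_descending_floats():
    # Sorting floats preserves the element type and returns a list.
    assert sorted([3.0, 2.0, 1.0]) == [1.0, 2.0, 3.0]

def test_sort_strings():
    # Strings sort too, not just numbers.
    assert sorted(["b", "c", "a"]) == ["a", "b", "c"]
```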
And if we sort the list 3, 2, 1, we get 1, 2, 3, even if they're floats, and sorting preserves the element type. And we can see that we can sort strings as well as numbers. So that's kind of nice. And if we're thinking "don't repeat yourself", well, like proper software engineers, we might use pytest's parametrize. So in this case, we've got semantically the same test, but we list out our input and output data and then have a kind of data-driven or table-driven parametrized test. This really helps reduce boilerplate. Okay, to be honest, when you've only got three cases, it hasn't helped much, but it makes it much easier to add further input/output pairs in the future. The problem is that this isn't really automated testing; like the Mechanical Turk of contemporary Amazon, or of the 1800s magician's trick, what we have done is not so much automated a process as hidden the human labor involved in it. And I'm going to claim that not only is this not particularly automated, it also doesn't scale particularly well. So what can we do that would make writing these kinds of tests easier? Well, one thing would be to ask: is there a way that would let us get away from having to define the output, so we only had to think up the inputs? Because here, remember, we have to define by hand what the correct result is for every possible input. So here, we only have to come up with the input, and by comparing it to a trusted equivalent sort function, we can automatically check that it behaved correctly. So we've already saved ourselves a chunk of work with this approach. And you might be thinking, how often do I have a trusted alternative version? Well, every time we're refactoring, the version before the refactor and the version afterwards should do exactly the same thing. Or if you have multi-threaded code, whether you run it with one thread or with many threads, it should do the same thing.
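Those two steps might look like this: the table-driven test with pytest's parametrize, and then a differential test against a trusted reference. The deliberately naive selection sort below is a hypothetical stand-in for "the version before the refactor", not code from the talk.

```python
import pytest

# The same three cases as a table-driven test via pytest.mark.parametrize.
@pytest.mark.parametrize("data, expected", [
    ([1, 2, 3], [1, 2, 3]),
    ([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]),
    (["b", "c", "a"], ["a", "b", "c"]),
])
def test_sorted_table(data, expected):
    assert sorted(data) == expected

def trusted_sort(data):
    # Obviously correct, if slow: repeatedly extract the minimum.
    out, rest = [], list(data)
    while rest:
        smallest = min(rest)
        rest.remove(smallest)
        out.append(smallest)
    return out

@pytest.mark.parametrize("data", [[1, 2, 3], [3, 1, 2], [2, 2, 1]])
def test_sorted_matches_trusted(data):
    # We only invent inputs; the expected output comes from the oracle.
    assert sorted(data) == trusted_sort(data)
```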
Or you might even have, like, a mock version of your database that fits in memory instead of being a distributed system. You can check the equivalence of those kinds of things as well. But even if you don't have that kind of thing, all is not lost. We can leverage particular properties (that's why it's called property-based testing) of our functions. And so for the sorted built-in, we know that no matter what the input is, the output should always be in ascending order: if you take the pairs of adjacent elements in the output, then the first should always be less than or equal to the second in every such pair. Do we think that this would be a sufficient test for the sorted built-in? I'm going to go with no, because "return the empty list" is a great performance optimization which would pass this test. So we might want to say, well, if the output is in order, and we have the same number of elements, and we have the same set of elements as before, then we know that we've sorted the list correctly. Does this one seem right? It turns out this one is also kind of subtly buggy, because if we had the list, for example, [1, 2, 1], we could replace it with the list [1, 2, 2], which would have the same length, the same set of elements, and be in sorted order, but would not be a correct sorting of the input. So we could use the mathematical definition: the output is a permutation of the input such that it's in sorted order. The only problem with that one, though it is a fully correct test for sorting, is that checking it directly is hideously slow. So we'd really want to use the collections.Counter class instead, and I think that makes a pretty good test for sorting. In the process, we've kind of rediscovered the idea of property-based testing: we can check whether our code is buggy without needing to know exactly how to re-implement it.
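Those candidate properties can be sketched like this; the Counter comparison is the one that closes the [1, 2, 1] loophole:

```python
from collections import Counter

def is_ascending(xs):
    # Every adjacent pair of elements is in order.
    return all(a <= b for a, b in zip(xs, xs[1:]))

data = [1, 2, 1]
out = sorted(data)

# Order alone is too weak: the empty list is trivially "in order".
assert is_ascending([])

# Adding len() and set() is still too weak: this fake output has the
# same length and the same set as [1, 2, 1], yet is not a sorting of it.
fake = [1, 2, 2]
assert len(fake) == len(data) and set(fake) == set(data) and is_ascending(fake)

# Comparing multisets with collections.Counter closes the loophole.
assert is_ascending(out)
assert Counter(out) == Counter(data)
assert Counter(fake) != Counter(data)
```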
And in this case, sorting is fully specified by just these two simple properties: that the output should be in order, and that the result has the same elements as the input. I do want to note, though, that partial specifications, even if you can only test one of these, or you can test for some kinds of bugs but not others, are still super useful, and more tractable on more complicated, business-logic-y code. The remaining problem with this is that no matter how hard you think, many bugs are actually caused by the interaction of our code with inputs or situations that we never expected or never thought of. And that means that the bit where we have to write the test cases by hand (you know, [1, 2, 3], [3, 2, 1], "bca") is where we're just going to be stuck, because if we write tests for all of the things we can think of, kind of by definition, we've probably also written code that handles the things we can think of. And that's where my library Hypothesis comes in. The job of Hypothesis is that if you describe what kind of data should be possible, Hypothesis will find particular examples that you wouldn't have thought of. And so here, we've written exactly the same test body, but instead of a pytest parametrize decorator, we're saying: from Hypothesis, use the given decorator to provide inputs, and the argument that it provides should be either a list of mixed integers and floats, or a list of unicode strings. And even if this seems like a pretty direct translation of what we were doing before, this test will fail. And this test fails because there's the floating-point value NaN, "not a number", which compares unequal to itself. And so it turns out that if you try to sort a list containing NaNs, Python will sort each of the sublists between the NaNs, but won't reorder anything across a NaN. Which is kind of wild, but it's unclear what else the behavior should be. So the short version is: I want you to be able to adopt property-based testing, and then to actually do it.
The foolproof plan for that: you just pip install hypothesis, you skim the documentation, and then you find a lot of bugs. I hope you're into that kind of thing. To be more specific, Hypothesis has minimal dependencies, just two pure-Python libraries that we use for some data structures. And it works on every version of Python supported by the Python Software Foundation, from 3.6 to 3.10. It can be conda installed if you like conda, and so on. So let's have a look at how you would actually write some tests. And when I say how you would write some tests, the problem is that you still have to write the whole thing by hand, and I'm not into that. So let's look at the Ghostwriter, a way to let Hypothesis write your tests for you. Once you get around to pip installing hypothesis, you can check out the hypothesis command-line interface. And if you ask Hypothesis to write your tests for you, you'll see that there are a bunch of different kinds of things you can do. You can write tests based on type annotations, though they're strictly optional; there are a bunch of examples you can write in pytest or unittest style; but let's just jump in. Let's write a test for our sorting function. Of course, that should be "sorted". And so Hypothesis spits out a test which reminds us that the test I wrote by hand earlier neglected to consider the key function and the reverse flag. But in the body of the test, the Ghostwriter doesn't have any particular knowledge of sorting, so it just calls the function and hopes it doesn't crash, which is a pretty good start. I would usually use this as a template to extend myself. But you could also tell Hypothesis that the sorted function is idempotent: that is, if you call it on its own output, the result should be the same as the first output. And so there we are; that's a complete test for sorting. We could also test that two functions are equivalent.
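Paraphrasing the kind of test the Ghostwriter emits for idempotence (the real output also exercises the key and reverse arguments; this is a simplified sketch, and on the command line it would come from something like `hypothesis write --idempotent sorted`):

```python
from hypothesis import given, strategies as st

# Idempotence: sorting an already-sorted list changes nothing.
@given(ls=st.lists(st.integers()))
def test_sorted_is_idempotent(ls):
    result = sorted(ls)
    assert sorted(result) == result

test_sorted_is_idempotent()  # runs against the generated examples
```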
Now, I don't have a trusted equivalent to the sorted function from the standard library, but the eval built-in should behave the same as the ast.literal_eval function for every string which represents a Python literal. So let's see what that test looks like. Hypothesis spits out a test where we have our global and local namespaces, which can be None or, if you want to upgrade the test, a dictionary or a namespace of things. And then we have the node_or_string and the source arguments; in this case, that's because the names of the arguments to eval and to literal_eval don't actually overlap. So you need to edit that down a little, and probably use my Hypothesmith project to actually produce valid source code, but I feel this is a pretty good start. The other cool kind of properties, and this is where the name "property-based testing" originated, in Haskell about 20 years ago, were properties of things like binary operators: associativity, commutativity, identity elements, and so on. To be honest, I don't use this very often, but if your functions do have these kinds of properties, they should be really easy to test. We can also have a look at round trips, and this is the last general category of property. In particular, this is the one that you should probably all think about going away and using. A round-trip property is where you call a function, and then you call some other function that undoes it. So in this case, Hypothesis goes: well, if I compress and then I decompress, having found that matching name in the same module, then I should get back the original input. This is a really general and really powerful design pattern for tests, because we do it all the time: we encode, then we decode our data; we serialize it and we deserialize it; we save it to a database and retrieve it from a database; we send it across a network and receive it from a network. And in each case, these round trips tend to cross a lot of layers of our stack.
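A hand-written version of that compress/decompress round trip, using zlib as a concrete stand-in for whichever compression module the Ghostwriter matched:

```python
import zlib

from hypothesis import given, strategies as st

# Round-trip property: decompress undoes compress for any byte string.
@given(payload=st.binary())
def test_compress_roundtrip(payload):
    assert zlib.decompress(zlib.compress(payload)) == payload

test_compress_roundtrip()
```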
They tend to operate on our core data structures, and they are often absolutely crucial to get right. If saving your data to your database and bringing it back out gives you different data, you have a critical problem on your hands. So let's look at a more complicated case than just compression. How about JSON encoding? We'll dump an object to a string and then load it back in. And in this case, we can see that Hypothesis has found all of the optional keyword arguments to JSON encoding and decoding, which I usually ignore; it turns out there are many of them. So I've put together a short test file, which I edited down from that by hand, where I just cut out a bunch, and I said that the object is JSON. And that's defined recursively: the base case for JSON is None, True or False, a number, or a string; or JSON can be lists of any JSON, or dictionaries of strings to JSON. Sounds good. Anyone think this will pass, or will it fail? It fails. And here, Hypothesis actually shows us two distinct failing examples with different causes, which I think is pretty cool. The first is that if we allow NaN, but then we pass NaN, our assertion fails, because NaN is not equal to itself. Well, that's going to be easy enough to fix. But the other one is that if allow_nan is False and our JSON object is infinity, then we get this ValueError. And when you go and dig into it, it turns out that the JSON spec doesn't actually allow non-finite numbers, and if you set the confusingly named allow_nan flag to False, then Python will reject non-finite numbers in your JSON encoding. So to fix that, we'll try using the Hypothesis assume function. First of all, we'll say, okay, we just always allow NaN and infinity, and then we'll assume that the object is equal to itself. If that's false, it's kind of like an assertion, but the error just tells Hypothesis that that was a bad example: try something else.
And if we rerun pytest on this one, what do you think we're going to see? It still fails. I was very surprised when I first found this one, putting together a talk demo. It turns out that if you have a list containing NaN, it compares equal to itself, because lists have a number of performance optimizations in equality where they try to short-circuit things. If one list is the same as the other by identity, then it will always compare equal by equality as well, as a performance optimization. And for each element, if the corresponding elements are equal by identity, it won't bother comparing them by equality, to save on deeply nested comparisons. This is usually great, but when you have JSON and NaN, it gets pretty confusing, because it turns out that even if you call list() on a list containing NaN, so you get a different list object, the element is still the same object by identity, and the lists compare equal, unless you round-trip them through JSON. It's kind of bizarre. So the proper way to fix this is actually just to tell Hypothesis not to generate NaN. If we wanted to test the semantics of JSON round-tripping with NaN allowed, we might write a more complicated test. And if we run this one, it passes, and considerably more quickly, because we're not spending time finding that minimal counterexample. All right, so much for the Ghostwriter. You may, however, want to port some of your existing tests, rather than just throwing out absolutely everything and starting over. So I'm going to walk you through what it might look like to take tests for something like git, which I think of as kind of business-logic-y, in that there's a lot of just weird, arbitrary behavior in git that doesn't match any clean model: it deals with state, it deals with files; there's not a clean algorithmic specification that you could really write for the user-facing part of git. So let's start with this test, which says that if you check out a new branch, that makes it the active branch.
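The fix, plus a demonstration of the list-equality identity shortcut just described; with generation restricted to finite floats, this version actually passes:

```python
import json

from hypothesis import given, strategies as st

# The list-equality shortcut: NaN is unequal to itself, yet lists
# compare their elements by identity before trying equality.
n = float("nan")
assert n != n
assert [n] == [n]         # same NaN object, so the identity check wins
assert [n] == list([n])   # a copied list still holds the same object

# The proper fix: just don't generate NaN (or infinity) at all.
finite = st.floats(allow_nan=False, allow_infinity=False)
json_strategy = st.recursive(
    st.none() | st.booleans() | finite | st.text(),
    lambda inner: st.lists(inner) | st.dictionaries(st.text(), inner),
)

@given(obj=json_strategy)
def test_json_roundtrip(obj):
    assert json.loads(json.dumps(obj)) == obj

test_json_roundtrip()  # passes, and quickly
```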
The test goes: we set up a temporary directory, we initialize a repository, we check out "new-branch", and then that should be the active branch. Well, the first thing we could do to make this a little clearer is to pull out the branch name as an argument to the test. We set it as a default value, so there's no semantic change, but this does make it a little clearer to the reader that the specific value of the branch name shouldn't affect the test. And then, if you want to start using Hypothesis, you can say: hey, Hypothesis, generate a branch name, which will only ever be "new-branch", and run the same test body. Still no semantic change. And then you can pull that strategy out into a function shared among your test suite. If you've only got a single test, okay, this doesn't help much; but if you've got many tests, it means that you can share improvements, or discoveries about what kind of data should be valid for particular sorts of inputs or models, between all of your tests. So you get a kind of M + N rather than M × N scaling problem when you change things. And now we come to the tricky bit, where we actually have to think about what valid branch names are. Because if you run this, you'll discover that the empty string is not a valid branch name, that a whitespace-only string is not a valid branch name, that git branch names can't start or end with dashes, and a whole bunch of other complicated constraints, which you can check in the git manual. So for simplicity, we could say that a branch name should consist of only ASCII letters, between one and 95 characters long. If you go over 95 characters, then certain web hosting services start to reject your branch name as being too long. That was a fun one to discover. And then we could come back and look at this and ask: do we really mean that this should only ever be true with a newly created repository?
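The simplified strategy from that step might look like this. The name branch_names and the shape-checking test body are mine; the real test would drive git checkout in a temporary repository rather than just inspecting the string.

```python
import string

from hypothesis import given, strategies as st

# Simplified valid-branch-name strategy: ASCII letters only,
# 1 to 95 characters (some hosting services reject longer names).
branch_names = st.text(alphabet=string.ascii_letters, min_size=1, max_size=95)

@given(name=branch_names)
def test_branch_name_shape(name):
    # Stand-in assertions; the real test would run `git checkout -b`
    # with this name in a freshly initialized temporary repository.
    assert 1 <= len(name) <= 95
    assert name.isascii() and name.isalpha()

test_branch_name_shape()
```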
So the final test that I'll be migrating towards is something like this, where we say that given any valid branch name and any repository, if we assume that the branch name is not already a branch in the repo, then when we check it out, that branch name should be the active branch of the repository. Sounds pretty good to me. And I actually find this test a lot easier to read, as well as a lot more rigorous, compared to the starting point. The final thing I want to talk through is coverage-guided fuzzing, and this is where we get a little smarter. The Hypothesis engine by default, as you would use it when running your tests or in CI, has a combination of some feedback, some really good heuristics, and a lot of random search. Coverage-guided fuzzing basically adds an evolutionary or genetic algorithm on top of that. And HypoFuzz is designed to complement your CI-based workflow: your CI workflow or your local tests can be dedicated to searching for regressions, and you can use this more powerful approach, with extra feedback, to search for new bugs. It's also got this nice feature where, because it uses the same database as Hypothesis to save all of the failing examples, reproducing anything it finds can be as simple as literally just running your tests locally. And it can pull that out of the local file system, or Redis, or whatever else you want to use. Let me pull up a live version of that. So here's the live HypoFuzz dashboard, where I've set it running on one of my own projects. And if I just zoom in on the early part of this, you can see that there's this kind of classic pattern where we logarithmically approach whatever our steady state seems to be. But if we really zoom in on one of these, we can see we're still discovering new behavior, new bits of branch coverage, as we go.
The big difference that coverage-guided fuzzing makes is that when we discover one of those very rare branches by chance, we can then try variations on whatever that input was, because we noticed that something was different. If you look on a log axis, you can kind of see that you get this more or less straight line, plus the leveling-off later. If we scroll down: this has been running for about half an hour now, and it turns out that this was in fact sufficient to find a bug in one of my libraries. Yeah, I'll go fix that later. So that's my talk, where I wanted to argue that "stop writing tests" doesn't actually mean stop testing; it means that we can hand over much of the job of testing to better tools and better libraries, and spend perhaps a little less time while still getting a great deal more rigor. So thanks very much, and I'll see you in the chat for Q&A.