So just before we start, there is an announcement that Brenda Wallace is ill, so there will be a schedule change. The talk in this room after lunch will now be Hear No Evil, See No Evil, Patch No Evil by Graham Dumpleton. So without further ado, I'd like to introduce Clinton, who is joining us all the way from Brisbane and works for Bloomberg, and he'll be talking to us about getting more testing with fewer tests.

Good morning everyone. I love the setup with the little peanut gallery over here with all the special people. Those two in particular. So like Lee, I am from Brisbane. My talk will be substantially different: there are no feral cats in my talk, and there's no colour as well. This might be some of my personality coming through. So is that audio a little bit high? I'm not good at talking at a moderate volume, I'm afraid. So if anyone's got any questions, please raise your hands and I'll answer. Over here I probably won't see it, so I'll just ignore that. So if you're starting to lose me at any point during the talk, raise your hand, I'll try to get everyone on the same track, and I'll repeat the questions so we don't have to run around with microphones.

Yes, so testing is a good thing. We should be doing more of it. It sucks, it's hard to do at certain points and times, and you sort of get yourself into the wrong mental mode for doing testing. And that's the thing that I want to concentrate on, at a little bit more of a meta level. So essentially this talk is about unit testing. That's where this particular framework and this particular style of testing is most relevant, but it can work for the other stages, integration and system level testing. But we're definitely going to be focusing on unit testing. It's really hard to remember to come over here and give eye contact to you guys, so if you're definitely on the shy, introverted side, go sit there.

So unit testing, the basic unit testing loop: obviously you write your code, you write a test case, you run your test case, it passes or fails, and then you do it again and again and again and again. As engineers (pretend engineers is how I introduce myself; I have worked on hardware stuff, but when I'm a software person I'm a pretend engineer), whenever you're doing something over and over and over again, it's the sort of thing you should be automating, because that's what we do the rest of the time. When you're writing test cases, there are two sorts of test cases that you're writing: your generic cases, all of the inputs that you expect your functions to get, and then your corner cases, or as I call them your user cases, because users put in all sorts of interesting values that you do not expect.

So as a really simple motivational example, a simple little function to give you back the average of a list of integers. I'm trying to write all of these code snippets in a way that eases reading, not necessarily in a way that is PEP 8 friendly, so little things like that. We're pretty close to the sea, so if there are any PEP 8 violations, go take a swim. It's actually really hard writing code for slides, trying to minimise the number of lines of code while keeping them easy to read, and doing the Pygmentisation stuff. So, a really simple function: it takes a list of integers and gives you back an average. Now just as a show of hands, can anyone see any problems with this function?
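[The slide code isn't captured in the transcript; a minimal sketch of the function being described, with the name average assumed, would be:]

    def average(numbers):
        # Add up a list of integers and divide by how many there are.
        return sum(numbers) / len(numbers)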
So here's the typical test case that you might write for such a function: you import your code, and, as you see on line six there, we give it a particular input and we check for a particular output. Because it's floating point, we've got to worry about floating point stuff. This is not a floating point talk; I don't want to hear any comments about absolute precision versus relative precision. I'm never going to give that talk. Someone should give that talk; I'm not going to do it. But you're dealing with numbers, so you're dealing with floating point, so at some point in time you have to think about these things.

Property-based testing is a little bit different. Property-based testing is the language-agnostic term for this style of testing. Hypothesis is the Python implementation of property-based testing, and a lot of these ideas you can use with different languages, but Hypothesis, in my experience, is the nicest library for doing property-based testing across all of the different languages that I've seen. So essentially you write one test case, Hypothesis will generate random or semi-random inputs for you, you tell Hypothesis what properties should hold between the inputs and outputs, and Hypothesis will verify that those properties hold. So the workflow is that you write your code, you write your properties which describe your code, Hypothesis generates the data, it runs the test case and then it checks the properties.

And something that's really important here: as with handwritten unit tests, you need to check your code coverage after you've written a whole bunch of test cases, because that's the obvious way of figuring out which parts of your code you haven't tested, and Hypothesis does not stop you from having to do code coverage at all. But I thought I should throw that in as a particular thing to watch out for. And I think the other point is that using something like Hypothesis doesn't actually make testing easier; in a certain regard it makes testing harder. When you get into a mental state of writing test cases, you often get into a rut where it's like, well, I'm writing test cases again, I've been here before, and it's like, oh well, this function takes a list, so I've got to write a test case where it gets passed an empty list, and then a test case where it takes one item, and then another test case where it takes 15. And you've gone down these routes so many times before that, in essence, your brain switches into neutral and you stop thinking. Any time your brain stops thinking, you don't need your brain, and you can automate it.

So here is a very basic property-based test for my wonderful average function. On line one, we import the given decorator. The given decorator, or the function given, is the main API that you'll be using to hook into Hypothesis. The other APIs that you'll be using a lot are the different strategies. Essentially, for every Python data type, there is a strategy that will help generate random, semi-random values for that data type. For integers, for example, there's a strategy that will produce interesting integers that you always want to test in your unit tests. So it will generate your zeros, your ones, your minus ones, your stupidly big numbers, your stupidly little numbers, and your special numbers, which we'll see. And there's a list strategy which will produce an empty list, a list with one thing, a list with just a few things, and a list with a lot of things.
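[A hedged reconstruction of the test being walked through; slide line numbers won't match, and the module name average is an assumption:]

    from hypothesis import given
    from hypothesis import strategies as st

    from average import average  # assumed module layout

    @given(st.lists(st.integers()))
    def test_average(ints):
        result = average(ints)
        # A weak property: the average must sit between the min and the max.
        # Running this reveals the empty-list failure described next.
        assert min(ints) <= result <= max(ints)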
We're importing our code to test. Line six is essentially where we're making the Hypothesis call. We're saying here is our test case, called test_average, and we're using two strategies in tandem here: we're using the list strategy to generate lists drawn from the integer strategy. Then we call our function. And the property that we've given it is a fairly weak property: we're basically saying that the average of the list has to be less than the maximum and greater than the minimum. So everyone who put their hand up before, showing one of the problems with the code, should realise what's about to happen when we run this test.

And the nice thing with Hypothesis is that it hooks into all of the Python test frameworks: nose, pytest, unittest, all those things. So in the vast majority of cases, after you've written your Hypothesis test, you run your test suite as normal. So you run it, there's a lot of output, and right at the bottom, what Hypothesis has done here is it's found a falsifying example that breaks our property. And the particular input is where the list of ints is an empty list. That's one of the values that the list strategy comes up with. It passes along an empty list and we discover that there's a problem there. So Hypothesis has discovered the divide-by-zero problem with my code. I haven't had to think about the issue of passing along that empty list. I haven't had to hand-generate a test case to discover that. In most cases, because the way that you would handle the empty list is by raising a ValueError, you'd probably end up writing a single unit test with a regular input and output to catch that special case. But in this particular example, Hypothesis has discovered that special case for us. We still want to be able to use Hypothesis; we just want to tell it there's a special case here that it can't generate. So this is exactly the same test code as before, except on line six we are passing the min_size=1 argument to the list strategy, saying you can generate lists of any sort, except they all have to be at least size one.

So, how many people picked up two problems? Yeah, there's always one. So I said this wasn't a floating point talk, and I kind of lied, because when you're using floating point, you're using floating point, so you get all of the floating point fun. And the problem here is that in my property there's a little bit of coercion going on between the types. If you have a look at the very last number in that assert line, you'll see that when Python has done that automatic conversion between the integer minimum and the average, the last digit has shifted. You see right at the bottom there, the list of ints is just that one big number. The reason that Hypothesis has given that number is because it knows that that's one of the funky numbers where, when you do that automatic conversion between ints and floats, things go a little bit haywire. Hypothesis is the codified knowledge of all of the little corner cases with the language, and with floating point in particular, and it's a whole bunch of ways of generating test cases that will trigger all of these little corner cases. So the way to get around that is to fix my property, and you'll see the only difference here is on the very last line, where I do the float conversions myself.
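[The fixed-up version, again as a sketch: min_size=1 keeps the empty list out, and the explicit float() calls are the conversion fix just described:]

    from hypothesis import given
    from hypothesis import strategies as st

    from average import average  # assumed module layout

    @given(st.lists(st.integers(), min_size=1))
    def test_average(ints):
        result = average(ints)
        # Convert the bounds to float ourselves, so both sides of the
        # comparison round the same way, rather than letting Python
        # coerce a huge int mid-comparison.
        assert float(min(ints)) <= result <= float(max(ints))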
That stops Python doing the automatic conversions, which means that the twos and the threes end up in the right order. So I run that test again and we get no failures.

Now, the interesting point here is that I'm not actually testing the average result as strongly as I could. I'm using a fairly weak property, just saying that the average is between the minimum and the maximum, and there are obviously much stronger properties that I could be testing. But using that particular property shows that you can test things other than the exact result. And it also shows that you really do have to think about the properties that you're testing, so you have to engage your brain a fair bit more for some of these tests. Property-based testing is not easier. It doesn't mean that you disengage the brain; it means you have to use your brain more. All it does is mean that the artifacts you check in to your code base for your tests are smaller: instead of having thousands and thousands of handwritten test cases, you've only got a few.

So when you're running the test cases, there are a couple of verbose options that you can pass along to ask Hypothesis, please print out all of the test cases as you're running them. When you're first using Hypothesis this is great, because it shows you, under the hood, all the test cases that it's generating. And after about five minutes you realise it's a complete waste of your screen space and you stop doing it. But this is an introductory talk, so it's obviously very useful. So we see here that the list strategy, because we've asked it to generate lists that have at least one item, is generating lists with zeros and big numbers and negative numbers and stupidly big numbers and stupidly small numbers. And depending on your use case, all of those settings can be tweaked. So you can easily say, generate me a list, it's got to have five items in it and they all have to be positive. All of that sort of stuff is very easily covered.

So, some example properties that you might want to think about. If you write a function that sorts something, because we're all into writing functions that sort stuff, you can assert that the output is actually sorted. Writing the sort is kind of hard, but writing a property that says the output list is sorted is relatively easy. And for reversing a string, you can assert that the string is reversed. One of the interesting things with those sorts of functions is that you can actually call that string-reversing function twice and you get back to the same input. And what you can do with those functions in particular, ones that mirror themselves, or that come as a pair of functions working in tandem, is describe your test case without any reference to particular inputs or outputs, and just let Hypothesis generate random inputs and really exercise your functions.
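[Hedged sketches of those two example properties; my_sort and reverse are stand-ins for whatever functions are under test:]

    from hypothesis import given
    from hypothesis import strategies as st

    def my_sort(xs):       # stand-in for the sort under test
        return sorted(xs)

    def reverse(s):        # stand-in for the string reverse under test
        return s[::-1]

    @given(st.lists(st.integers()))
    def test_output_is_sorted(xs):
        out = my_sort(xs)
        # Checking sortedness is much easier than writing the sort.
        assert all(a <= b for a, b in zip(out, out[1:]))

    @given(st.text())
    def test_reverse_is_its_own_inverse(s):
        # No particular inputs or outputs mentioned anywhere in the property.
        assert reverse(reverse(s)) == s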
So, who likes fax machines? Yep. So you've got white paper with black bits and pieces on it. On a typical scan line of a fax machine, it's almost all going to be white, with the occasional black. So a really simple way of compressing that over the phone line is called run length encoding. Basically, instead of sending all the Ws, we just say we've got three white bits, then one black bit, then five white bits, then three black bits, then five white bits. And this type of compression works really well for things of this nature, where you've got a quiet background and then the occasional data point. Fax machines used it, and interestingly enough it's still in use in JPEG: where you see big square blobs in the backgrounds of images, that's where run length encoding comes in.

So I don't want to get too caught up in the code here, but this is a fairly simple compression function. We just run over the string; if we see a white, we count it up, if we see a black, we count it up, and we spit it all out at the end. It's really just a loop over the input. And the decompress is pretty similar: we run over the string, and if we see a number, we grab the last character and multiply it out. There's a couple of reverses and things like that, but it's not too complicated.

The really nice thing with a pair of functions that compresses and then decompresses, zips and unzips, encodes and decodes, is that we've got an API that is reversible. You can pass the input to one function, grab the output of that, pipe it into the other function, and you should come back with the same input. So it's really easy, on lines 10 and 11, to write properties about a compression function. The properties here are that the compressed form should be smaller than the uncompressed form, which is a really nice property to have. It's not always true for things like bzip and friends, because you've got blocking and so on, but for run length encoding, if you write a run length encoding scheme and your output is bigger than your input, you've got a bug. And the really nice one is on line 11, where we take the input, compress it, decompress it, and we should have exactly the same output. And nowhere in that test have we even bothered looking at the input.

On line 6 we're using a text strategy, and one of the things that we can tell the text strategy is: here is the alphabet of letters that you can use. I've basically restricted it to say you can generate text blobs that only contain w's and b's, so it matches our fax machine scan-line example. It's going to be great in ten years' time when it's like, who likes fax machines? and nobody will know, unless you're working for government. So we've got our two functions, we've got a property, we're not looking at the input or the output, so Hypothesis can go nuts generating the inputs for us. We run it and we don't have any problems, because my coding is awesome.

The problem, however, is that we're only passing these randomly generated inputs to one of the functions, the compress function. We're not passing randomly generated input to the decompress function. So that is an issue right there: from a security point of view, we're not doing any sort of fuzz testing on our decompress function. It is something to think about. And like before, I just wanted to show some examples of the output. So we've got a w and a b and an empty string, and then all sorts of random combinations of w's and b's that Hypothesis has generated for us. Every time you run it, you get different outputs, because it's randomly generating them every time.
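[A hedged reconstruction of the compress/decompress pair and the round-trip property; the slide code isn't in the transcript, so the implementation details are assumptions. The size property from the talk is omitted here, since this naive encoding can actually expand an alternating input like "wb", so it would need an encoding that special-cases runs of one:]

    import re
    from itertools import groupby

    from hypothesis import given
    from hypothesis import strategies as st

    def compress(text):
        # Run length encode: "wwwbwwwwwbbbwwwww" -> "3w1b5w3b5w".
        return "".join("%d%s" % (len(list(run)), char)
                       for char, run in groupby(text))

    def decompress(text):
        # Read each count-plus-character pair and expand it back out.
        return "".join(int(count) * char
                       for count, char in re.findall(r"(\d+)(\D)", text))

    @given(st.text(alphabet="wb"))
    def test_round_trip(line):
        # Compress then decompress must give back exactly what we started
        # with; we never mention any particular input or output.
        assert decompress(compress(line)) == line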
So unit-style tests are really where Hypothesis spends most of its time: Hypothesis is really at home where you're testing one function, or a pair of functions that work in tandem with each other. But there's a different sort of test that you can write with Hypothesis. If you've got a suite of functions, like an abstract data type, that all work around a bit of state, so you've got a collection and you're adding and removing and then walking over that data type, Hypothesis has really good support for state-based testing. Instead of getting Hypothesis to generate random input and passing that to your functions, what you're actually getting Hypothesis to do is to generate a random program that follows the lifecycle rules of your API. So in order to use your API, you've got to create your collection, you've got to add something, and then you've got to delete it. A fairly standard lifetime there. Now, the Hypothesis implementation of state-based testing is very good, but it's still quite new: there are a couple of points of flux, and a couple of points in the API that are a little bit underpowered that I'm trying to figure out how to fix, but I'm going to skim over some of those.

So the simple data type that I'm going to look at here is a simple binary tree. For every node in the tree, everything on the left of it has to be less, and everything on the right has to be greater. Not terribly complicated. The square nodes at the bottom, those are our empty trees. If we iterate over them, it's the same as iterating over an empty list. If we insert into them, we return a new tree which actually has a value. And if we delete from an empty tree, something's gone wrong, because there's no value there and you can't delete from an empty tree. A node that has a number, we're calling that an actual tree. Fairly simple: we have a value and we've got two empty nodes beside it, and if we're iterating over it, we yield everything smaller than ourselves, then ourselves, then everything larger than ourselves. Insert starts to get a little bit interesting, because we have to figure out whether we're inserting to the left or to the right, but it's a nice recursive thing. And then delete gets a little bit more complicated, because we have to work out if we're dealing with a real tree, a not-real tree, a bigger-than tree or a less-than tree, and our simple data structure all of a sudden has quite a fair chunk of code to it.
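[The slides aren't captured here, so this is a hedged sketch of the data structure being described; the class names Empty and Node, and the details of delete, are assumptions:]

    class Empty(object):
        """A square node: the empty tree."""
        def __iter__(self):
            return iter(())            # same as iterating over an empty list

        def insert(self, value):
            return Node(value)         # inserting gives back an actual tree

        def delete(self, value):
            # You can't delete from an empty tree; something's gone wrong.
            raise KeyError(value)

    class Node(object):
        """An actual tree: a value with two subtrees beside it."""
        def __init__(self, value, left=None, right=None):
            self.value = value
            self.left = left if left is not None else Empty()
            self.right = right if right is not None else Empty()

        def __iter__(self):
            # Everything smaller than ourselves, ourselves, then larger.
            for item in self.left:
                yield item
            yield self.value
            for item in self.right:
                yield item

        def insert(self, value):
            # The nice recursive bit: work out whether we go left or right.
            if value < self.value:
                return Node(self.value, self.left.insert(value), self.right)
            if value > self.value:
                return Node(self.value, self.left, self.right.insert(value))
            return self                # already present

        def delete(self, value):
            if value < self.value:
                return Node(self.value, self.left.delete(value), self.right)
            if value > self.value:
                return Node(self.value, self.left, self.right.delete(value))
            # Deleting this very node: splice in the remaining subtree,
            # or promote the smallest value from the right-hand side.
            if isinstance(self.left, Empty):
                return self.right
            if isinstance(self.right, Empty):
                return self.left
            successor = min(self.right)
            return Node(successor, self.left, self.right.delete(successor))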
So I would like to show how you can test something like this with Hypothesis. Obviously the code for this is a little bit larger than the unit test examples, but it's not too much larger. On line four we're importing our test strategies, same as before, and we're importing the stateful testing stuff from Hypothesis. So we have a RuleBasedStateMachine; this will codify the rules, saying before we can do anything with the data structure we have to create it, and before we delete something from the data structure we have to have added to it. Those are the sorts of rules that the rule-based state machine is going to remember. The Bundle is the actual state that we're going to pass from test to test, so it's actually going to be the data structures that have values in them. And rule is, for the state-based machines, the equivalent of the given API call; it's going to be the main hook that we're going to be using. And the empty tree is our starting point for our API: all of our trees start as an empty tree, and then you insert values into them. So our test here, we've subclassed it from RuleBasedStateMachine, the state we're going to store in a variable called trees, and we're using the Bundle for that. And then we have rules, which are functions that are run on one of the states of the test machine.

So one of the rules is leaf: we take the states that we've got, and we add a new state to our target. So we're taking whatever states we currently have and we're adding a new empty tree to them. That's our leaf rule, and we can run that leaf rule at any point in time, because we're generating a new data structure. We have a node rule, where we're taking an input, which is one of our trees, we're taking an output, which is our target, and we're also using an integer strategy. So we're taking one of the trees that we've made with the leaf rule and we are inserting a number into it. So we're modifying the state again: we're taking one of the trees, inserting a number into it, and putting that back into the state. One of the rules here, check_sorted, you'll see doesn't have a target on it, so it's not modifying the state, it's just looking at the state. All we're doing there is iterating over the tree and making sure that, whatever state this tree is in, it's sorted.

So with those three rules, and particularly the node rule, Hypothesis is going to generate a forest of random binary trees, and it's just going to keep adding numbers to them. You're going to get some trees where the numbers arrive in increasing order, so all of the nodes lean off one way; you're going to have other times where all of the numbers are decreasing, so they all go down the other way; and you're going to have some trees which end up evenly balanced. So all of those complicated switches in our inserts and our deletes, they're going to get tested at some point in time, if you give Hypothesis enough time to generate enough different states.

So this is our rule to delete, part of the same test. Things get a little bit interesting here, because in order to delete something, we have to assume that the random number that we've generated is actually in the tree. And this is actually one of the weak points of the Hypothesis setup at the moment. There are nicer decorators for doing this, but you can't peek into the state using the decorator, and that's something that I'm looking at fixing. So at this point, sort of halfway through the test, you're running your assumption, and if the assumption bails out, you just pretend that this test didn't happen. And that's a little bit ugly, because it means that instead of generating a whole bunch of test cases that definitely work, you're generating a whole bunch of test cases and only half of them actually work, because a lot of the time it'll be generating random numbers that are not in your tree.

So with those three rules, and this fourth rule to delete, we are generating any number of random trees and inserting and deleting numbers from them. And if you think back to how convoluted our delete case was: if you give Hypothesis enough time, it's guaranteed to generate trees that will test all of those different branches. And we didn't once manually generate a particular input.
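[Putting those rules together, a hedged sketch of the state machine being walked through, reusing the Empty and Node classes from the earlier sketch; the rule and class names are assumptions:]

    from hypothesis import assume
    from hypothesis import strategies as st
    from hypothesis.stateful import Bundle, RuleBasedStateMachine, rule

    class TreeMachine(RuleBasedStateMachine):
        trees = Bundle("trees")        # the state passed from rule to rule

        @rule(target=trees)
        def leaf(self):
            return Empty()             # a new empty tree joins the state

        @rule(target=trees, tree=trees, value=st.integers())
        def node(self, tree, value):
            return tree.insert(value)  # grow one of the existing trees

        @rule(target=trees, tree=trees, value=st.integers())
        def delete(self, tree, value):
            # We can only delete values that are actually in the tree;
            # assume() bails out of runs where the random number isn't.
            assume(value in tree)
            return tree.delete(value)

        @rule(tree=trees)
        def check_sorted(self, tree):
            # No target, so this rule only looks at the state: whatever
            # state the tree is in, iterating over it must come out sorted.
            items = list(tree)
            assert items == sorted(items)

    TestTrees = TreeMachine.TestCase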
So the way that Hypothesis works is that every time it's calling one of these rule-based functions, it's actually running them multiple times. The strategy to generate a number, it'll call that 50 times or 100 times. And there are different modes that you can set Hypothesis up in. On your developer's laptop, for example, if you're just running all of your test cases locally, you'll have a fairly short time period on it, so it'll be like: generate 50 test examples, run those 50 test examples, and then say, yep, they all passed. If you're running this on your continuous integration server and it's running overnight or something like that, you might say: Hypothesis, go generate test cases for 10 minutes. And that's totally fine, because it's not interrupting developer time, and in that case it's going to generate 100,000 test cases.

So Hypothesis is not guaranteed to generate the same results over and over again. If you run it 100 times, it might say everything is fine and it's passed, and it could be on the hundred and first time that it generates the test input that actually fails your example. Now, what Hypothesis will do in that case is, if it finds a failing test example, it will write that out to disk, and the next time that you run the code, it'll keep running that failed example until your code passes with it. So it's the nature of this sort of system, where it is generating test cases in order to find all of the issues over time, that it won't generate the same test cases every time. One other thing: with functions that don't touch any of the state, because they didn't have the target=trees and so don't change the state, Hypothesis will look at those, assume each is a checker function, and in general run all of those checker-type things across the state at all times. But that's more of a heuristic rather than a rule.

So one of the really interesting examples for the state-based stuff is Mercurial. You can think of a version control system as just a lump of state with an API on it, and that's how Mercurial works. It's all written in Python; the command line tool is really just calling a whole bunch of Python APIs underneath. So there are certain rules about using a Mercurial repository: before you can commit something, you've had to add it; before you can add something, you have to create your repository. So there's a whole bunch of state and rules associated with it. Mercurial uses Hypothesis as part of its testing framework, and it's uncovered two particular bugs. In this case, the bug that it uncovered is: if you make an hg repository, make a branch, commit, and then do a shelve, and the branch name had a slash in it, things go wrong. Internally, shelve uses the branch name as a file name under which it's going to squirrel things away, and all of a sudden, when you put a slash in there, it can't find a directory that it never made, because it shouldn't be treating the branch name as a file name. And Hypothesis discovered this because, over enough runs, the text strategy generating the branch name happened to put in a slash. And there's another interesting bug that Hypothesis has found, in one of the test suites that Mercurial uses on its CI system. So basically, every time they're ready to do a release, they just say: Hypothesis, go generate a whole bunch of test cases and see if you can find any brokenness. But I think Mercurial is a really interesting example of what a state system can be, because it's got stuff on disk, it's got stuff in databases, there's a whole lot of state behind it. But when you look at it from an API level, it's actually a fairly simple API.
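[Circling back to the laptop-versus-CI point above: a hedged sketch of how those modes are typically set up with settings profiles; the profile names and example counts here are assumptions:]

    import os

    from hypothesis import settings

    # Quick runs on a developer's laptop: generate 50 examples and move on.
    settings.register_profile("dev", max_examples=50)
    # Let the CI server grind away on many more examples overnight.
    settings.register_profile("ci", max_examples=100000)
    # Pick a profile, e.g. via an environment variable set on the CI box.
    settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))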
Oh, I really should have been upfront about this: Hypothesis is not my creation. I'm probably its biggest fan; I don't think enough people know about property-based testing. The majority of the work is by David MacIver. He's had an interesting couple of years. He was working on Hypothesis in his spare time, discovered that a few commercial organisations were using it while he was, you know, eating out of the bottom of tin packets and stuff, got a little bit grumpy about things, and sort of put it on the shelf and said, no more of this open source stuff. Then he discovered that he can make some money out of it by running training courses and doing ports of Hypothesis to different languages. So he's doing a port of Hypothesis to Java at the moment, because it turns out that if you can help people stop writing mountains and mountains of Java unit test cases, they really like that. That's all David's personal story, and I recommend you go and have a look at his personal blog to see how he dealt with basically a year and a half of burnout, at work and on open source stuff. But he's continuing to work on this project now.

So I have touched on the CI question: there's a whole bunch of settings with Hypothesis for how long it spends generating tests, how many tests it's going to run, whether it stores failing tests, and how to pass those tests along. There are a lot of integrations with other frameworks. There's an integration with Django, for example, where you can basically say: run this test case, but before you run this test case, set up the database so that I actually have some customers in the customer table and some sales in the sales table, and actually link them up. So if you've got any sort of database integrity schemas, where you're saying these tables have to reference each other, the sales table actually has to reference the customer table and things like that, the Django integration will do that for you. And there's also an integration with NumPy that basically says: I need a matrix (I'm not a maths person here) that has this particular determinant and an overall average value of 4.7, and the NumPy integration can go and make that array for you. So it's making not just one number, but a whole array of numbers with a particular shape and with particular properties. And that's everything. So, questions? Oh well, I timed that well.

Do you have any suggestions for, I guess, strategies for coming up with the properties to test for non-trivial functions? You might have some sort of mathematical function where basically the only way to tell if you've got the right answer is to do what you just did in the function.

Yeah, so I'm not suggesting that Hypothesis is the appropriate tool for all testing. There are certainly going to be test cases where, for example, you've had a user bug and you've had to work out exactly what set of inputs triggers the problem. In that case, you're going to want an explicit test case with those values, so that you're always testing that particular user bug. There's a whole range of problems that Hypothesis is not suited to, but I think there's a whole range of problems that it is suited to. And you've come to the crux of it: coming up with the properties is the hard part, and you really have to engage the brain on that.

We'll just leave questions there for now; we're out of time. So, Clinton.
Thank you very much.