I'm here to challenge how you test your code. Let's see where the audience is as a starting point. Hands up if you've written a test. Okay, hands up if you've only used hard-coded values as example data to pass into your code. Fair enough. Hands up if you've used some sort of random input. Good. Hands up if you've used property-based testing. And hands up if you've done any fuzzing. Good. So yeah, some good things to learn about today then.

So I'm here to ask you: are you testing your code hard enough? Are you stretching it? Are you asking it questions it wasn't expecting? Are you an aggressive interviewer, or are you a softball interviewer who only asks the easy questions? She wasn't expecting that question, and maybe you should ask your code some questions it's not expecting.

So what are we trying to achieve with our test input data? Possible goals: to start off with, you want to cover the happy path. That's the one that's going to earn you some money and get the job done, but it's the absolute minimum. You probably want to cover all of the code base. You want to cover exception handling and the validation for when the user passes you data that you understand not to be valid. But what about unhandled exceptions? By definition you're not expecting them, so how are you going to think up example data for them? If you knew what they were, you probably would have caught them already. And then, not just covering each line of code, you really want to cover the independent paths through your code base. It's starting to sound like a lot of work, isn't it?

So I'd like to ask you to take a moment to think about the data that you pass into the tests that you've written; this is to help you contemplate. Which kinds of points do you pick when you're writing your example data, and where would an adversarial approach take you? I'm not an artist, but Google image search is quite handy. So this is the artist's impression of the central parts of the input space here: the obvious examples, jane@example.com. But maybe you're under-testing some more difficult examples. I've certainly seen Unicode errors that would have been caught if the example data had included a Unicode snowman or something like that. Also, just passing in empty lists and empty strings is good as a base standard for edge-case testing.

So how do we create test data? We can write hard-coded values, like most of us have done. We can create purely random data with no feedback; Model Mommy does this, it just gives you something conveniently random that will get the job done. Or we can fire a failure-seeking missile. Boom. Hardcore. Let's take a closer look.

So QuickCheck is a Haskell library. Don't put your hands up just yet: it's been around for a while. Hands up if you've heard of it. Cool. So this is property-based testing. You specify a property of your code that must hold, and QuickCheck will quickly check whether it can prove you wrong. It does its best to find a counterexample. So this is a little bit of Haskell. The basic thing to take away here is that two lists of integers go into a property and a boolean comes out, saying whether the property holds or not. It's about reversing lists of integers. So basically, imagine a list of four integers and another list of four integers here, and it's saying: you're going to join them and then reverse them, as opposed to reversing each of them and then joining them.
Okay. So it's a slightly dubious proposition, but we're going to see if it can prove it wrong. And at the bottom there, you can see 0 and 1 as the input after some shrinking, and it has proved it wrong. But this isn't Euro Haskell Monad Con, so let's not accidentally learn too much Haskell. Instead, let's find out what functional language developers think of our world. This is a direct quote: "In an imperative language, you have no guarantee that a simple function that should just crunch some numbers won't burn your house down, kidnap your dog and scratch your car with a potato while crunching those numbers." Fair enough, but we like Python anyway.

So Hypothesis is the Python version of QuickCheck. It's more of an update, really, because it adds some new features. Let's delve into the kidnapped-dog world of Python. Just as a reminder, this was the Haskell version. In Python, there's a little function I've made here that reverses the lists. In a similar way, you can see the @given decorator here specifies two lists of integers, which map to the two inputs of the test function. And then at the bottom you can see the property is defined with a standard assert (roughly as sketched below). We're running it with the pytest runner, and there's a tiny Hypothesis pytest plugin that just helps you see the output a bit more clearly. So it proves it wrong, and it actually comes up with the same counterexample that QuickCheck did. And we didn't have to think up any example data ourselves. So that was an improvement.

So what's going on here? Maybe it's doing a formal proof in the background. Maybe it's doing some sort of static analysis. Maybe it just passes a symbol into the top of the program, looks at all the manipulations, ends up with a formula and then solves that formula. Well, even the mathematicians haven't quite got there yet. I tweeted a really interesting Wired article called "Will Computers Redefine the Roots of Math?". But no, they're not there yet: mathematicians still have a job, and it's the same in computer science. Nobody has really managed it, especially not in Python. So that leaves us trying a huge number of examples, also known as fuzzing. That's what's really happening.

Let's have a look at the dirtiness under the covers. OK, so this is the first list of integers that Hypothesis is sending in, if I go over here. They're pretty nasty, and it turns out the proposition is false. Fine. So basically it's proved it wrong on the first hit, but it doesn't want to show you that, because that's kind of ugly, and I don't think it would pass code review if you tried to put it in as a hard-coded example. So it has a go at making it simpler. As we scroll down here, you can see the first list is getting shorter: you've got three items now. It's worked out that big numbers are more annoying than little numbers, so those numbers at the top there are getting smaller. Just two numbers, then one number, and it keeps on going. And the second list gets shorter as we go down. It actually overshoots: it gets to something so simple that when you reverse it each way the property is actually true. So that's no good, it doesn't want to show you that either. Simpler and simpler, increasingly trivial lists, and this is the simplest one it could come up with, and that's the one it ultimately shows you. So that's the one you copy and paste into your deterministic test suite. As an aside, if you run it again, it won't go through the whole business of those massive lists of integers, because it's got a local database of successful examples.
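For reference, here is a rough reconstruction of the kind of test just described. The slide code isn't in the transcript, so the test name is illustrative and the modern Hypothesis API is assumed.

```python
# A minimal sketch (not the slide code) of the dubious "reverse" property.
from hypothesis import given
import hypothesis.strategies as st


@given(st.lists(st.integers()), st.lists(st.integers()))
def test_reverse_of_concatenation(xs, ys):
    # Claim: reversing the joined list equals joining the two reversed lists.
    # Run under pytest, Hypothesis quickly reports a shrunk counterexample
    # such as xs=[0], ys=[1].
    assert list(reversed(xs + ys)) == list(reversed(xs)) + list(reversed(ys))
```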
That database is mainly there as a speed enhancement, but also, if it has searched really hard, and maybe there was a bit of luck in finding a counterexample, it wants to keep hold of that, because theoretically it might not find it next time.

So what's going on here? It's generating random-esque input data. It's not purely random: it's random, but with a view to breaking your code. It runs the test repeatedly, so it's really worth bearing in mind that this is not like a standard unit test that just gets run once. That @given decorator means the test actually gets run 200 times by default, or at least until a falsifying example has been found. If it finds a counterexample, it will then try to shrink it to the best of its abilities, just to give you the cleanest, simplest counterexample that proves your property wrong.

So let's have a look at that random-esque data. Where did the integers come from? This was the decorator: strategies.lists and strategies.integers. The integer strategy is made up of two parts. The random geometric int strategy is basically saying: give me smallish numbers, maybe 0, maybe -1, maybe 17 will break your code. And the other one, the wide-range strategy, says: I'm going to give you basically any random number, anything your Python interpreter can handle. So massive integers, and maybe that will upset it a bit. These are strategies to relentlessly... sorry, they are relentlessly devious plans to break your program. And the list strategy: you pass it the elements you want in your list, and it averages about 25 elements, but you can set a maximum size and a minimum size. It's got sensible defaults, but you can override them if you need to.

So Raymond Hettinger tweeted this about the % operator. You might not know what the percent sign does: it's the remainder upon division. And he suggested two properties that should hold: the result should have the same sign as y, and the absolute value of the result should be less than the absolute value of y. Okay, well, let's check. This is how we would write that (roughly as sketched below). So there's no list of integers this time, there are just two integers, and they relate to the x and y inputs of the test function. We've got a new function here, assume. This is a way of giving feedback back to Hypothesis, and it says: if this assumption proves false, in other words if you give me a y that's zero, stop this test case, it's not appropriate, and don't give me any more like that. So in that way it's guiding Hypothesis to be more helpful, to give you inputs that are more likely to be relevant. Then we calculate the result. I did have to create a same_sign function, but apart from that it pretty much reads as English, or as a copy and paste from the tweet. Let's see what the answer was. It passed. Okay, I should know better than to doubt Raymond Hettinger. But I can and will property-test his tweet.

How does it do it? The data strategies are probability distributions that pick elements from within that distribution. There's guided feedback via assume. There's shrinking of counterexamples so they're clearer to read and it's easier to understand why they break your property. And there's a database of failing examples, for speed, especially when you're doing TDD: if it finds something wrong with your code, if you've broken a property, you can have a go at fixing it and try it again straight away until you make it pass. The internals of Hypothesis are really interesting. I won't explain them here, but they use a parameter system, and it's worth having a read; there's a good page in the documentation about it.
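Here is a minimal sketch of that modulo property test, assuming a hand-rolled same_sign helper along the lines described in the talk.

```python
# Sketch of property-testing the two claims about Python's % operator.
from hypothesis import given, assume
import hypothesis.strategies as st


def same_sign(a, b):
    # Treat zero as compatible with either sign.
    return a == 0 or (a > 0) == (b > 0)


@given(st.integers(), st.integers())
def test_modulo_sign_and_magnitude(x, y):
    assume(y != 0)  # tell Hypothesis that y == 0 is not a useful case
    result = x % y
    assert same_sign(result, y)
    assert abs(result) < abs(y)
```

As the talk says, this one passes: Python's integer % really does behave that way.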
Let's look at one more strategy. We've seen integers; floats are a bit more complicated. If you claim that your function accepts floating point numbers, it's going to do all these mathy-sounding things: Gaussian, bounded, exponential, maybe just some integers. You weren't expecting that, were you? And then some nasty floats get thrown in as well: zero, minus zero, minus infinity, positive infinity, NaN. You can assume these away if you don't want them, because if you're doing maths they will probably break your code.

There are some great advanced features in Hypothesis. It makes it very easy to take the built-in strategies and make your own. Say you've got a function that accepts a comma-separated list of integers: you could take a list of integers, map it so they're joined by commas, and then pass that into your code, because you want your test data to be relevant to your test. It can't all just fail at the first hurdle because it's too random. So you might want to build your own strategy like that (there's a rough sketch of the idea a little further down). There are plug-ins for Django, a bit like Model Mommy or Factory Boy, and for NumPy as well; that one's a prototype. There's also stateful testing, still a bit experimental, where you give Hypothesis the controls to your program and it tries to find a sequence of actions that causes a test failure. I don't know about you, but that sounds very interesting to me.

Then, moving on, let's look at another failure-seeking missile that's been getting a lot of attention recently: American Fuzzy Lop. This is a fuzzer, a second attempt; the first one was called Bunny the Fuzzer, so I think Michał Zalewski likes rabbits, and they're certainly fuzzy. It specialises in security and binary formats. Low-level libraries are essential to everything we do, whether that's accessing a database, image processing or encryption. We'll get on to python-afl in a minute, but just for a moment let's think at the C level. Just to remind you, a fuzzer is something that fires data at your program, attempting to crash it. So we've moved on from property-based testing: this is more about crashing your code than specifying properties. These are things you want to leave running for a good while, maybe on multiple cores, and speed is very important, because the more ground you cover, the more likely you are to find some interesting inputs.

Fuzz testing has been around for a couple of decades or more. There's zzuf, which is roughly "fuzz" backwards, that people have been using for a good while, and AFL is a new style with some guiding going on. But traditional fuzzing is not dead; this is very important. AFL might be the cool new thing that came out last year, but Google has been running fuzzing against FFmpeg for a couple of years and found 1,000 bugs, literally 1,000 commits fixing those bugs, so it's not to be sniffed at. If you don't know, FFmpeg is a video processing library; it's in Chrome, and it's probably in the local video player on your Ubuntu desktop. The strategy was to take small sample video files, mash them together with mutation algorithms, maybe splice them together, mutate some bits or bytes here and there. Admittedly they had 500 and eventually 2,000 cores over a period of two years, so maybe not just your laptop, but they made great progress. A lot of memory management bugs were found.
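Going back for a second to the derived-strategy idea mentioned above, here is a rough sketch. parse_csv_ints is a made-up toy function standing in for your own code, and the .map() call is the part that turns a built-in strategy into a more relevant one.

```python
# Sketch: deriving a strategy for comma-separated integers from the
# built-in lists(integers()) strategy.
from hypothesis import given
import hypothesis.strategies as st


def parse_csv_ints(text):
    # Toy code under test: parse "1,-2,3" back into [1, -2, 3].
    return [int(part) for part in text.split(",") if part]


csv_ints = st.lists(st.integers()).map(
    lambda xs: ",".join(str(x) for x in xs)
)


@given(csv_ints)
def test_parse_csv_ints_round_trips(text):
    # Parsing and re-joining should reproduce the generated text exactly.
    assert ",".join(str(n) for n in parse_csv_ints(text)) == text
```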
Actually, coming back to FFmpeg: I was speaking to one of the FFmpeg developers last night, and I can confirm it's not just because they've written awful, awful code. It's because this is quite a hard thing to do. The video specifications can be 600 pages long, and they have to write very, very fast code; that's why they don't write it in Python. They have to look after all their own memory management, and it's very easy not to do that perfectly. Now, there's a quote on that blog post where they reckoned 20% to 30% of the problems found could easily be exploitable. So that's 100 to 200 zero-day exploits that they found with that fuzzing. Something tells me this would be a good approach for the local security services and/or hackers, and they may well be using it. Let's hope the good guys find these bugs first.

So, AFL's goals. It does need to be very fast, because it's got a lot of ground to cover. It needs to be reliable, because if it breaks overnight it's not going to get much done. In the past, fuzzers took a lot of configuration, but this one tries to be very simple and not require much setup; I'll show you in a minute. And it does the things traditional fuzzers do, in terms of taking some sample inputs and mutating them, but it also has a little secret, which is that it adds compile-time instrumentation. This means that when you compile your C code, at each branch point it adds a little hook that records the path taken through the code. So, like we saw earlier about code coverage and taking independent paths through the code, it's able to get feedback on where its test data travels through the code base. And it really is just a drop-in replacement: if you were using GCC before, you just replace it with the AFL version.

So here's a toy example. We're literally just reading 100 characters from standard in, and the bug we're simulating is that if the input is "foo", we're going to blow up. It's a toy example. Let's compile it. There's no configure step here, we're just compiling it. And when I echo into the program there, I've got some print statements, so it said 1, 2, 3, 4, and it did abort. OK, so it works. Let's try fuzzing it. So -i is the input directory. I've got one sample input, which is literally just a file with a single dot in it, just to say: here's something to get going on. I'm not going to tell it what the answer is; let's see if it can get there. Then -o is the output directory, where the results will end up. And it's worth saying, if you're on a laptop like me, you probably want to use a RAM disk, because it's going to do millions of writes and your SSD might stop working sooner than you thought. And it's just going to fire this test data into the standard input of our program.

So this is the dashboard you get with AFL. Let me draw your attention, I've got a laser here: up here we've got total paths and unique crashes. It's found four paths through the code base, which correspond to those if statements. The strategy yields down here give you a sample of some of the mutation operations it's doing: bit flips, byte flips, arithmetic, having a go with some interesting integers. We haven't given it a dictionary here, but I'll show you one of those in a minute. My notes have disappeared. That's helpful. Let's try that. Okay, recovered, right. And the other thing to show you: it's done almost a million runs in, what's that, two and a half minutes. Did all my notes disappear? That's not helpful. Okay, next slide.
So within the findings directory, you get a queue of interesting inputs that it has found take a different path through the code base. It started with that dot I mentioned, which was my sample input. After manipulating that, it found an input that started with an F. It's clearly trying thousands and thousands of examples, but when it happened to find one that started with an F, that took a new code path, so it recorded it for reuse. Similarly F-O, then F-O-O. So it's stepping up, making it further through the code base each time. And in the crashes directory, it has found an example input that crashes the code, which is exactly what we're looking for.

So let's have a look at that crash file. It has a kind of long file name, but that's where it records what happened. It tells you signal 6, which is a SIGABRT; we did call abort, so that's expected. It tells you it's based on the third item in the queue, and that it did some 8-bit arithmetic; in other words, it replaced that y-with-an-umlaut with an exclamation mark. So you can see how it's working, how it's manipulating previous inputs. It's able to stand on the shoulders of what it's achieved so far and get one step further.

So in the last year it has built up a very, very impressive trophy case, and this is only about a third of the list, by the way. There are security libraries, image libraries, SQL libraries, you name it, almost. It generally focuses on libraries that can take random binary input, but you've also got bash there. You have to give it some more help when you're dealing with non-binary input, because if it just fires random characters at SQLite, it's not going to get very far.

So let's have a look at a specific example: SQLite. It's worth saying SQLite is a very, very well-respected library in terms of testing, and it had already been fuzzed by a traditional fuzzer, so you might think there's not much low-hanging fruit left. The approach taken was to start with a dictionary of SQL keywords; you literally just put these one per line in a file. They also grepped out some hand-written test cases from the SQLite test suite. And they found 22 crashing test cases. This is one of the simpler ones. It ends up arriving in a function with an argument not initialised, or something like that, a zero-length argument where it was expecting a list of one or more things. So these were able to be fixed.

So how does it do this? Let's just see an overview. It is a great traditional fuzzer, and you can use it without the instrumentation. It searches for inputs that span different code paths, and it uses genetic algorithms to mash together the examples it's seen so far, as well as mutating those examples one at a time. But you can imagine it's searching the input space, and although it's got some help, some guiding from the instrumentation, it's always going to be a slow process, because the input space can be massive. It certainly can't just go A, B, C, D and do an exhaustive enumeration of all inputs; that would take forever.

So let's have a look at fuzzing CPython. It's worth saying that, obviously, the innards of Python are written in C, so this is a different proposition to fuzzing Python code, which we'll see in a minute. You can download the Python source with Mercurial, you can compile it very much as the Python docs tell you, but using the AFL clang compiler, and you can start fuzzing it.
So I've got a sample input and a sample target program here. I'm not a ctypes expert, so I can't explain that one magic line you need there, but it connects things up. And then you're literally just passing standard in to json.load; it treats it as a file. We're not catching Python exceptions here; we're looking for crashes in the C code. I ran this overnight the other day, for eight hours. It didn't find any bugs yet. Maybe I shouldn't be so surprised; it would be a bit worrying if it was that easy. But it did run 121,000 times. It is a lot slower than running the toy example earlier, because it's loading the whole Python interpreter and so on, but there are tools within AFL to make this faster and easier. They have a fork server, and there are various techniques you can use to speed things up. It also gives you some hints before you start about putting your operating system into performance mode instead of power-saving mode. You could say this is more ethical than causing global warming by mining Bitcoin, but it can certainly make your laptop a bit hot and a bit CPU-intensive.

OK, let's move on to python-afl, because this is not a C conference. This uses Cython to connect the C layer and the Python layer. It connects the instrumentation we mentioned to the Python interpreter via sys.settrace, so every scope that's entered logs a little waypoint as the input travels through the code base. And your unhandled Python exceptions get converted to SIGUSR1, which AFL will recognise as a crash. And you can see here that py-afl-fuzz is basically just typing "py" before afl-fuzz, so it's literally just as easy to use.

So here's an example Alex Gaynor did, using this to fuzz the cryptography library. It's pretty simple: you have a little afl.start() hook there that connects things up, and he's literally just passing standard input into decoding a signature. He said it was fruitful, but I don't know if he listed any particular bugs that he found. But this is the general approach (roughly sketched below).

So what are some interesting questions raised by these two libraries? In default mode, Hypothesis and afl-fuzz will give you new input data every time you run them. This could be considered annoying by some people. They want to know whether a new commit fails their tests, and if you've got different test data every time, well, maybe it failed because it found a different test input rather than because of your commit. So some people insist on a consistent pass or fail. On the other hand, you might find more bugs, and that would be handy as well. I think the resolution between the two is to do the non-deterministic testing, maybe not in your per-commit testing, and then look for the counterexamples it pulls out and copy those into your deterministic test pack. Or just live with the non-determinism and find more bugs; that's what the author of Hypothesis recommends. You can put it into a deterministic mode if you insist.

So we've been thinking about random input. One way to think about this is that if things are too random, they won't even get through the starting gate; they have to be relevant to your code. On the other hand, you can't enumerate all the examples, because there are too many; the input space is often just too massive. And if you give it a happy-path sample input and you don't mutate it enough, you're just going to go straight through the maze and come out the other end, and you're going to think everything's fine.
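For concreteness, here is a rough sketch of what a python-afl target can look like. This is not Alex Gaynor's actual script (it feeds json rather than the cryptography library), and the hook the talk calls afl.start() is named afl.init() in current python-afl releases.

```python
# target.py -- a minimal python-afl target sketch (illustrative, not from
# the talk). Run it roughly like:
#   py-afl-fuzz -i input_dir -o output_dir -- python target.py
import json
import sys

import afl  # provided by the python-afl package


def main():
    afl.init()                      # hand control to AFL before reading input
    data = sys.stdin.buffer.read()  # AFL feeds mutated inputs on stdin
    try:
        json.loads(data.decode("utf-8", errors="replace"))
    except ValueError:
        pass  # expected parse errors; anything else surfaces to AFL as a crash


if __name__ == "__main__":
    main()
```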
So you want to hit that sweet spot where you're reaching the dead ends of the maze, the dead ends representing all the paths through your code base, but not having everything fail at the first hurdle because it's too random.

So which library should you use? If you insist on just using standard unit tests, you can't expect the unexpected. You should probably use Hypothesis if your input data are Python structures, and even if they're not built-in data structures, you can build your own strategies; it's quite fun as well. It also means you don't have to think up the hard-coded examples. People say the test code base becomes easier to read, because they're not distracted by the specificity of bob@example.com; it just says this function takes all strings, or all lists of integers. And use something like AFL, or python-afl, if your inputs are binary or, as we saw, if you're parsing text input like an SQL library.

So in conclusion, we've seen two styles of test data generation. Humans are generally bad at picking random examples. Developers are bad at being adversarial towards their own code bases, which they understandably love. Computers are fast: let them play with your code and find more bugs before your customer or the secret service does. Let me end by saying: don't interrogate your code base like it's a fluffy bunny stuck up a tree. Fire a guided missile, blow the branches off the tree, and clear up the mess. And it's not just me saying it; that's a pretty good endorsement. Just a little reminder there. Also of interest, and you don't even have to get up: the talk after this one, which will be more informative and better presented, is by Moritz Gronbach, who I haven't talked to yet, so I hope I didn't cover all your points. It's in this room directly afterwards. And there we go. I've been Tom Viner. Any questions?

Thank you, Tom. Is Hypothesis Python 3 ready, and could it be made to use type annotations if they were available in the code?

I think it is Python 3 compatible, but I don't know about the type hints. You'd have to ask the author of the library. It sounds interesting, though.

Thanks for the interesting talk. How does Hypothesis handle it if the code under test exhibits some randomness itself, either voluntarily, maybe a Monte Carlo algorithm, or involuntarily, by a mistake?

Very good question. It will raise a flaky code warning. You can suppress the warning, I think, but it basically tells you that if the code is non-deterministic, you're less likely to get helpful results. You may want to put your code into a deterministic mode itself, or take another approach.

I just wanted to point out, sorry, that if you want to help make libraries, especially C libraries, more secure, there is a project called the Fuzzing Project by Hanno Böck, which gives you some documentation to get started with AFL and fuzzing your favourite library. So that's a very good place to start if you want to make things more secure.

Can you just repeat the name of that?

The Fuzzing Project, I believe.

Thanks for the talk. You mentioned that Hypothesis iterates over its tests many times to produce results. Do you run it as part of your standard test workflow, or somewhere else in your testing workflow?

I personally would bite the bullet and use it in a TDD workflow. There was a talk the other day about Testmon, which uses coverage.py to only run the tests that are related to the code you're changing.
You could get a speed improvement from Testmon and then balance that against a slowdown from Hypothesis, and maybe end up where you were before, but finding more bugs.

I have a quick question. Because Hypothesis is generating random input, is it a bit strange to use on a project where you have multiple developers, because then doesn't each developer have different input into the tests?

The inputs are non-deterministic to start with; even before you had a local database of examples, everyone was getting different test inputs. But this idea of sharing the database of found examples is, I think, still a work in progress. I think the developer of Hypothesis is still trying to work out whether it makes sense to, for example, have that database on your CI server, or whether that's a bit of a non-starter. You can add another decorator to force it to always include a specific example (roughly as sketched below). So if a developer found a certain input that was helpful, they could do a commit that hard-codes it as always present.

Please join me in thanking Tom once again.
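As a footnote to that last answer, pinning a known input with Hypothesis's @example decorator looks roughly like this (a minimal, self-contained sketch):

```python
# Sketch: @example guarantees one specific input is always tried, alongside
# whatever Hypothesis generates at random.
from hypothesis import example, given
import hypothesis.strategies as st


@given(st.lists(st.integers()))
@example([3, 1, 2])  # an input somebody once found useful, now always run
def test_sorting_is_idempotent(xs):
    once = sorted(xs)
    assert sorted(once) == once
```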