Good afternoon. My name is Heidi Thorpe, and I've worked for a government agency in Australia as a data scientist. Today I'm going to be talking to you about fuzzing and how you can use it in your testing: how you can use fuzzing to harden your systems, mainly against cyber attacks.

First of all, I'm going to introduce the subject. What is fuzzing? It is the automated generation of data for testing. For fuzzing to work best, it has to be more than just randomly generated invalid data.

So, for instance, you have an image classifier that you're going to train to look for horses. If you were going to use random test data, then you might have pictures of an orange or a chair. You test the classifier with the orange and you say, no, that orange is not a horse. You test the classifier with the chair and you say, well, it's a bit closer: it's brown, it has four legs, you can sit on it, but it definitely isn't a horse. With fuzz testing, instead of using images of an orange and a chair, you would possibly use images of a donkey and a unicorn.
Is a donkey a horse? Well, not really, but it's closer than an orange. Is a unicorn a horse? Still not quite, but it's closer to what we're expecting. It's still unexpected data, and it could cause unexpected results.

So what we're going to be doing is creating XML or JSON that looks almost right but gives you substantially more edge cases than you would get using just manual testing. The objective is to come up with data that superficially looks correct but is not. Because it is machine learning, you can automate the generation of the data and the execution of the tests. The greatest vulnerabilities are crashes, null pointer exceptions, and buffer overflows. You can run thousands of tests and in this way look for more exceptions.

I can't show you real client data, so we'll be using MS Word data, because internally it's XML. If you take an MS Word document, you'll notice that it's a zip file, and you can unzip it. This is the set of files that you get when you unzip it. It's not a practical example, but it's a demonstration of fuzzing and of how to generate XML, with the added advantage that we can import the data back and see the results at the end. This is an example of one of the XML files from a Word document. Many REST interfaces, portals, and so on expect XML or JSON.
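The unzip and re-import steps described above can be sketched with Python's standard `zipfile` module. This is only an illustration, not the code used in the talk: the helper names are hypothetical, and the archive built here is a tiny stand-in with two XML parts, not a document that Word would open.

```python
import io
import zipfile


def build_stand_in_docx():
    """Build a tiny zip in memory that mimics the layout of a .docx file.

    Assumption: a stand-in for a real Word document, with placeholder XML.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<w:document><w:body/></w:document>")
    return buffer.getvalue()


def read_xml_parts(docx_bytes):
    """Unzip the archive and return {part name: XML text} for every .xml part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
            if name.endswith(".xml")
        }


def replace_part(docx_bytes, part_name, new_xml):
    """Return a new archive in which one part is swapped for generated XML."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for name in source.namelist():
            data = new_xml.encode("utf-8") if name == part_name else source.read(name)
            target.writestr(name, data)
    return output.getvalue()


if __name__ == "__main__":
    original = build_stand_in_docx()
    print(sorted(read_xml_parts(original)))
    fuzzed = replace_part(original, "word/document.xml", "<w:document><w:body>")
    print(read_xml_parts(fuzzed)["word/document.xml"])
```

With a real document, the bytes would come from reading the `.docx` file, and the repacked archive is what gets opened again to see how the application reacts to the generated XML.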
So Word XML is still a valid test.

So what is neural fuzzing? Neural fuzzing is a technique using neural networks for fuzz testing, in this case using neural networks to generate test data.

Why is fuzzing a good idea? Well, fuzzing can save money, but it's not just about saving money. We'll start at the programming step. The programmer makes a mistake. The software then moves through the QA and test cycle, and the bug is not found. Maybe the testers are not using fuzzing, maybe they're not doing enough testing, maybe they're not using the right data. The software is released, and at some point in the future an attacker finds the bug. Probably they're using fuzzing, maybe they're not; however they do it, they find the vulnerability and attack the system. They send data to the software and cause it to crash. They can then steal money, data, property, whatever they like. It could quite literally be death and destruction if an attacker crashed the software in your car, and this has actually happened, not necessarily with XML, but it has happened to people in their cars.

At some point the owner of the software realizes there's something wrong. They realize that a vulnerability has been found. Computer security experts are consulted, and they trace the attack back to the original bug. Now it is listed as a known vulnerability, and the vendor can fix the bug and release a patch. This is the worst-case scenario, and it happens all the time.

A better bug lifecycle would be this: instead of an attacker finding the bug, it is found by a security expert using fuzzing. They can responsibly and discreetly disclose the vulnerability before it is found by an attacker, the vendor can put out a patch, and the bug is not exploited.

The best-case scenario would be this one: somebody writes a bug, QA testing finds it, the bug is fixed, and it's not part of the software release. This lowers the cost of the software and increases the satisfaction of the software users. Functional testing alongside
fuzzing increases the possibility that a bug is caught before the software is released.

So how does it work? Fuzz testing involves inputting massive amounts of random data in an attempt to make the software crash. During fuzzing, a program is executed many times, and after each execution the result is recorded. The input is modified: sometimes the whole input is changed, sometimes only some of it. The modified data is then sent to the program and the result recorded along with the input data. If the software crashes, the result can be analyzed; otherwise the process just continues until a failure happens.

There are two types of fuzzing technique. The first is dumb fuzz testing, where the input data is randomly changed and the reaction to the random data is observed. This may not provide useful results, as most applications expect the data to be in a particular format. The technique should still be employed, as it's reasonable to expect that sometimes the software will receive this kind of random input.

The other type is smart fuzz testing: changing specific values, leveraging knowledge of the underlying format or expected behavior. In this case the application expects input that is all numbers, for instance a phone number. Most users abide by the rules and input numbers. Fuzz testing systematically simulates a user who does not abide by the rules.

So here we have an analogy: a security guard at a party. You send a message with your name, he looks on a list and sends a reply, yes or no: will you get into the party? To fuzz test this guard, you need to do something unexpected. You say, "Look over there!" Is that valid input?
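The guard analogy, together with the dumb and smart techniques, can be sketched as a small fuzz loop. Everything here is hypothetical and for illustration only: the guest list, the `guard` function with its deliberately planted bug, and the two input generators are assumptions, not code from the talk.

```python
import random
import string

GUEST_LIST = {"alice", "bob"}  # hypothetical guest names


def guard(message):
    """Toy system under test: takes a name and answers 'yes' or 'no'."""
    name = message.strip().lower()
    initial = name[0]  # deliberate bug: IndexError if nothing is left after strip()
    return "yes" if initial.isalpha() and name in GUEST_LIST else "no"


def dumb_input(rng):
    """Dumb fuzzing: random printable characters, no knowledge of the format."""
    length = rng.randint(1, 12)
    return "".join(rng.choice(string.printable) for _ in range(length))


def smart_input(rng):
    """Smart fuzzing: start from a valid name and break one rule at a time."""
    name = rng.choice(sorted(GUEST_LIST))
    mutation = rng.choice(["truncate", "pad", "inject", "blank"])
    if mutation == "truncate":
        return name[: rng.randrange(len(name))]
    if mutation == "pad":
        return " " * rng.randint(1, 3) + name + "\t"
    if mutation == "inject":
        position = rng.randrange(len(name) + 1)
        return name[:position] + rng.choice("\x00'<&") + name[position:]
    return " " * rng.randrange(4)  # "blank": an empty or whitespace-only message


def fuzz(target, make_input, runs, seed):
    """Run the target many times, recording each failing input and its error."""
    rng = random.Random(seed)  # a fixed seed keeps the run repeatable
    failures = []
    for _ in range(runs):
        data = make_input(rng)
        try:
            target(data)
        except Exception as exc:
            failures.append((data, type(exc).__name__))
    return failures


if __name__ == "__main__":
    for label, generator in (("dumb", dumb_input), ("smart", smart_input)):
        found = fuzz(guard, generator, runs=200, seed=1)
        print(label, "failures:", len(found))
```

Recording the failing input next to the error type is what lets a crash be replayed and analyzed afterwards, and seeding the generator makes the whole run reproducible.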
"Look over there" is not expected, but it's not random either. The unexpected behavior may trigger a failure. Fuzz testing systematically simulates a user who does not follow the rules and could cause a failure.

A good fuzzer decides what the malformed inputs should look like and generates test cases. It then automatically executes the tests and records any failure. It needs to keep records of any failure so that it can be a repeatable process. The whole point of fuzzing is to find vulnerabilities in the software before anyone else does. This is important from both a security and a robustness point of view. The sheer power of having a fuzzer generate test cases is that you're going to have useful test cases showing up.

Now we're going to implement a long short-term memory network, a neural network, and this is how we're going to generate the invalid data. Specifically, I'm using recurrent neural networks and Python libraries with TensorFlow to generate new inputs.

Traditional neural networks start from scratch and cannot use prior knowledge. People don't start from scratch every time they read or hear something; they use knowledge that they have from prior experience. Recurrent neural networks are networks with loops, allowing information to flow from one step to the next and to persist. The most important factor with an RNN is the recurrence: the loop back to the internal state, also known as the hidden layer. You just keep looping for as long as there are inputs. Consider what happens when the loop is unrolled: there is a box for each time step, or input, in the sequence. LSTMs are a special type of RNN, and this is what we're using in this instance. The chain reveals that recurrent neural networks are related to lists and sequences.

Consider trying to predict the last word in some text:
"I grew up in France, but I do not speak..." something. Recent information implies the next word would be the name of a language, but which language? We need the context of France from further back. As the gap grows between the relevant pieces of data, RNNs become unable to learn the connection. In theory, RNNs are capable of handling these long-term dependencies, but in practice this is not the case.

This is an example of an LSTM network. X of t is given, where t is time; y of t is what we're trying to find; and h of t is the hidden layer. Long short-term memory networks, usually just called LSTMs, are a special form of RNN. They're designed to avoid the long-term dependency problem that RNNs have. Remembering long-term information is their default behavior. The repeating module of a standard RNN contains a single layer. These are the four interacting layers of an LSTM.

The key to the LSTM is the cell state, the horizontal line running through the unit. The cell state runs straight down the entire chain with only some minor linear interactions, so it's easy for information to just flow along it unchanged. The LSTM has the ability to forget or remember information. This is carefully regulated by structures called gates. Gates are a way of optionally remembering or forgetting information. They're composed of a sigmoid neural net layer and a pointwise multiplication operation.

The first step in the LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer, the forget gate layer. It looks at h of t minus one, one step in the past, along with the current input, and outputs a number between zero and one for each value in the cell state to decide how much it's going to forget and how much it's going to remember. A one represents "completely keep" and a zero means "completely forget". The cell state may, for instance, remember the name of the present subject, and when it finds a new subject it may throw away the old subject and just remember the new one.

The next step is to decide which information we're going to store in the
cell state. This has two parts. First, the sigmoid layer called the input gate decides which values we're going to update. Next, the tanh layer creates a vector of new candidate values that could be added to the state. In the next step, we're going to combine these two to create an update to the state. In the example of our language model, we'd want to add the name of the new subject to the cell state, to replace the old one that we're forgetting.

It is now time to update the old cell state, C of t minus one, into the new cell state, C of t. The previous steps already decided what we're going to do; we just need to actually do it. We multiply the old cell state by f of t, forgetting the things that we decided to forget earlier, and then we add the new candidate values, scaled by how much we decided to update each of them. In the case of the language model, this is where we actually drop the information about the old subject and add the new information, as decided in the previous steps.

Now we're going to look at some code. Because most people are not really very good at looking at XML and deciding whether it's correct, we're going to look at some Shakespeare sonnets, and we're going to generate some Shakespeare sonnets using LSTMs.

So what we've got here is the Keras model, the Sequential model. We're loading some sonnet text, and what we've got to do first is turn the sonnets into sequences, in this case sequences of ten characters, and this is what the neural network is going to use to learn. We build the model. We're using a hyperparameter called temperature, which, depending on its value, controls the randomness of the predictions. So what we've got is all of the different characters that we're using; that is our character set. Usually, if you're using words, then you'd use word2vec or something that's
already been trained. But in this case we're just using characters, and we're going to generate the sonnet one character at a time.

So we have the neural network, and you can see the output starts out as pretty much gibberish. There are just random letters, and then as it's learning it discovers the word "the". So we've found a word. We're learning a bit more, and a bit more, and this is just one character at a time. Then when we get to about here, we see something like "death not to me so furloughness dibbinent". So it's finding words, and it's almost kind of like a sonnet. As we go further down here, we see "the stars of summer state" and then "betreat of all two sir one gives and sing", and this is one character at a time, generating sonnets as it goes.

Then, as we get towards the end, we've got something that actually looks kind of like Shakespeare: "the brave day sunk in hideous night; when I behold the violet past prime, and sable curls all silvered o'er with white; when lofty trees I see barren". And you say, well, that's one character at a time.

The code that I used for XML is exactly this code, but as you can see, with sonnets you can actually see it doing something, whereas with XML it's just that much harder. This is exactly the same code, but instead of the sonnets it has some XML as input, and you can see it produces gibberish, gibberish, and then as it gets towards the end it's actually looking like real XML.

So that is what I've used: not exactly this code, not exactly this data, but something similar. That's how I've been generating data for testing, because generating it manually is just too hard, whereas by generating it automatically you can generate thousands of records.

And that's all. Thank you.
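Two steps from the code walkthrough, cutting the text into ten-character sequences and using temperature to control randomness, can be sketched without the network itself. This is a minimal standard-library illustration, not the Keras code shown in the talk: the function names are hypothetical, and the probability list stands in for whatever a trained model would predict for the next character.

```python
import math
import random


def make_sequences(text, seq_len=10):
    """Slide a window over the text: each window is paired with the next character."""
    return [(text[i:i + seq_len], text[i + seq_len])
            for i in range(len(text) - seq_len)]


def build_charset(text):
    """Map every distinct character in the text to an integer index."""
    return {char: index for index, char in enumerate(sorted(set(text)))}


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution.

    Temperatures below 1 sharpen it (safer, more repetitive output);
    temperatures above 1 flatten it (more random output).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the peak for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw the index of the next character from the rescaled distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


if __name__ == "__main__":
    text = "shall i compare thee to a summer's day"
    pairs = make_sequences(text)
    print(pairs[0], len(build_charset(text)))
    stand_in_prediction = [0.7, 0.2, 0.1]  # assumption: not a real model output
    for temperature in (0.5, 1.0, 2.0):
        print(temperature, apply_temperature(stand_in_prediction, temperature))
```

In a full pipeline each window would be encoded with the character set and fed to the LSTM, and `sample_index` would be called on the model's predicted distribution to pick each new character, whether the training text is sonnets or XML.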