I want to point out that we have ourselves a first-time speaker here. And I'd like to welcome John Seymour to the stage for his talk, Quantum Classification of Malware. Give it up.

Hi, everybody. Thanks for coming out to my DEF CON talk. I hope you all find it interesting. My name's John Seymour, and the best way to reach me is probably through my email, but I'll put my Twitter info on the last slide if you need it. So I guess we'll just go ahead and get started here. I'll do the standard little bit about myself. I'm a PhD student at the University of Maryland, Baltimore County, trying to find out what it means to be a good malware dataset. I've been actively studying and researching infosec for about three years now, so I'm still a bit of a noob, but I'm trying to bridge the gap between academia and industry. And I'm currently at CyberPoint International finishing up a few summer projects involving both infosec and machine learning.

I've broken the talk today into a few major segments. First, I'm going to talk about the current state of the D-Wave, the controversy around it, and what working with one actually looks like. Then we'll switch gears and move into the machine learning background necessary to understand how a D-Wave classifier works. I'm going to go ahead and warn you right now: for a talk named Quantum Classification of Malware, there's going to be a little bit of technical stuff in there, so I hope that's okay. Finally, I'm going to segue into our design choices and implementation details of actually getting a malware classifier onto the D-Wave 2 instance that my university has access to. We found some interesting things as we played around, so we're going to wrap up with some interesting observations and where we think further research might actually be useful.

So I'm going to start. Who here has heard about quantum computing and how it's going to break all our cryptosystems and all the things? Like, go ahead, raise your hands. Let's see. Yeah. Okay. And who here has heard the complete opposite, that quantum computing is going to allow us to communicate perfectly securely and help us do all the things? Right, yeah. So this is the point of my talk where I just want to lower everybody's expectations. The D-Wave doesn't do any of that, right? Regardless of the state of standard quantum computing, the D-Wave doesn't do that. In fact, there's a lot of lowering of expectations that I need to do here because of some misinformation about the D-Wave when it first came out, so let's go ahead and get that out of the way. You might have heard that the D-Wave solves NP-complete problems in polynomial time. This is definitely false. The D-Wave doesn't solve NP-complete problems. Now, the D-Wave might obtain good solutions to these problems, but it's important to remember that often we can do that classically too. It's also a hard question whether the D-Wave is better at solving any real-world problems at this moment than classical machines. There are a few papers arguing that the D-Wave outperforms classical machines already, but so far these have been pretty spurious comparisons: more specialized classical software was able to outperform the D-Wave on those types of problems. Now, of course, that's not to say that the D-Wave won't become better than standard classical computing in the future, but it's still got a ways to go.
And because of all this misinformation, it's no surprise that there's a lot of polarized debate about what the D-Wave can do. To my knowledge, this is the current state of affairs regarding the D-Wave. First, quantum effects are happening, right? But this might not actually be interesting. Quantum effects happen everywhere, even in NAND flash, for example. The question that everyone's interested in is whether the D-Wave uses quantum effects for computation and whether their implementation will or might perform better in the future. Regardless, it can't run the standard quantum algorithms that everyone gets excited about. And to be fair to D-Wave, they do try to say this pretty clearly at every presentation I've ever been at. It looks like they've made some design choices in the pursuit of solving NP-complete problems which mean universal quantum computation can't happen on the D-Wave machines. And of course, also to their credit, they have made several advances on cooling and on power consumption for electronic devices. And I've also heard that some of their techniques might be useful for scaling even standard quantum computation.

So let's start with the non-technical stuff, right? Most of you have probably seen this picture before. This is the D-Wave case. It's a big black box about the size of a small room. And if you open it up, this is what you see inside. This contraption is mostly for cooling. The box also has a lot of room for a technician to stand inside for repairs and whatnot. But the chip is tiny and at the bottom; you can't even really see it from here. So here's a close-up, right? There's actually a lot of classical circuitry on this chip. Only that middle-left gray square, I don't know if you guys can see it, is the quantum part. And again, that's really hard to see, so I'm going to show a close-up on the next slide. Now, this is actually a different chip. This is the Washington, which is the 1000-qubit chip that just came out this summer, but its structure is pretty similar to what we worked with, just bigger. You might actually be able to see a faint grid on the chip. That's a lattice of what D-Wave thinks are qubits. That grid is a lattice of niobium loops, which is where the possible quantum behavior comes from. These loops are magnetized, and then they entangle where the loops intersect. At least, I think we have a consensus that the loops entangle at those intersection points, but as with everything else, it's still hotly debated. The idea is that these loops want to be in agreement, which will happen at the minimal energy state of the system. So think north repels north and south repels south: we want all of these different magnetizations to be pointing in compatible directions.

So the D-Wave is programmed by biasing the niobium loops and the couplers which govern their interactions. This formula here is how you represent that mathematically: given a's and b's, which are all real numbers, the D-Wave attempts to find the assignment of q's such that the sum of the a_i times q_i terms, plus the sum of the b_ij times q_i times q_j terms, is minimized. We normally work with each q being either 0 or 1, and when that's the case, this is known as a quadratic unconstrained binary optimization problem, or QUBO for short. It turns out that if we could solve QUBOs easily, it would actually be really useful. But the D-Wave doesn't always get the absolute minimum solution, so the company now calls their machine a heuristic for solving these sorts of problems.
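To make the QUBO idea concrete, here is a minimal sketch of what the machine is being asked to do, written as a brute-force search on a classical computer. The variable names and the tiny example problem are made up for illustration, and exhaustive search is obviously only feasible for a handful of variables, which is exactly why a heuristic (or a D-Wave) is interesting.

```python
import itertools

def qubo_energy(q, a, b):
    """Energy of one assignment: sum_i a[i]*q[i] + sum_(i,j) b[(i,j)]*q[i]*q[j]."""
    energy = sum(a[i] * q[i] for i in range(len(q)))
    energy += sum(coupling * q[i] * q[j] for (i, j), coupling in b.items())
    return energy

def brute_force_qubo(a, b):
    """Try every 0/1 assignment and keep the lowest-energy one.
    The search space doubles with every added variable."""
    n = len(a)
    best_q, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=n):
        e = qubo_energy(bits, a, b)
        if e < best_e:
            best_q, best_e = bits, e
    return best_q, best_e

# Tiny example: 3 variables with linear biases a and couplings b.
a = [1.0, -2.0, 0.5]
b = {(0, 1): -1.0, (1, 2): 2.0}
print(brute_force_qubo(a, b))  # -> ((0, 1, 0), -2.0)
```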
So we at UMBC have access to a D-Wave 2 instance in Burnaby, Canada. D-Wave has built a little website with a GUI and everything for submitting programs, and the parameters for running them, to the machine. They also have API access, which is basically like using OAuth, for those of you who have made Facebook and Google apps before. And when you want to play with the chip through their website, you get a visual representation looking like this, called the Chimera graph. This is the System 6 processor that we had access to when we first performed our experiments. Now, the first thing you'll notice is all those spots where nodes are missing. Those are called dead qubits, qubits which are defective. Programmers can't interact with those dead qubits at all, and it's assumed that they don't interfere with any of the computations. They're determined when the machine boots up, so a reboot can fix those dead qubits, but it can also kill other ones. Reboots of the D-Wave don't happen often, probably every two months or so in my experience. Out of the maximum of 512 qubits, this System 6 chip has 496 working qubits. Now compare that to the System 13 chip, which is what we have access to now, right? There are a lot more dead qubits on this chip. The takeaway here is that just because they call it a 512-qubit machine doesn't mean the machine actually has all 512 of those qubits.

And then finally, here's an example of an optimization problem run on the System 6 chip. The left is what we input to the machine; each colored node or edge corresponds to the bias that we gave it. On the right is what was returned to us by the D-Wave, and again, each color corresponds to the qubit's final state. I think red means that the qubit measured at the end was positive one and blue means that the qubit at the end of the run was negative one. But we like to work with binary variables, so we apply a simple substitution function to change the negative ones to zeros.

So far we've been talking about QUBOs and working directly on the D-Wave chip. Now, there are certain problems, like 3-SAT, which actually transform into those pretty simply. And as a side note, this is why D-Wave the company is so interested in 3-SAT. However, D-Wave also developed some closed-source software to embed arbitrary minimization problems into QUBOs, and they call that software BlackBox. Generally, the problem of embedding an arbitrary minimization problem onto the Chimera graph is NP-complete; it's very similar to the subgraph isomorphism problem. There might still be solutions for particular graphs, but rather than actually solving the problem for a chip with given dead qubits, D-Wave instead uses a heuristic called tabu search for embedding problems onto the D-Wave chip. BlackBox involves a dialogue between this classical tabu algorithm and the D-Wave chip. What I mean by this is that the tabu algorithm on a classical machine finds what it thinks is a good embedding for some chunk of the problem, and then this chunk is sent over to the D-Wave to solve. The solution is passed back from the D-Wave to the tabu search algorithm, which uses it as input for the next iteration of solving the problem. And this continues until the machine can't find a better solution or until a specified timeout is reached. So a lot of our time in BlackBox programs is actually wasted just due to network latency as part of this dialogue. But actually coding this all up is straightforward, right? Here's an example of some Python code which connects to the System 6 processor and minimizes a given function.
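The code on the slide isn't reproduced in this transcript, and the real BlackBox client is closed source, so the snippet below is only a rough, self-contained stand-in for the workflow just described: a hypothetical blackbox_minimize() that alternates between a crude classical search step and a stubbed "hardware" call. None of these names are the actual D-Wave API; treat this as a sketch of the shape of the dialogue, not real client code.

```python
import random

def fake_dwave_solve(objective, n):
    """Stand-in for the call out to the chip: sample a few random bit
    strings and return the best one. (The real machine would be handed
    an embedded QUBO here.)"""
    best = min((tuple(random.randint(0, 1) for _ in range(n))
                for _ in range(64)), key=objective)
    return list(best)

def blackbox_minimize(objective, num_vars, solver=fake_dwave_solve,
                      max_rounds=50):
    """Hypothetical stand-in for the BlackBox workflow: hand it a solver,
    a few parameters, and a function that scores a bit string, and it
    looks for a low-scoring bit string by bouncing between classical
    search and the 'hardware' solver."""
    current = solver(objective, num_vars)            # initial hardware guess
    for _ in range(max_rounds):
        improved = False
        # classical step: greedy single-bit flips (a very crude tabu-search stand-in)
        for i in range(num_vars):
            candidate = current[:]
            candidate[i] ^= 1
            if objective(candidate) < objective(current):
                current, improved = candidate, True
        # hardware step: ask the solver for a fresh proposal, keep the better one
        proposal = solver(objective, num_vars)
        if objective(proposal) < objective(current):
            current, improved = proposal, True
        if not improved:                              # stop when nothing helps
            break
    return current

# toy objective: prefer bit strings with exactly three ones
print(blackbox_minimize(lambda q: abs(sum(q) - 3), num_vars=8))
```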
You basically just put in the solver you want to use, some parameters, and a function which returns a value for how good a given bit string is. And then BlackBox will look for the best bit string to minimize that function.

So this is cool and all, but the question is: what can this machine do? Now, D-Wave claims a lot of applications, like classification, protein folding, and getting close to optimal solutions to NP-complete problems like traveling salesman. They do have toy tutorials for most of these on their website, but it's not quite clear to me how those toy-sized tutorials scale to larger problems.

So now we're going to switch gears a little bit and talk about the machine learning background necessary for the D-Wave stuff. I obviously can't get through everything to do with machine learning in the time we have available to us today, but I'm going to try to go through what's relevant to this project. I'm going to assume you guys know about supervised and unsupervised classification and that sort of stuff. If you don't, definitely check out Alex Pinto's or Rob Bird's talks; they're amazing. But we're using a supervised technique here, which means that the instances we feed into the algorithm we create are labeled before we train our classifier.

The D-Wave classifier we look at is a boosting algorithm. To explain this concept, I'm borrowing from a good tutorial I read recently; definitely check it out if you're interested in machine learning later. It's very similar to error-correcting codes, if any of you work with signal processing. Let's say we have three programs which classify malware, and further, let's suppose any single one of these programs has a 70% probability of being correct for any given instance. You could simply choose any single classifier and be happy with getting 30% of your instances wrong. Or you could be a bit smarter and combine them. A simple way to combine them is by running each of your three classifiers on an instance and using whichever classification the majority assigns as your final guess. Doing this, your new classifier can be right, in our example, up to 78% of the time, because now your new classifier will be correct whenever at least two of the old classifiers guess correctly. And you can actually check this by writing out the probabilities for the four cases when zero, one, two, and three classifiers are correct, as in the sketch below. Now, many boosting algorithms allow you to give some weak classifiers more weight than others, though the one we look at on the D-Wave doesn't. Interestingly, the D-Wave boosting algorithm does pretty well even in spite of being simpler.
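Here is that check written out as a quick sketch, assuming the three classifiers are independent and each right 70% of the time; it just sums the probability of every outcome in which at least two of the three are correct.

```python
from itertools import product

p = 0.7  # each weak classifier's chance of being right (assumed independent)

# Sum the probability over all outcomes where at least 2 of the 3 are correct.
majority_correct = sum(
    (p if c1 else 1 - p) * (p if c2 else 1 - p) * (p if c3 else 1 - p)
    for c1, c2, c3 in product([True, False], repeat=3)
    if c1 + c2 + c3 >= 2
)
print(majority_correct)  # 0.784 -- the ~78% quoted above
```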
All right. So first, definitely don't be scared by the equation here. It's really not that important, and I'm going to try to talk you guys through it. Let's talk about this whole process as a minimization problem, because the D-Wave likes to minimize stuff. Central to machine learning algorithms is the idea of a loss function, or a quantification of how poorly a classifier performs. The idea is that you want to minimize this loss. Now, generally, loss has two parts: we want to minimize the number of misclassifications that our model makes, and we want our model to be as simple as possible. In our case, we have a set of classifiers and we're trying to find which subset of classifiers can be boosted, using majority vote, into the best possible classifier. This scary formula is just a mathification of that. It's an example of a loss function that we actually use for a classifier, and it's the function in BlackBox that we're trying to minimize. I'm mostly including this for people looking at the slides later. I will say, though, that the sign of the w's times the f's is what our boosted classifier guesses the executables are, and that gets compared to whether the executable is actually labeled as malicious or not. If they differ, that's called a misclassification. And after the plus sign is just a term which penalizes using a lot of classifiers. Take a drink.

So the final machine learning ingredient to our classifier is the features we use. We used n-grams, which are a standard type of feature used in document analysis. We obtained these n-grams from the hex dumps of our malware and benign software. You can think of n-grams as a sliding window over text. So if you consider the hex string "deadbeef", I'll go ahead and give an example of 2-gram bytes, or bigrams. Remember that one byte is just two hex digits. So we take our first two bytes, and that's one bigram. Then we take two bytes at an offset of one byte, and that's another bigram. And we keep going until we reach the end of the hex string, so here's the final bigram. So there are three 2-gram bytes in "deadbeef": DEAD, ADBE, and BEEF. Now, we actually used trigrams instead of bigrams, so a three-byte sliding window instead of a two-byte one, as the basis for our classifier, but that doesn't really change much.

There are a few reasons why we chose n-grams over other features. They've been used before on malware with decent results, first off. But we mostly used them because we had no idea how many features the D-Wave could handle. It's easy to generate a large number of features with n-grams and then to pre-process them down to any given number. And it's also trivial to turn these n-grams into weak classifiers: you can simply have whether or not the n-gram is present in the executable be a weak classifier. Obviously, since we're only using 3-grams, what we build won't be as good as state-of-the-art malware classifiers. We don't need it to be the best malware classifier in existence here, though; we're just using it to compare classifiers between the D-Wave machine and standard classical machine learning techniques.
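As a rough illustration of the sliding window and the presence/absence weak classifiers just described, here is a small sketch. The helper names are made up, and a real input would be a full hex dump rather than "deadbeef".

```python
def byte_ngrams(hex_string, n=2):
    """Slide an n-byte window (2 hex digits per byte) over a hex dump,
    one byte at a time, and return the set of n-grams seen."""
    hex_string = hex_string.lower()
    window = 2 * n                      # characters per n-gram
    return {hex_string[i:i + window]
            for i in range(0, len(hex_string) - window + 1, 2)}

print(sorted(byte_ngrams("deadbeef", n=2)))   # ['adbe', 'beef', 'dead']

def weak_classifier(ngram):
    """A weak classifier is just: does this n-gram appear in the executable?
    Returns +1 ('looks malicious') if present, -1 otherwise."""
    return lambda hex_dump: 1 if ngram in byte_ngrams(hex_dump, n=len(ngram) // 2) else -1

is_dead = weak_classifier("dead")
print(is_dead("deadbeef"))   # 1
print(is_dead("cafebabe"))   # -1
```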
Now, hopefully all of that wasn't too painful, and we can get into all the fun stuff. At first glance, the D-Wave looks like it's going to be awesome for classifying malicious executables. There's an algorithm already developed, called QBoost, for using the D-Wave to classify things. The QBoost models in the paper had higher accuracy than at least one standard classical boosting algorithm, AdaBoost. But what's really interesting here is that, depending on the loss function you use, the classifier can be robust to label noise. Generally, if you're wrong about a lot of the samples in your training set, then many algorithms that you apply to it will learn incorrectly. If an algorithm is robust to label noise, however, short of a catastrophic failure of labeling, it'll still generally learn even if a significant number of instances are mislabeled. And as you train a malware classifier, you have to tell it whether each instance is benign or malicious; obtaining that ground truth is hard in this domain, as we've found even in our own lab. But finally, I found during a talk with the creator of QBoost that QBoost doesn't really scale to what's known as Google-size problems.

However, BlackBox handles chunking of problems, and so it supposedly can scale to larger problem sizes. There also was a tutorial for implementing QBoost using that BlackBox software, and it looked pretty easy to do. So that's what we did here. Our goal at the time of this research was to classify executables as either malicious or benign. Of course, there are loads of malicious datasets to choose from. We used VX Heaven, which is a pretty standard dataset for training malware classifiers, although it's starting to show its age; it's like 10 years old by now. However, there's no standard benign software dataset, and this is pretty problematic. For benign executables, we used a combination of executables found in clean Windows XP and 7 installs, the executables resulting from an installation of Cygwin, and certain SourceForge executables, based on some previous work we did. Now, first off, don't do this. We don't claim that this is an acceptable dataset for future malware classification. It's not very diverse or representative of benign executables in general; we're actually trying to solve that problem now. But as a final note on datasets, we do know about the SourceForge adware, and we would like to make the disclaimer that no adware was used in the making of this classifier.

So there's some classical pre-processing that we did before we threw this thing onto BlackBox. First, you'll notice that we have tons more malware than we do benign examples. If we created a program to classify executables that always returned that the executable was malicious, that program would do extremely well on our dataset, even though it's not actually learning what malware is, right? Just like a random number generator that always returns 4 isn't really random. To get around this issue, we sampled with replacement, and there are some upsides and downsides to that. First, the classifiers are faster to train, but we are throwing a lot of information away by doing so. Sampling with replacement also has some good statistical properties for the underlying distribution, if you care about that sort of thing. Now, again, we used trigrams as the basis for our classifier, knowing that what we built here won't be as good as state-of-the-art systems. Though we are not turning any heads with the accuracy of the models we built here, the models will be complex enough to compare accuracy and timing information. And after we've done all this, a simple Python program which uses BlackBox, along with the System 6 D-Wave 2 instance that we have access to, to minimize that scary loss function from earlier, is what actually becomes our malware classifier. It will determine which n-grams, using majority vote, best classify our malware.
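To make that concrete, the function we hand to BlackBox looks roughly like the sketch below: a bit string selects which weak classifiers are switched on, the first term counts disagreements between the majority vote and the labels, and the second term penalizes using lots of classifiers. This is a simplified reconstruction in the spirit of the QBoost-style loss from earlier, not our exact code; lambda_penalty and the helper names are made up.

```python
def boosted_loss(bits, weak_outputs, labels, lambda_penalty=0.1):
    """Score a subset of weak classifiers (bits[i] == 1 means classifier i is used).

    weak_outputs[s][i] is weak classifier i's vote (+1 or -1) on sample s,
    labels[s] is the true label (+1 malicious, -1 benign).
    Lower is better; this is the kind of function BlackBox gets handed."""
    misclassifications = 0
    for votes, label in zip(weak_outputs, labels):
        total = sum(w * v for w, v in zip(bits, votes))
        prediction = 1 if total >= 0 else -1        # unweighted majority vote
        if prediction != label:
            misclassifications += 1
    sparsity = sum(bits)                            # penalize using many classifiers
    return misclassifications + lambda_penalty * sparsity

# tiny toy: 3 weak classifiers voting on 2 samples
weak_outputs = [[+1, -1, +1],    # sample 0
                [-1, -1, +1]]    # sample 1
labels = [+1, -1]
print(boosted_loss([1, 0, 0], weak_outputs, labels))   # 0.1 (no errors, 1 classifier used)
```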
So when we first trained up our classifier, it wasn't doing any better than random chance. We did some digging and found that BlackBox was using up all the D-Wave time that we had allotted. To solve this problem, we needed to increase the amount of time that we allowed BlackBox to search for a solution. But the question is how much time we actually need to give it to get reasonable accuracy on a problem with a given number of variables. Previous work using BlackBox mostly deals with NP-complete problems, so they all use a rather large and arbitrary 30-minute timeout. And many classical models on the scale of what we built, especially after resampling to smaller numbers of executables, have trained in a few seconds to minutes in the past. So this is an extremely large time.

We originally thought that slimming down the problem in this way would give us a reasonable decrease in the time required to solve it, but we quickly found out that this wasn't the case. Even for minimization problems with small numbers of binary variables, it took over 10 minutes to get decent solutions. But we still pressed on, just in case some accuracy increase might justify a 10-minute model creation time, even for very, very simple models.

So as a result of our pilot study, we decided to restrict our classifier to 32 features, to balance the complexity of the classifier against the time it took to train one. Now, 32 features is a very, very small number of features for machine learning problems, but we found it took almost an hour to train a single model, and we had a limited allotment of time on the D-Wave machine. We kind of naively split those 32 trigrams from earlier into 16 each of benign and malware features. Then we used that same Python code from earlier to train some classifiers, and we noted the time taken to train, the accuracy, and which features were present in the final boosted classifier. We did that on both the D-Wave chip and the D-Wave simulator, which is classical in nature. Using the same features, we compared the D-Wave classifier to several models we built using Weka, which is a bulky machine learning library written in Java. We compared the D-Wave classifier to AdaBoost, J48 decision trees, and random forests. It should be pretty obvious why we compared to those: AdaBoost and QBoost have been compared before and use similar techniques, and J48 and random forests are easy-to-use techniques that work right out of the box and have also been shown, I think, to be pretty good with malware as well.

But I'll just take a minute to let you all look at our results. There are two major things that look super weird about this; maybe you guys can spot them. So, yeah, our first finding is that for quantum speedup, this thing is extremely slow compared to classical algorithms. In fact, the timing for the D-Wave stuff is actually under-reported: we only included the total time that the D-Wave itself was running on our problems, so we don't include any latency caused by the network in this calculation. Remember that BlackBox involves a conversation between classical and quantum hardware. As a side effect, the classical time for the tabu algorithm from BlackBox isn't actually included in the time taken to build the D-Wave classifier. Now, we found the D-Wave to be middle of the road on accuracy, but it takes 10,000 times as long to train. And remember, the other algorithms scale, but we had to heavily restrict our number of features in order to train the D-Wave classifier in a reasonable amount of time. Now, we don't know why it takes so long, but we do have a few guesses. It's possible that BlackBox isn't finding very good embeddings, or maybe the D-Wave isn't actually getting good enough solutions on those embeddings. It could also be that those dead qubits from earlier are really screwing with BlackBox. Or it could be that BlackBox is trying to solve an exponential problem. We don't really know right now.

But our second interesting result is that the D-Wave simulator, which is, again, classical in nature, takes less time to train than the actual D-Wave chip. And that's kind of surprising because, like, why buy a D-Wave when you can just use the D-Wave simulator on your own laptop, right?
We think this might be an artifact of the dead qubits, because the simulator assumes a perfect D-Wave, but it's still really, really weird for the simulator of the D-Wave to outperform the actual chip.

So what does this all mean, right? We found that while it's possible to create a malware classifier using the D-Wave, and that it has similar accuracy to standard machine learning techniques, it's not very practical. There's significant overhead, and we needed to restrict the problem substantially. We don't know exactly where this overhead comes from. It could be from the D-Wave software that embeds arbitrary minimization problems onto the D-Wave chip, or it could come from the D-Wave chip itself not finding good enough solutions. However, we're betting that BlackBox is the problem here. Regardless, it seems that the D-Wave isn't quite ready for even this sort of toy problem, much less the real-world malware problem that we currently deal with. So we probably really should have stuck with QBoost here.

Even though it's not ready now, there are still some areas we'll look into before closing this issue. There are a few things you could try to get around this timing problem. We could just wait: the D-Wave chip size is supposed to double every couple of years, and defects should decrease over time. Each added qubit should even exponentially increase the size of the problem that the D-Wave chip can handle. It's possible that the next-generation chip, or the chip after that, will be fast enough for this method to compare well to standard models. But of course, waiting is no fun, right? Solving an embedding of the problem directly onto a particular chip, rather than using heuristics for the embedding, is what I think is probably the best route. We did also notice that the D-Wave classifier often used fewer or different features than the classical algorithms we compared it to. So it's possible that QBoost might be useful for some other purpose, like feature selection or feature preprocessing. But that timing issue is still there.

Other than the D-Wave issues, we noticed that most infosec datasets are out of date and relatively small, and private researchers regard datasets and features as being part of their secret sauce for classification. These facts combine to make it really, really hard to reproduce results or effectively evaluate our own creations. And that's a challenge we really, really need to overcome as a field.

So actually, that's the end of my talk. I heard a lot of people were flying out later today, and I wanted to make sure we had a lot of time for questions. So that's my experience with programming a D-Wave 2 to build a malware classifier, and I hope you all enjoyed it. I can go ahead and field questions if anyone has any, or I guess I could step down and meet with people one-on-one; I'm honestly kind of better at that sort of thing anyways. So, anyone have any?

Right. So I haven't actually seen any studies on the scaling of qubits here, and I really, really would like to in the future. It's definitely possible to do. So yeah, I think that's a good next step, to see how the new D-Wave chips will actually fare in the future. Oh, yes, sorry. If anyone has questions, please use the mic. All right, cool. Thank you guys.