Our next talk is about how risky the software you use is. You may have heard about Trump versus a Russian security company. We won't judge this, we won't comment on this, but we dislike the prejudgments in that case. Tim Carstens and Parker Thompson will tell you a little bit more about how risky the software you use is. Tim Carstens is CITL's acting director, and Parker Thompson is CITL's lead engineer. Please welcome, with a very, very warm applause, Tim and Parker. Thanks. Thank you. Howdy. Howdy. So my name is Tim Carstens. I'm the acting director of the Cyber Independent Testing Lab. That's four words there. We'll talk about all four today, especially Cyber. With me today is our lead engineer, Parker Thompson. Not on stage are our other collaborators, Patrick Stach and Sarah Zatko. And present in the room, but not on stage, Mudge. So today, we're going to be talking about our work. The lead-in, the introduction that was given, is phrased in terms of Kaspersky and all that. I'm not going to be speaking about Kaspersky. And I guarantee you, I'm not going to be speaking about my president. Right? Yeah? OK. Thank you. All right. So why don't we go ahead and kick off? I'll mention now, parts of this presentation are going to be quite technical. Not most of it. And I will always include analogies and all of these other things if you are here in security but are not a bit twiddler. But if you do want to be able to review some of the technical material, if I go through it too fast, or if you would like to read it as a mathematician or a computer scientist, our slides are already available for download at this site here. We thank our partners at Palo D'Or for getting that set up for us. So let's get started on the real material here. All right. So we are CITL, a nonprofit organization based in the United States, founded by our chief scientist, Sarah Zatko, and our board chair, Mudge. And our mission is a public good mission.
We are hackers, but our mission here is actually to look out for people who do not know very much about machines, or as much as the other hackers do. Specifically, we seek to improve the state of software security by providing the public with accurate reporting on the security of popular software. And so there's a mouthful for you. But no doubt, every single one of you has received questions of the form: What do I run on my phone? What do I do with this? What do I do with that? How do I protect myself? All of these other things. Lots of people in the general public are looking for agency in computing, and no one's offering it to them. And so we're trying to go ahead and provide a forcing function on the software field in order to, again, be able to enable consumers and users and all these things. Our social good work is funded largely by charitable monies from the Ford Foundation, whom we thank a great deal. But we also have major partnerships with Consumer Reports, which is a major organization in the United States that generally, broadly, looks at consumer goods for safety and performance, and we are also partners in the Digital Standard, which probably would be of great interest to many people here at Congress, as it is a holistic standard for protecting user rights. We'll talk about some of the work that goes into those things here in a bit. But first, I want to give the big picture of what it is that we're really trying to do in one short little sentence. Something like this, but for software security. What are the important facts? How does it rate? Is it easy to consume? Is it easy to go ahead and look and say, this thing is good, this thing is not good? Something like this, but for software security. Sounds hard, doesn't it? So I want to talk a little bit about what I mean by something like this.
There are lots of consumer advocacy, watchdog, and protection groups, some private, some government, which are looking to do this for various things that are not software security. And you can see some examples here that are big in the United States. I happen to not like these as much as some of the newer consumer labels coming out from the EU. But nonetheless, they are examples of the kinds of things people have done in other fields, fields that are not security, to try to achieve that same end. And when these things work well, it is for three reasons. One, it has to contain the relevant information. Two, it has to be based in fact. We're not talking opinions; this is not a book club or something like that. And then three, it has to be actionable. You have to be able to know how to make a decision based on it. How do you do that for software security? The rest of the talk is going to go in three parts. First, we're going to give a bit of an overview of the more consumer-facing side of what we do, and look at some data that we have reported on early and all these other kinds of good things. We're then going to go ahead and get terrifyingly, terrifyingly technical. And then after that, we'll talk about tools to actually implement all of this stuff. The technical part comes before the tools, so that just tells you how terrifyingly technical we're going to get. It's going to be fun. So how do you do this for software security? A consumer version. So if you set forth on the task of trying to measure software security, and many people here probably do work in the security field, perhaps as consultants doing reviews, as I certainly used to, then probably what you're thinking to yourself right now is that there are lots and lots and lots and lots of things that affect the security of a piece of software.
Some of them you're only going to see if you go reversing, and some of them are just kicking around on the ground, waiting for you to notice. So we're going to talk about both of those kinds of things that you might measure. Here you see these giant charts: on the left, we have Microsoft Excel on OSX; on the right, Google Chrome for OSX. This is a couple of years old at this point, maybe one and a half years old. I'm not expecting you to be able to read these. The real point is to say, look at all of the different things you can measure very easily. How do you distill it? How do you boil it down? So this is the opposite of a good consumer safety label. If you've ever done any consulting, this is the kind of report you hand a client to tell them how good their software is. It's the opposite of consumer grade. But the reason I'm showing it here is because I'm going to call out some things, and maybe you can't process all of this because it's too much material, but once I call them out, just like an NP problem, you're going to recognize them instantly. So for example, Excel, at the time of this review: look at this column of dots. What are these dots telling you? They're telling you, look at all these libraries. All of them are 32-bit only, not 64-bit. Take a look at Chrome. Exact opposite: 64-bit binary. What are some other things? Excel, again, on OSX. Maybe you can see these danger-warning lines that go straight up the whole thing. That's the absence of major exploit-mitigation (hardening) flags in the binary headers. We'll talk about what that means exactly in a bit. But also, if you hop over here, you'll see, yeah, Chrome has all of the different hardening protections that a binary might enable, on OSX that is, but it also has more dots in this column here off to the right. And what do those dots represent?
Those dots represent functions, functions that historically have been a source of trouble; they're very hard to call correctly. If you're a C programmer, the gets function is a good example, but there are lots of them. And you can see here that Chrome doesn't mind; it uses them all a bunch. And Excel, not so much. And if you know the history of Microsoft and the Trustworthy Computing initiative and the SDL and all of that, you will know that a very long time ago, Microsoft made a decision and said, we're gonna start purging some of these risky functions from our code bases, because we think it's easier to ban them than to teach our devs to use them correctly. And you see that reverberating out in their software. Google, on the other hand, says, yeah, those functions can be dangerous to use, but if you know how to use them, they can be very good, and so they're permitted. The point all of this is building to is that if you start by just measuring every little thing that your static analyzers can detect in a piece of software, two things happen. One, you wind up with way more data than you can show in a slide. And two, the engineering process, the software development life cycle that went into the software, will leave behind artifacts that tell you something about the decisions that went into designing that engineering process. And so Google, for example, is quite rigorous as far as hitting GCC with the flags that enable all of the compiler protections. Microsoft may be less good at that, but much more rigorous in things that were very popular ideas when they introduced Trustworthy Computing, right? So the big takeaway from this material is that, again, the software engineering process results in artifacts in the software that people can find, right? Okay, so that's a whole bunch of data. Certainly it's not a consumer-friendly label. So how do you start to get in towards the consumer zone?
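To make the kind of "easy" static observables above concrete, here is a minimal sketch, not CITL's actual tooling: it reads a couple of fields from an ELF header (field offsets per the ELF specification, assuming a little-endian file) to see whether a binary is 64-bit and position-independent, and flags risky libc imports. The risky-function list is illustrative only.

```python
import struct

# Illustrative list only; real analyzers track many more risky functions.
RISKY_FUNCTIONS = {"gets", "strcpy", "sprintf", "strcat"}

def elf_observables(header: bytes, imports: set) -> dict:
    """Extract a few cheap security observables from an ELF header."""
    assert header[:4] == b"\x7fELF", "not an ELF file"
    is_64bit = header[4] == 2                      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    e_type = struct.unpack_from("<H", header, 16)[0]  # assumes little-endian ELF
    return {
        "64bit": is_64bit,
        "pie": e_type == 3,                        # ET_DYN: position-independent
        "risky_imports": sorted(RISKY_FUNCTIONS & imports),
    }
```

A 64-bit, position-independent binary that imports `gets` would come back as `{"64bit": True, "pie": True, "risky_imports": ["gets"]}`; tools like `checksec` do a fancier version of this same header inspection.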
Well, the main defect of the big reports that we just saw is that it's too much information. It's very dense on data, but it's very hard to distill it to the "so what" of it, right? And so this here is one of our earlier attempts to go ahead and do that distillation. What are these charts? How did we come up with these? Well, on the previous slide, we saw all these different factors that you can analyze in software. Basically, here's how we arrive at this: for each of those things, pick a weight. Go ahead and compute a score, average against the weights, and ta-da, now you have some number. You can do that for each of the libraries in the piece of software. And if you do that for each of the libraries in the software, you can then go ahead and produce these histograms to show, you know, this percentage of the DLLs had a score in this range. Boom, there's a bar, right? How do you pick those weights? We'll talk about that in a sec. It's very technical, but the takeaway is that you end up with these charts. Now, I've obscured the labels, and the reason I've done that is because I don't really care that much about the actual counts. I wanna talk about the shapes of these charts. It's a qualitative thing. So here, good scores appear on the right, bad scores appear on the left. The histogram measures all the libraries and components, and so a very secure piece of software in this model manifests as a tall bar far to the right. And you can see a clear example of that in our custom Gentoo build. Anyone here who is a Gentoo fan knows: hey, I'm gonna install this thing, and I think I'm gonna go ahead and turn on every single one of those flags. And lo and behold, if you do that, yeah, you wind up with a tall bar far to the right. Here's Ubuntu 16, I bet it's 16.04 LTS, but I don't recall exactly. Here you see a lot of tall bars to the right, not quite as consolidated as a custom Gentoo build, but that makes sense, doesn't it, right?
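The weight-average-bucket recipe just described can be sketched in a few lines. The features and weights below are made up for illustration; the actual weights CITL uses are not public in this talk.

```python
# Hypothetical weights; CITL's real model weights are not given in the talk.
WEIGHTS = {"aslr": 3.0, "relro": 2.0, "stack_guard": 2.0, "no_risky_funcs": 1.0}

def library_score(features: dict) -> float:
    """Weighted average over observed features, normalized to [0, 1]."""
    total = sum(WEIGHTS.values())
    return sum(w for name, w in WEIGHTS.items() if features.get(name)) / total

def histogram(scores, buckets=4):
    """Bucket per-library scores; counts[-1] is the 'tall bar far to the right'."""
    counts = [0] * buckets
    for s in scores:
        counts[min(int(s * buckets), buckets - 1)] += 1
    return counts
```

Scoring every library in a distro and plotting `histogram(...)` is essentially how the slide's per-distro charts are produced: a hardened build piles its mass into the last bucket, while a mixed bag spreads out toward the middle.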
Because you don't do your whole Ubuntu build yourself. Now I wanna contrast. So over here on the right, we see, in the same model, an analysis of the firmware obtained from two smart televisions. Last year's models from Samsung and LG, and here are the model numbers. We did this work in concert with Consumer Reports. And what do you notice about these histograms, right? Are the bars tall and to the right? No, they look almost normal, not quite, but that doesn't really matter. The main thing that matters is that this is the shape you would expect to get if you were basically playing a random game to decide what security features to enable in your software. This is the shape of not having a security program, is my bet. And so what do you see? You see heavy concentration here in the middle, right? That seems fair, and it tails off. On the Samsung, nothing scored all that great; same on the LG. Both of them are running their respective operating systems, and they're basically just inheriting whatever security came from whatever open source thing they forked, right? So this is the kind of message, this right here is the kind of thing that we exist to serve. This is us producing charts showing that the current practices in the not-so-consumer-friendly space of running your own Linux distros far exceed the products being delivered, certainly in this case in the smart TV market, but I think you might agree with me: it's much worse than just that, yeah. Let's dig into that a little bit more. I have a different point that I wanna make about that same data set. So this table here is again looking at the LG, Samsung, and Gentoo Linux installations. And on this table, we're just pulling out some of the easy-to-identify security features you might enable in a binary, right? So, percentage of binaries with address space layout randomization. Let's talk about that. On our Gentoo build, it's over 99%.
That also holds for the Amazon Linux AMI, and it holds in Ubuntu. ASLR is incredibly common in modern Linux. And despite that, fewer than 70% of the binaries on the LG television had it enabled. The Samsung was doing better than that, I guess, but 80% is pretty disappointing when a default install of a mainstream Linux distro is gonna get you 99, right? And it only gets worse. RELRO support: if you don't know what that is, that's okay, but if you do, look at this abysmal coverage coming out of these IoT devices, very sad. And you see it over and over and over again. I'm showing this because some people in this room, or watching this video, ship software, and I have a message to those people who ship software who aren't working on, say, Chrome or any of the other big-name, Pwn2Own kinds of targets. Look at this: you can be leading the pack by mastering the fundamentals. This is a point that, as a security field, we really need to be driving home. You know, one of the things that we're seeing here in our data is that if you're the vendor who is shipping the product everyone has heard of in the security field, then maybe your game is pretty decent, right? If you're shipping, say, Windows, or if you're shipping Firefox or whatever. But if you're doing one of these things where people are just kind of beating you up for default passwords, then your problems go way further than just default passwords, right? Like the house is messy, it needs to be cleaned. So for the rest of the talk, like I said, we're gonna be discussing a lot of other things that amount to getting a peek behind the curtain, and where some of these things come from, and getting very specific about how this business works.
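The coverage table being described reduces to one simple computation per cell: for each firmware image, the fraction of its binaries that enable a given mitigation. A sketch, with invented data standing in for the real scan results:

```python
def coverage(binaries: list, feature: str) -> float:
    """Percentage of binaries in an image that enable `feature`."""
    return 100.0 * sum(1 for b in binaries if b.get(feature)) / len(binaries)

# Made-up stand-ins for real firmware scans: a hardened Gentoo build where
# nearly everything is ASLR-enabled, versus a TV image where much is not.
gentoo_build = [{"aslr": True}] * 99 + [{"aslr": False}]
tv_firmware  = [{"aslr": True}] * 7 + [{"aslr": False}] * 3
```

Here `coverage(gentoo_build, "aslr")` gives 99.0 and `coverage(tv_firmware, "aslr")` gives 70.0, mirroring the gap between the mainstream distros and the LG television in the table.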
But if you're interested in more of the high-level material, especially if you're interested in interesting results and insights, some of which I'm gonna have here later, I really encourage you to take a look at the talk from this past summer by our Chief Scientist, Sarah Zatko, which is predominantly on the topic of surprising results in the data. Today, though, this being our first time presenting here in Europe, we figured we would take more of an overarching kind of view: what we're doing, why we're excited about it, and where it's headed. So we're about to move into a little bit of the underlying theory. You know, why do I think it's reasonable to even try to measure the security of software from a technical perspective? But before we can get into that, I need to talk a little bit about our goals, so that the decisions in the theory, the motivation, are clear, right? Our goals are really simple. It's a very easy organization to run because of that. Goal number one, remain independent of vendor influence. We are not the first organization to purport to be looking out for the consumer. But unlike many of our predecessors, we are not taking money from the people we review, right? Seems like some basic stuff. Thank you, okay. Two, automated, comparable, quantitative analysis. Why automated? Well, we need our test results to be reproducible. "Tim goes in, opens up your software in IDA, and finds a bunch of stuff that makes him all stoked" is not a very repeatable kind of standard for things. And so we're interested in things which are automated. Maybe a few hackers in here know how hard that is; we'll talk about that. But then lastly, we're acting as a watchdog. We're protecting the interests of the user, the consumer, however you would like to look at it. But we also have three non-goals that are equally important. One, we have a non-goal of finding and disclosing vulnerabilities.
I reserve the right to find and disclose vulnerabilities, but that's not my goal. Another non-goal is to tell software vendors what to do. If a vendor asks me how to remediate their terrible score, I will tell them what we are measuring, but I'm not there to help them remediate it. It's on them to be able to ship a secure product without me holding their hand. We'll see. And then three, a non-goal: perform free security testing for vendors. Our testing happens after you release, because when you release your software, you are telling people it is ready to be used. Is it really, though? Yeah, thank you. So we are not there to give you a preview of what your score will be. There is no sum of money you can hand me that will get you an early preview of what your score is. You can try me. There's a fee for trying me, but I'm not gonna look at your stuff until I'm ready to drop it, right? All right, so moving into this theory territory, there are three big questions that need to be addressed if you wanna do our work efficiently. One, what works? What works for improving security? What are the things that you need, or really want, to see in software? Two, how do you recognize when it's being done? It's no good if someone hands you a piece of software and says, I've done all of the latest things, and it's a complete black box. If you can't check the claim, the claim is as good as false in practical terms, period, right? Software has to be reviewable, or a priori, I think you're full of it. And then three, who's doing it? Of all the things that work and that you can recognize, who's actually doing them? You know, our field is famous for ruining people's holidays and weekends over Friday bug disclosures, New Year's Eve bug disclosures.
I would like us to also be famous for calling out those teams and those software organizations which are being as good as the bad guys are being bad. Yeah, so provide someone an incentive to maybe be happy to see us for a change, right? Okay, so thank you, all right. So how do we actually pull these things off? The basic idea. So I'm gonna get into some deeper theory. If you're not a theorist, I want you to focus on this slide, and I'm gonna bring it back. It's not all theory from here on out after this, but if you're not a theorist, I really want you to focus on this slide. The basic motivation behind what we're doing, the technical motivation, why we think that it's possible to measure and report on security, all boils down to this, right? So we start with a thought experiment, of the Gedanken kind, right? Given a piece of software, we can ask: one, overall, how secure is it? Kind of a vague question, but you can imagine there are versions of that question. And two, what are its vulnerabilities? Maybe you wanna nitpick with me about what the word vulnerability means, but broadly, you know, this is a much more specific question, right? And here's the enticing thing. The first question appears to ask for less information than the second question. And if we were taking bets, I would put my money on yes, it actually does ask for less information. What do I mean by that? Well, let's say that someone told you all of the vulnerabilities in a system. They said, hey, I got them all. You're like, all right, that's cool. And someone asks you, hey, how secure is this system? You can give them a very precise answer. You can say it has n vulnerabilities, and of this kind, and all this stuff, right? So certainly the second question is enough to answer the first. But is the reverse true?
Namely, if someone were to tell you, for example, hey, this piece of software has exactly 32 vulnerabilities in it, does that make it easier to find any of them? Right? There's room for it to maybe do that, using some algorithms that are not yet in existence. Certainly the computer scientists in here are saying, well, yeah, maybe counting the number of SAT solutions doesn't help you practically find solutions, but it might, and we just don't know. Okay, fine. Maybe these things are the same, but my experience in security, and the experience of many others perhaps, is that they probably aren't the same question. And this motivates what I'm calling here Zatko's question, which is basically asking for an algorithm that demonstrates that the first question is easier than the second question, right? So Zatko's question: develop a heuristic which can efficiently answer one, but not necessarily two. If you're looking for a metaphor, if you wanna know why I care about this distinction, I want you to think about certain controversial technologies. Maybe think about, say, nuclear technology, right? An algorithm that answers one, but not two, is a very safe algorithm to publish. Very safe indeed. Okay, Claude Shannon would like more information. Happy to oblige. Let's take a look at this question from a different perspective, maybe a more hands-on perspective, the hacker perspective, right? If you're a hacker and you're watching me up here, and I'm waving my hands around and I'm showing you charts, maybe you're thinking to yourself, yeah, boy, what do you got, right? How does this actually go? And maybe what you're thinking to yourself is that finding good vulns, that's an artisan craft, right? You're in IDA, you know, you're reversing in Olly, you're doing all these things, I don't know, all that stuff.
And, you know, that kind of clever game, that cleverness, doesn't feel like something very automatable. But on the other hand, there are a lot of tools that do automate things, and so it's not completely not automatable. And if you're into fuzzing, then perhaps you are aware of this very simple observation: if your harness is perfect, if you really know what you're doing, if you have a decent fuzzer, then in principle, fuzzing can find every single problem. You have to be able to look for it, you have to be able to harness for it, but in principle, it will, right? So the hacker perspective on Zatko's question is maybe of two minds. On the one hand, assessing security is a game of cleverness, but on the other hand, we're right now at the cusp of having some game-changing tech. Maybe you're saying fuzzing is not at the cusp; I promise it is just at the cusp. We haven't seen all that fuzzing has to offer, right? And so maybe there's room for some automation to be possible in pursuit of Zatko's question. Of course, there are many challenges still in using existing hacker technology, mostly in the form of various open questions. For example, if you're into fuzzing: identifying unique crashes, there's an open question. We'll talk about some of those. But I'm gonna offer another perspective here. So maybe you're not in the business of doing software reviews, but you know a little computer science. And maybe that computer science has you wondering, what's this guy talking about, right? I'm here to acknowledge that. So whatever you think the word security means, I've got a list of questions up here, and probably some of these questions are relevant to your definition, right? Does the software have a hidden backdoor or any kind of hidden functionality? Does it handle crypto material correctly, et cetera, so forth?
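On that open question of identifying unique crashes: a common, admittedly imperfect heuristic (used in various triage tools, and assumed here purely as an illustration) is to bucket crashes by a hash of the top few frames of the call stack, so that many raw crashes collapse into a handful of distinct bugs.

```python
import hashlib

def crash_bucket(stack: list, depth: int = 3) -> str:
    """Bucket a crash by hashing its top `depth` stack frames."""
    top = "|".join(stack[:depth])
    return hashlib.sha256(top.encode()).hexdigest()[:12]

def unique_crashes(crashes: list) -> int:
    """Count distinct buckets among raw crash reports (each a frame list)."""
    return len({crash_bucket(stack) for stack in crashes})
```

Two crashes that diverge only deep in the stack land in the same bucket, which is exactly why the question stays open: the heuristic both over-merges distinct bugs and splits a single bug that crashes at different depths.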
Anyone in here who knows some computability theory knows that every single one of these questions, and many others like them, are undecidable, due to reasons essentially no different than the reason the halting problem is undecidable, which is to say due to reasons essentially first identified and studied by Alan Turing a long time before we had microarchitectures and all these other things. And so the computability perspective says that, whatever your definition of security is, ultimately you have this recognizability problem, a fancy way of saying that algorithms won't be able to recognize secure software because of the undecidability of these issues. The takeaway is that the computability angle on all of this says: anyone who's in the business that we're in has to use heuristics. You have to. This guy gets it. All right, so on the tech side, the last technical perspective that we're gonna take now is certainly the most abstract, which is the Bayesian perspective, right? So if you're a frequentist, you need to get with the times. It's all Bayesian now. So let's talk about this for a bit. Only two slides of math, I promise, only two. So let's say that I have some corpus of software. Perhaps it's the collection of all modern browsers. Perhaps it's the collection of all the packages in the Debian repository. Perhaps it's everything on GitHub that builds on this system. Perhaps it's a hard drive full of warez that some guy mailed you, right? You have some corpus of software, and for a random program in that corpus, we can consider this probability, the probability distribution of which software is secure versus which is not. For reasons described in the computability perspective, this number is not a computable number for any reasonable definition of security. So that's neat. And so in practical terms, if you wanna do some probabilistic reasoning, you need some surrogate for that.
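The undecidability claim follows the standard halting-problem reduction pattern. As a sketch under obvious simplifications: given any program, we can mechanically build a wrapper that runs it and then triggers a "backdoor", so a perfect backdoor detector would tell us whether the original program halts, which Turing proved no algorithm can do.

```python
def wrap_with_backdoor(program_src: str) -> str:
    """Build a program that first runs `program_src`, then fires a 'backdoor'.

    The backdoor executes if and only if `program_src` halts. So any perfect
    detector for hidden functionality would also decide the halting problem,
    which is impossible; hence 'does it have a backdoor?' is undecidable too.
    """
    payload = "open('/tmp/backdoor_marker', 'w')  # the 'hidden functionality'"
    return program_src + "\n" + payload + "\n"
```

The same construction works for the other questions on the slide (crypto handling, hidden functionality in general), which is why heuristics are unavoidable.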
And so we consider this here. Instead of considering the probability that a piece of software is secure, a non-computable, non-verifiable claim, we take a look at this indexed collection of probabilities. This is a countably infinite family of probability distributions. Basically, p(H, K) is just the probability that, for a random piece of software in the corpus, H work units of fuzzing will find no more than K unique crashes, right? And why is this relevant? Well, at the bottom we have this analytic observation about the limit as H goes to infinity. You're basically saying, hey, if I fuzz this thing forever, what does that look like? And essentially here we have, analytically, that this should converge: p(H, 1) should converge, as H goes to infinity, to the probability that a piece of software simply cannot be made to crash. Not the same thing as being secure, but certainly not a small concern relevant to security. So none of that was actually Bayesian yet. We need to get there, and so here we go. The previous slide described a probability distribution measured based on fuzzing, but fuzzing is expensive, and it is also not an answer to Zatko's question, because it finds vulnerabilities; it doesn't measure security in the general sense. And so here's where we make the jump to conditional probabilities. Let M be some observable property of software: has ASLR, has RELRO, calls these functions, doesn't call those functions, take your pick. For an M and an S, we now consider these conditional probability distributions. This is the same kind of probability as we had on the previous slide, but conditioned on this observable being true. And this leads to the refined, subtle variant of Zatko's question: which observable properties of software satisfy that, when the software has property M, the probability of fuzzing being hard is very high? That's what this version of the question phrases.
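Empirically, both quantities just defined can be estimated by counting over a corpus of fuzzing records. A minimal sketch, with an invented record format: each program carries its set of observables and a map from fuzzing budget H to unique crashes found.

```python
def p_hk(corpus: list, H: int, K: int) -> float:
    """Estimate p(H, K): fraction of programs where H work units of
    fuzzing found no more than K unique crashes."""
    hits = [prog for prog in corpus if prog["crashes"].get(H, 0) <= K]
    return len(hits) / len(corpus)

def p_hk_given_m(corpus: list, H: int, K: int, M: str) -> float:
    """Estimate the conditional p(H, K | M) by restricting the corpus
    to programs exhibiting observable M."""
    with_m = [prog for prog in corpus if M in prog["observables"]]
    return p_hk(with_m, H, K)
```

An observable M is interesting for Zatko's question exactly when `p_hk_given_m(...)` stays much higher than the unconditional `p_hk(...)` as the budget H grows.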
And here we want that probability to stay large even when log H is large relative to K. In other words, exponentially more fuzzing than crashes you expect to find. So this is the technical version of what we're after. All of this can be explored. You can brute force your way to finding all of this stuff. That's exactly what we're doing. Yeah. So we're looking for all kinds of things that correlate with fuzzing having low yield on a piece of software. And there's a lot of ways in which that can happen. It could be that you are looking at a feature of software that literally prevents crashes. Maybe it's the never-crash flag, I don't know, right? But most of the things I've talked about, ASLR, RELRO, et cetera, don't prevent crashes. In fact, ASLR can take non-crashing programs and make them crash. It's the number one reason vendors don't enable it, right? So why am I talking about ASLR? Why am I talking about RELRO? Why am I talking about all these things that have nothing to do with stopping crashes when I'm claiming I'm measuring crashes? It's because, in the Bayesian perspective, correlation is not the same thing as causation, right? It could be that M's presence literally prevents crashes, but it could also be that, by some underlying coincidence, the things we're looking for are mostly only found in software that's robust against crashing. If you're looking for security, I submit to you that the difference doesn't matter. Okay, end of my math. Danke. We'll now go ahead and do a really nice analogy for all those things that I just described, right? So we're looking for indicators of a piece of software being secure enough to be good for consumers, right? So here's an analogy. Let's say you're a geologist. You study minerals and all of that, and you're looking for diamonds. Who isn't, right? Want those diamonds? And how do you find diamonds? Even in places that are rich in diamonds, diamonds are not common.
You don't just go walking around in your boots, kicking around until your toe stubs on a diamond, right? You don't do that. Instead, you look for other minerals that are mostly only found near diamonds, but are much more abundant in those locations than the diamonds, right? And so this is mineral science 101, I guess, I don't know. So for example, you wanna go find diamonds? Put on your boots and go kicking until you find some chromite. Look for some diopside, you know? Look for some garnet. None of these things turn into diamonds. None of these things cause diamonds. But if you're finding good concentrations of these things, then statistically, there are probably diamonds nearby. That's what we're doing. We're not looking for the things that cause good security per se. Rather, we're looking for the indicators that you have put the effort into your software, right? How's that working out for us? Well, we're still doing studies. It's early to say exactly, but we do have the following interesting coincidence. And so here, presented, I have a collection of prices that somebody gave Mudge for so-called underground exploits. And I can tell you these prices are maybe a little low these days, but if you work in that business, if you go to SyScan, if you do that kind of stuff, maybe you know that this is a ballpark, right? And it's just a coincidence, maybe it means we're on the right track, I don't know, but it's an encouraging sign: when we run these programs through our analysis, our rankings more or less correspond to the actual prices that you'll encounter in the wild for access via these applications. Up above, I have one of our histogram charts. You can see here that Chrome and Edge in this particular model scored very close to the same, and it's a test model, so let's say they're basically the same.
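One simple way to check the "rankings correspond to prices" claim is a pairwise-concordance count, the idea behind Kendall's tau: for every pair of programs, do the model score and the exploit price order them the same way? The scores and prices below are invented placeholders, not the slide's actual numbers.

```python
# Hypothetical model scores and exploit prices (in $k); illustration only.
scores = {"chrome": 0.90, "edge": 0.89, "firefox": 0.75, "safari": 0.80}
prices = {"chrome": 60, "edge": 55, "firefox": 30, "safari": 40}

def concordant_pairs(a: dict, b: dict):
    """Count pairs of keys that the two rankings order the same way."""
    keys = sorted(a)
    agree, total = 0, 0
    for i, x in enumerate(keys):
        for y in keys[i + 1:]:
            total += 1
            if (a[x] - a[y]) * (b[x] - b[y]) > 0:  # same sign => same ordering
                agree += 1
    return agree, total
```

With these made-up numbers every pair is concordant, which is the shape of result the talk describes: a model that never looked at a price list reproducing the market's ordering.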
Firefox is behind there a little bit. I don't have Safari on this chart because these are all Windows applications, but the Safari score falls in between. So lots of theory, lots of theory, lots of theory, and then we have this. So we're gonna go ahead now and hand off to our lead engineer, Parker. He's gonna talk about some of the concrete stuff, the non-chalkboard stuff, the software stuff that actually makes this work. Yeah, so I wanna talk about the process of actually doing it: building the tooling that's required to collect these observables. Effectively, how do you go mining for indicator minerals? But first, the progression of where we are and where we're going. We initially broke this out into three major tracks of our technology. We have our static analysis engine, which started as a prototype; we have now recently completed a much more mature and solid engine that's allowing us to be much more extensible, dig deeper into programs, and provide much deeper observables. Then we have the data collection and data reporting. Tim showed some of our early stabs at this, but we're right now in the process of building new engines to make the data more accessible and easy to work with, and hopefully more of that will be available soon. Finally, we have our fuzzer track. We needed to get some early data, so we played with some existing off-the-shelf fuzzers, including AFL, and while that was fun, unfortunately it's a lot of work to manually instrument a lot of fuzzers for hundreds of binaries. So we then built an automated solution that got us closer to having a fuzzing harness that could auto-generate itself depending on the software's behavior. But unfortunately, that technology showed us more deficiencies than successes. So we are now working on a much more mature fuzzer that will allow us to dig deeper into programs as we're running them and collect very specific things that we need for our model and our analysis. 
Our analytic pipeline today: this is one of the most concrete components of our engine and one of the most fun. We effectively wanted some type of software hopper where you could just pour programs in, installers and all, and out the other end come reports, fully annotated, actionable information that we can present to people. So we went about the process of building a large-scale engine. It starts off with a simple REST API where we can push software in, which then gets moved over to our computation cluster. That effectively provides us a fabric to work with. It's made up of a lot of different software suites, starting off with our data processing, data handling, and data analysis, which are done in Apache Spark. And then we have the common HDFS layer to provide a place for the data to be stored, and a resource manager in YARN. All of that is backed by our compute and data nodes, which scale out linearly. That then moves into our data science engine, which is effectively Spark with Apache Zeppelin, which provides us a really fun interface where we can work with the data in an interactive manner while kicking off large-scale jobs into the cluster. And finally, this goes into our report generation engine. What this bought us was the ability to scale linearly and make that hopper bigger and bigger as we need, but also a way to process data that doesn't fit in a single machine's RAM. You can push the instance sizes as large as you want, but we have data sets that blow away any single host's RAM. So this allows us to work with really large collections of observables. I wanna dive down now into our actual static analysis, but first we have to explore the problem space, because it's a nasty one. Effectively, CITL's mission is to process as much software as possible, hopefully all of it, but it's hard to get your hands on all the binaries that are out there. 
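As a mental model, the hopper can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in (the real system is a REST front end feeding Spark jobs over HDFS); the sketch only shows the shape of the flow: binaries in, per-binary observables, aggregated report out.

```python
from dataclasses import dataclass

@dataclass
class Binary:
    name: str
    data: bytes

def format_analyzer(b: Binary) -> dict:
    # Hypothetical stand-in: pretend the first byte records whether
    # ASLR was enabled at build time.
    return {"aslr": bool(b.data and b.data[0] & 1)}

def code_analyzer(b: Binary) -> dict:
    # Hypothetical stand-in observable: size of the code.
    return {"code_bytes": len(b.data)}

def analyze(b: Binary) -> dict:
    obs = {"name": b.name}
    obs.update(format_analyzer(b))
    obs.update(code_analyzer(b))
    return obs

def report(observations: list[dict]) -> dict:
    # Aggregation step: e.g. the fraction of binaries shipping with ASLR.
    n = len(observations)
    aslr_on = sum(o["aslr"] for o in observations)
    return {"count": n, "aslr_fraction": aslr_on / n if n else 0.0}

# Pour programs into the hopper, get a report out the other end.
hopper = [Binary("a.exe", b"\x01abc"), Binary("b.exe", b"\x00defg")]
summary = report([analyze(b) for b in hopper])
```

In the real pipeline each analyze step runs as a distributed job and the report step is a notebook query against the cluster, but the data flow has roughly this shape.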
When you start to look at that problem, you understand there's a lot of combinations. There's a lot of CPU architectures. There's a lot of operating systems. There's a lot of file formats. There's a lot of environments the software gets deployed into, and every single one of them has its own app armoring features. And they can be specifically set for one combination but not another, and you don't wanna penalize a developer for not turning on a feature they never had access to turn on. So effectively, we need to solve this in a much more generic way. And so what we did is build our static analysis engine as, effectively, a gigantic collection of abstraction libraries for handling binary programs. You take in some type of input file, be it ELF, PE, or Mach-O, and then the pipeline splits. It goes off into two major analyzer classes. First, our format analyzers, which look at the software much like a linker or a loader would look at it: I want to understand how it's gonna be loaded up and what type of armoring features are gonna be applied, and then we can run analyzers over that. In order to achieve that, we need abstraction libraries that can provide us an abstract memory map, a symbol resolver, and generic section properties. So all that feeds in, and then we run a collection of analyzers over it to collect data and observables. Next we have our code analyzers. These are the analyzers that run over the code itself and need to be able to look at every possible executable path. In order to do that, we need to do function discovery, feed that into a control flow recovery engine, and then, as a post-processing step, dig through all of the possible metadata in the software, such as a switch table or something like that, to get even deeper into the software. This then provides us a basic list of basic blocks, functions, and instruction ranges, and does so in an efficient manner so we can process a lot of software as it goes. Then all of that gets fed over into the main modular analyzers. 
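To make the format-analyzer idea concrete, here is a toy check of that kind for ELF only. The ET_DYN, PT_GNU_STACK, PT_GNU_RELRO, and PF_X constants are the real ELF values; everything else is simplified, and a real analyzer would parse the headers out of the file rather than take them as arguments.

```python
# Real ELF constants; the analyzer logic below is deliberately simplified.
ET_DYN       = 3           # file type of shared objects and PIE executables
PT_GNU_STACK = 0x6474e551  # program header carrying stack permissions
PT_GNU_RELRO = 0x6474e552  # region made read-only after relocation
PF_X         = 0x1         # "executable" permission bit in p_flags

def armoring_observables(e_type: int, phdrs: list[tuple[int, int]]) -> dict:
    """phdrs is a list of (p_type, p_flags) pairs from the program headers."""
    # Non-executable stack: PT_GNU_STACK must be present *without* PF_X.
    # (If the header is absent, many loaders fall back to an executable
    # stack, so we conservatively score that as no NX.)
    nx = any(t == PT_GNU_STACK and not (f & PF_X) for t, f in phdrs)
    relro = any(t == PT_GNU_RELRO for t, _ in phdrs)
    pie = e_type == ET_DYN
    return {"nx": nx, "relro": relro, "pie": pie}

# A hardened binary: RW (not X) stack marker, a RELRO segment, PIE.
print(armoring_observables(ET_DYN, [(PT_GNU_STACK, 0x6), (PT_GNU_RELRO, 0x4)]))
# -> {'nx': True, 'relro': True, 'pie': True}
```

The linked-library example from the talk falls out of the same check: a dependency whose PT_GNU_STACK carries PF_X drags the whole process back to an executable stack.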
Finally, all of this comes together and gets put into a gigantic blob of observables and fed up to the pipeline. We really want to thank the Ford Foundation for supporting our work in this, because the pipeline and the static analysis have been a massive boon for our project; we're only beginning now to really get our engine running, and we're having a great time with it. So digging into the observables themselves: what are we looking at? Let's break them apart. There are the format structure components, things like ASLR, DEP, and RELRO: basic app armoring that's gonna be enabled at the OS layer when the software gets loaded up or linked. Then we also collect other metadata about the program, such as what libraries are linked in, what its complete dependency tree looks like, and how those libraries score, because that can affect your main software. Interesting example: on Linux, if you link a library that requires an executable stack, guess what, your software now has an executable stack, even if you didn't mark it that way. So we need to be able to understand what ecosystem the software is gonna live in. And the code structure analyzers look at things like functionality. What's the software doing? What type of app armoring is getting injected into the code? A great example of that is something like stack guards or FORTIFY_SOURCE. These are armoring features that only really apply, and can only be observed, inside of the control flow or inside of the actual instructions themselves. This is why control flow graphs are key. We played around with a number of different ways of analyzing software that we could scale out, and ultimately we had to come down to working with control flow graphs. Provided here is a basic visualization of what I'm talking about with a control flow graph, provided by Binary Ninja, which has wonderful visualization tools, hence this photo and not our engine, because we don't build very many visualization engines. 
But you basically have a function that's broken up into basic blocks, which are broken up into instructions, and then you have basic flow between them. Having this as an iterable structure allows us to walk over every single instruction, understand the references, understand where code and data are being referenced and how, and then what type of functionality is being used. So this is a great way to find something like whether or not your stack guards are being applied on every function that needs them. How deep are they being applied? And is the compiler possibly introducing errors into your armoring features? Which are interesting side studies. Another reason we did this is because we want to push the concept of what counts as an observable even farther. Take this example: you want to be able to make instruction abstractions. Say that for all major architectures you can break instructions up into major categories, be it arithmetic instructions, data manipulation instructions like loads and stores, and control flow instructions. Then with these basic fundamental building blocks you can make artifacts. Think of them like a unit of functionality: it has some type of input, some type of output, and performs some type of operation. And then you can link these little units of functionality together. Think of these artifacts as maybe sub-basic-block, or crossing a few basic blocks. It's a different way to break up the software, because a basic block is just a branch break, but we want to look at functionality breaks, because these artifacts can provide the basic fundamental building blocks of the software itself. This becomes more important when we want to start doing symbolic lifting, so that we can lift the entire software up into a generic representation that we can slice and dice as needed. Moving on from there, I want to talk about fuzzing a little bit more. Fuzzing is effectively at the heart of our project. 
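The instruction-abstraction idea above can be sketched with a made-up mnemonic table (a real engine would drive this from full disassembly, per architecture): instructions map to generic categories, and "artifacts" are the runs between functionality breaks.

```python
# Illustrative mnemonic table only; a real table covers each architecture.
CATEGORY = {
    "add": "arith", "sub": "arith", "mul": "arith", "xor": "arith",
    "mov": "data",  "ldr": "data",  "str": "data",  "push": "data",
    "jmp": "flow",  "je": "flow",   "call": "flow", "ret": "flow",
}

def abstract(mnemonics: list[str]) -> list[str]:
    """Lift architecture-specific mnemonics to generic categories."""
    return [CATEGORY.get(m, "other") for m in mnemonics]

def artifacts(mnemonics: list[str]) -> list[list[str]]:
    """Split a run of instructions at functionality (control-flow) breaks."""
    out, cur = [], []
    for m, cat in zip(mnemonics, abstract(mnemonics)):
        cur.append(m)
        if cat == "flow":
            out.append(cur)
            cur = []
    if cur:
        out.append(cur)
    return out

trace = ["push", "mov", "add", "je", "mov", "sub", "ret"]
# artifacts(trace) -> [['push', 'mov', 'add', 'je'], ['mov', 'sub', 'ret']]
```

Splitting at control-flow instructions rather than at branch targets is what makes these units "functionality breaks" instead of basic blocks.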
Fuzzing provides us the rich data set that we can use to derive a model. It also provides us awesome other metadata on the side. But why? Why do we care about fuzzing? Why is fuzzing the metric around which you build an engine, or build a model, that you derive some type of reasoning from? Think of the sets of bugs, vulnerabilities, and exploitable vulnerabilities. In an ideal world, you'd want a machine that just pulls out exploitable vulnerabilities. Unfortunately, this is exceedingly costly, because of the series of decision problems that sit between these sets. So instead consider the superset of bugs, or faults. A fuzzer, or other software, can easily recognize faults. But if you want to move down the sets, you unfortunately need to jump through a lot of decision hoops. For example, if you want to move to a vulnerability, you have to understand: does the attacker have some type of control? Is there a trust boundary being crossed? Is this software configured in the right way for this to be vulnerable right now? There are human factors here that are not deducible from the outside. This decision problem gets amplified even further going to exploitable vulnerabilities. So if we collect the superset of bugs, we know that some proportion of the subsets is in there, and this gives us a data set that is easily recognizable and that we can collect in a cost-efficient manner. Finally, fuzzing is key, and we're investing a lot of our time right now in a new fuzzing engine, because there are some key things we want to do. We want to be able to understand all of the different paths the software could be taking. As you're fuzzing, you're effectively driving the software down as many unique paths, while referencing as many unique data manipulations, as possible. So if we save off every path and annotate the ones that are faulting, we now have this beautiful, rich data set of exactly where the software went as we were driving it in specific ways. 
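The save-every-path idea can be sketched with a toy target and a dumb mutator. The planted bug, the branch labels, and the one-byte mutation strategy are all hypothetical, but the records that come out, (input, path, faulted) triples, have the shape of the data set being described.

```python
import random

def target(data: bytes, path: list) -> None:
    # Toy program with a planted bug; it records which branches it takes.
    if data and data[0] == ord("C"):
        path.append("b1")
        if len(data) > 1 and data[1] == ord("!"):
            path.append("b2")
            raise RuntimeError("crash")  # the planted bug
    path.append("exit")

def fuzz(seeds, iterations=200, rng=None):
    rng = rng or random.Random(1234)  # seeded so runs are repeatable
    corpus = list(seeds)
    runs = []  # (input, path, faulted): the "rich data set"
    for _ in range(iterations):
        data = bytearray(rng.choice(corpus))
        if data:  # dumb mutator: overwrite one random byte
            data[rng.randrange(len(data))] = rng.randrange(256)
        data = bytes(data)
        path, faulted = [], False
        try:
            target(data, path)
        except RuntimeError:
            faulted = True
        seen = [p for _, p, _ in runs]
        runs.append((data, path, faulted))
        if faulted or path not in seen:
            corpus.append(data)  # keep inputs that reached new paths
    return runs

runs = fuzz([b"C!"], iterations=50)
faulting = [r for r in runs if r[2]]  # the annotated fault records
```

Feeding the saved paths back into the static analyzer is then just a matter of mapping each recorded branch label onto the recovered control flow graph.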
Then we feed that back into our static analysis engine and begin to generate those instruction abstractions, those artifacts. And with that, imagine we have these gigantic traces of instruction abstractions. From there, we can begin to train the model to explore around the fault location and try to study the fundamental building blocks of what a bug looks like, in an abstract, instruction-agnostic way. This is why we're spending a lot of time on our fuzzing engine right now. Hopefully soon we'll be able to talk about that more, maybe in a tech track and not the policy track. Yeah. "So from then on, when anything went wrong with the computer, we said it had bugs in it." All right. Well, I promised you a technical journey, a technical journey into the dark abyss, as deep as you wanna get with it. So let's wrap it up and bring it back up a little bit here. We've talked a great deal today about some theory. We've talked about development and our tooling and everything else. And so I figured I should end with some things that are not in progress, but in fact are done, yesterday's news, just to go ahead and share that here with Europe. So in the midst of all of our development, we have been discovering and reporting bugs. Again, this is not our primary purpose, really, but you can't help but do it. You know how computers are these days. You find bugs just for turning them on, right? So we've been disclosing all of that. A little while ago at DEF CON and Black Hat, our chief scientist Sarah, together with Mudge, dropped this bombshell on the Firefox team: for some period of time, they had ASLR disabled on OS X. When we first found it, we assumed it was a bug in our tools. When we first mentioned it in a talk, they came to us and said it's definitely a bug in our tools, or might be, or some level of surprise. And then people started looking into it. 
And in fact, at one point it had been enabled and then temporarily disabled. No one knew. Everyone thought it was on. It takes someone looking to notice that kind of stuff, right? Major shout-out, though: they fixed it immediately, despite our full disclosure on stage and everything. So, very impressed. But in addition to popping surprises on people, we've also been doing the usual process of submitting patches and bugs, particularly to LLVM and QEMU. And if you work in software analysis, you can probably guess why. Incidentally, if you're looking for a target to fuzz, if you wanna go home from CCC and find a ton of findings, LLVM comes with a bunch of parsers. You should fuzz them. You should fuzz them. And I say that because I know for a fact you are gonna get a bunch of findings. And it'd be really nice, I would appreciate it, if I didn't have to pay people to fix them. So if you wouldn't mind disclosing, that would help. But besides these bug reports and all these other things, we've also been working with lots of others. Sarah gave a talk earlier this summer about these things, and she presented findings comparing some of these base scores across different Linux distributions. And based on those findings, there was a person on the Fedora red team, Jason Callaway, who sat there, and I can't read his mind, but I'm sure he was thinking to himself, golly, it would be nice to not be surprised at the next one of these talks. They score very well, by the way. They were leading in many of our metrics. Well, in any case, he left Vegas, he went back home, and he and his colleagues have been working on essentially re-implementing much of our tooling so that they can check the stuff that we check before they release. Before they release. Looking for security before you release. So that would be a good thing for others to do, and I'm hoping that that idea really catches on. Yeah, yeah. Right? That would be nice. 
That would be nice. But in addition to that, our mission really is to get results out to the public, and in order to achieve that, we have broad partnerships with Consumer Reports and the digital standard. Especially if you're into cyber policy, I really encourage you to take a look at the proposed digital standard, which encompasses the things we look for and so much more: user data, traffic in motion, cryptography, update mechanisms, and all that good stuff. Where we are and where we're going, the big takeaways here, if you're looking for that "so what," three points for you. One, we are building the tooling necessary to do larger and larger studies regarding these surrogate security scores. My hope is that in the not-too-distant future, my colleagues and I will be able to publish some really nice findings about which things you can observe in software have a suspiciously high correlation with the software being good, right? Nobody really knows right now. It's an empirical question. As far as I know, the study hasn't been done. We've been running it on a small scale; we're building the tooling to do it on a much larger scale. We are hoping that this winds up being a useful field in security as a technology development. In the meantime, our static analyzers are already making surprising discoveries. Hit YouTube and look for Sarah Zatko's recent talks at DEF CON and Black Hat. Lots of fun findings in there, lots of things that anyone who looked would have found, lots of that. And then lastly, if you are in the business of shipping software, and you are thinking to yourself, okay, so these guys, someone gave them some money to mess up my day, and you're wondering, what can I do to not have my day messed up? One simple piece of advice. One simple piece of advice: make sure your software employs every exploit mitigation technique Mudge has ever or will ever hear of. 
And he's heard of a lot of them. Turn all those things on. And if you don't know anything about that stuff, if nobody on your team knows anything about that stuff... why, I don't even know why I'm saying this. If you're here, you know about that stuff. So do that. If you're not here, then you should be here. Danke, danke. Thank you, Tim and Parker. Do we have any questions from the audience? It's really hard to see you with that bright light in my face. I think the signal angel has a question. So the IRC channel was impressed by the tools and the models that you wrote, and they are wondering what's going to happen with that, because you do have funding from the Ford Foundation now. So what are your plans with this? Do you plan on commercializing it, or is it gonna be open source, or how do we get our hands on this? It's an excellent question. So for the time being, the money that we are receiving is to develop the tooling, pay for the AWS instances, pay for the engineers, and all that stuff. As for the direction we would like to take things as an organization: I have no interest in running a monopoly. That sounds like a fantastic amount of work, and I really don't want to do it. However, I have a great deal of interest in taking the gains that we are making in the technology and releasing the data, so that other competent researchers can go through and find useful things that we may not have noticed ourselves. So we're not at a point where we are releasing data in bulk just yet, but that is simply a matter of engineering. Our tools are still in flux, and when we do release data, we want to make sure it is correct, so our software has to have its own low bug counts and all these other things. But ultimately, there is the scientific aspect of our mission, though the science is not our primary mission; our primary mission is to apply it to help consumers. 
At the same time, it is our belief that an opaque model is as good as crap. No one should trust an opaque model. If somebody tells you that they have some statistics, and they do not provide you with any underlying data, and it is not reproducible, you should ignore them. Consequently, what we are working towards right now is getting to a point where we will be able to share all of those findings: the surrogate scores, the interesting correlations between observables and fuzzing. All of that will be public as the material comes online. Thank you. Thank you. Thank you. And microphone number three, please. Hi, thanks. This is some really interesting work you presented here. So there's something I'm not sure I understand about the approach that you're taking. If you were evaluating the security of, say, a library function or the implementation of a network protocol, there'd be a precise specification you could check that against, and the techniques you're using would make sense to me. But it's not so clear, since the goal that you've set for yourself is to evaluate the security of consumer software. It's not clear to me whether it's fair to call these results security scores in the absence of a threat model. So my question is: how is it meaningful to claim that a piece of software is secure if you don't have a threat model for it? This is an excellent question. And anyone who disagrees, well, they would be wrong. Security without a threat model is not security at all. It's absolutely a true point. So the things that we are looking for, most of them are things that you will already find present in your threat model. For example, we are reporting on the presence of things like ASLR and lots of other things that get to the heart of the exploitability of a piece of software. 
So for example, if we are reviewing a piece of software that has no attack surface, then it is canonically not in the threat model, and in that sense it makes no sense to report on its overall security. On the other hand, if we're talking about software like, say, a word processor, a browser, anything on your phone, anything that talks on the network: if we're talking about those kinds of applications, then I would argue that exploit mitigations and the other things that we are measuring are almost certainly very relevant. So there's a sense in which what we are measuring is the lowest common denominator among what we imagine are the dominant threat models for these applications. It's a hand-wavy answer, but I promised heuristics, so there you go. Thanks. Thank you. Any questions? No raised hands? Okay, then the herald can ask a question, because I never can. So the question is, you mentioned the security labels earlier; what institution could give out the security labels? Because obviously the vendor has no interest in IT security. Yes, it's a very good question. So, our partnership with Consumer Reports: I don't know if you're familiar with them, but in the United States, Consumer Reports is a major, huge consumer watchdog organization. They test the safety of automobiles, they test lots of consumer appliances, all kinds of things, both to see if they function more or less as advertised, but most importantly, they're checking for quality, reliability, and safety. So our partnership with Consumer Reports is all about us doing our work and them publishing it. And so, for example, the televisions that we presented the data on: all of that was collected and published in partnership with Consumer Reports. Cool, thank you. Thank you. Any other questions? From the stream, are there any? Well, in that case, people, thank you. Thank Tim and Parker for their nice talk, and please give them a very, very warm round of applause. Thank you. Thank you.