So, as he said, I'm Sarah Zatko, and I'm going to be talking about the interesting tidbits from our last year's worth of research, then introducing everyone to the work we've been doing with Consumer Reports and talking about how you can get involved.

For those who don't know, what is CITL? It's a non-profit research organization focused on being something like Consumer Reports, but aimed specifically at software risk and safety. Some people get confused and think we only evaluate security software; we evaluate all software. That includes desktop binaries and things pulled off of IoT devices, which is what we're getting into now, and there'll be a little bit about that later. Anything that's a binary. The goal is that we test things and then publish results, just like Consumer Reports does. Where we are in that process right now: the testing part is pretty solid, and we're working on producing nice-looking, readable reports and on scaling everything up so we can test everything, rather than just a product here or there. When you want to cover an entire industry, it requires more scale than you initially set up for a proof of concept. We're also building up our partnerships for publishing, because we're not publishers; we'd rather give data to people like Consumer Reports and other organizations and let them disseminate the information to their audiences. That's the sort of thing we're working on right now. The end goal, of course, is to make the software industry safer and easier for people to navigate. People who care to make the right decisions about minimizing their risk have no clue how to do it, through no fault of their own, but because of an entire lack of usable information. That's the problem we're looking to solve in the long term.

Our software testing right now is static analysis, and ours is different from most people's. We look at which application armoring features are there and how well they're implemented, which functions are called and which of those are historically risky or just straight-out bad, which libraries are linked and what the features of those libraries are, and the complexity of the code. We can measure all of that to our satisfaction, but there's a problem when you want to take the 100 or 200 features you extracted about a binary and turn them into a single score, like "this got a 75 out of 100," because it's hard to tell what each thing is worth. Are stack guards worth 15 points, or 10, or 20? I don't know.
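To make that scoring problem concrete, here is a minimal sketch of the kind of aggregation in question. The feature names, weights, and penalty are invented purely for illustration; they are not CITL's actual model, and choosing defensible weights is exactly the open question.

```python
# Illustrative only: hypothetical features and weights, not CITL's real scoring model.

# Per-binary features a static analyzer might extract (booleans and counts).
features = {
    "aslr": True,             # position-independent, so ASLR can apply
    "stack_guard": True,      # stack canaries present
    "heap_dep": False,        # heap data execution prevention
    "fortify_source": True,   # fortified libc calls observed
    "bad_function_count": 3,  # calls to functions like gets() or strcpy()
}

# Hypothetical weights -- are stack guards worth 10, 15, or 20 points? Unknown without efficacy data.
weights = {"aslr": 20, "stack_guard": 15, "heap_dep": 10, "fortify_source": 10}
PENALTY_PER_BAD_FUNCTION = 2

def score(features, weights):
    """Collapse many extracted features into one headline number."""
    total = sum(w for name, w in weights.items() if features.get(name))
    return total - PENALTY_PER_BAD_FUNCTION * features.get("bad_function_count", 0)

print(score(features, weights))  # 39 for the sample values above
```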
Part of why I don't know is that there haven't been many studies looking at how effective the safety measures we trust actually are. In most other industries, that sort of data would be so fundamental you could take for granted that it existed: somebody comes up with their bright new idea, publishes how and why it works, and we hope one day everyone will be using it, but before you get to universal adoption there are clinical trials, or small-scale deployments, and studies to see whether it did what we thought it would do and how well. The security industry doesn't do that. I hadn't realized what a weird blind spot that was until I needed the data for this effort, and then I went: wait, we don't do this at all.

It's sort of mind-blowing when you realize this is an entire gap, but it makes sense, because that work isn't as sexy or exciting, and there are enough other problems that people just don't do it. So that's what we're focusing on in the immediate future. We have a lot of static analysis data, and we're putting together a lot of dynamic analysis and fuzzing data to go with it, so that we can do those studies about how impactful the different elements we look for actually are and how much they should affect a final score. That way, when we do finally publish results, there will be numbers we're ready to get in a fight over: things that are really solid and that we stand behind. And when we publish that corpus of fuzzing data and static analysis data, it will be a big deal for a lot of other people too, because we're not the only ones frustrated by the lack of quantified impact for the things we hope will become industry standards. If somebody, say the FTC, wants to push for universal adoption of a particular safety feature, it's very hard to make that argument without a study saying "this is why you need this." I also believe this gap in the body of existing research is at least partially responsible for the success of snake oil salesmen in our industry, because for the stuff that really has substance and for the snake oil, the same argument is being made: "I'm an expert, this works, trust me." The non-expert doesn't have the sort of data they'd have for any other decision-making process, so both look the same to them, and whichever one is better marketed and prettier looking is the one they'll go for. Snake oil people also usually make broader promises, so people will pick the one that says "this will make you totally secure" over the one that says "yeah, I think this will help." So I'm very excited about the research we're doing right now to build up that corpus of data, and releasing it is what we're hoping to have as our next big thing.

But for now, on to what we have to show you today. As I said, we're ramping up our fuzzing capabilities so we can build the second half of the data we need for our big longitudinal studies, so we're going to look at the really early results coming out of that and talk a little about the fuzzing framework we're building. Then we'll get into the interesting tidbits from our static analysis so far: comparisons of major OSes, a couple of smart TVs, a couple of Amazon AMIs, and then for applications we'll look at browsers on all those major OSes, and then revisit Microsoft Office for OS X, because there's an interesting case study there. Finally, we'll talk about our work with Consumer Reports on the Digital Standard, which, again, I'm very excited about. It's nice to have so much work to do that you really feel it's going to make a difference sometime.

Okay, so: fuzzing. We've fuzzed about 300 binaries so far, which might sound like a lot in some contexts, but given that we've done static analysis on over 100,000, it's not where it needs to be yet. What we're building is a system that can do fuzzing in a fully automated fashion: set up a VM, install the software to be tested, try a bunch of existing test harnesses against its binaries, and if something fits, fuzz it and then do crash triage to identify primitives, all without a human in the loop.
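As a simplified stand-in for that loop, the sketch below shows the general shape of it. The harness-matching helper and the directory layout are hypothetical placeholders, and the real framework does far more (VM provisioning, crash triage, and so on); this just illustrates the breadth-first attitude.

```python
# Sketch only: a stripped-down version of the automated loop, operating on an already
# installed tree of binaries. Harness matching and crash triage are the hard parts and
# are only stubbed out here.
import pathlib
import subprocess

def generic_harness_fits(binary: pathlib.Path) -> bool:
    """Hypothetical: decide whether a generic file-input harness can drive this binary."""
    return True  # placeholder

def fuzz_binary(binary: pathlib.Path, seeds: str, minutes: int = 60) -> int:
    """Run afl-fuzz against one binary for a fixed budget; return the unique crash count."""
    out = pathlib.Path("findings") / binary.name
    try:
        subprocess.run(
            ["afl-fuzz", "-i", seeds, "-o", str(out), "--", str(binary), "@@"],
            timeout=minutes * 60,
        )
    except subprocess.TimeoutExpired:
        pass  # budget spent; whatever landed in crashes/ is the result
    crashes = out / "crashes"  # classic AFL output layout
    return len(list(crashes.glob("id:*"))) if crashes.exists() else 0

def fuzz_everything(install_root: str, seeds: str) -> dict:
    """Breadth over depth: try every executable, skip anything the harness can't drive."""
    results = {}
    for path in pathlib.Path(install_root).rglob("*"):
        if path.is_file() and path.stat().st_mode & 0o111 and generic_harness_fits(path):
            results[str(path)] = fuzz_binary(path, seeds)
    return results
```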
When most people are fuzzing, their goal is to hit a particular target. What we want is breadth: as many different kinds of binaries as possible. So if something doesn't work, we just ditch it and move on to the next thing. That means we can build up a larger data set, though one that probably looks a little idiosyncratic to somebody who had a different goal in mind. To do this we're customizing AFL and Triforce QEMU, and also borrowing some pieces from the Cyber Grand Challenge. Put together, that's called CITL Fuzz, very creatively; maybe we'll have a better name for it somewhere down the road. Names aren't exactly my specialty, as you may have gathered from the talk title. So far we've fuzzed about 200 binaries with AFL and about 100 with CITL Fuzz.

The CITL Fuzz results aren't a terribly rich data set at the moment, but one interesting thing you can do is look at the packages where we fuzzed multiple binaries and see how those packages did as a whole. What we have here is package name, number of binaries, how many of those crashed, that as a percentage, and the total number of unique crashes for the package. The winners are libc-bin and cook, with five and four binaries respectively and zero crashes; those are the biggest sets of binaries with no crashes found. The runner-up would be Canna, which had a handful of crashes across its six binaries but is still better than anything else in its range of binary counts. The losers were biosquid, elfutils, and cdftools. The first two had a 100 percent crash rate across the eight binaries tested, and biosquid had several hundred crashes per binary, so it was definitely the biggest loser of our tests so far.

For the AFL data we had static analysis to go with it, so we can start doing some really preliminary data science of the type we'll be doing more and more of as the data set gets richer. Right now we're only getting the numbers of unique crashes and the signals; later on we'll have primitives identified, and then we can do more complex math magic. For now, we took the numbers of unique crashes and tried correlating them with risky and bad functions. Risky and bad functions mean what you'd think: they're the functions somebody looks for when they're bug hunting, because they know people screw them up a lot, and bad is worse than risky. The highest correlations we found were for execlp and execv, which had a pretty high correlation with the number of crashes. This is, again, really early days on a pretty small data set, but it's an exciting early tidbit and tells us there's going to be some really interesting results once we get this fully up and running.
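As a worked illustration of that correlation step, with made-up numbers rather than CITL's data, the computation is just a per-binary pairing of crash counts with risky-function counts:

```python
# Made-up numbers purely to illustrate the computation; not CITL's data.
from statistics import correlation  # Pearson's r, Python 3.10+

# One entry per fuzzed binary: unique AFL crashes, and how many call sites the static
# analysis found for a couple of risky/bad functions.
unique_crashes = [0, 2, 7, 1, 0, 14, 3]
execlp_calls   = [0, 1, 3, 0, 0, 5, 1]
strcpy_calls   = [2, 0, 1, 4, 1, 2, 0]

print("execlp vs crashes:", round(correlation(unique_crashes, execlp_calls), 2))
print("strcpy vs crashes:", round(correlation(unique_crashes, strcpy_calls), 2))
```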
So, on to our static analysis data, where we have a lot more to work with. First up: operating systems. An operating system has thousands, maybe tens of thousands, of binaries, and each binary gets a score, so the way we show where an entire environment stands is a histogram of all the scores in that operating system.

You can't see the numbers here, because I wanted to fit a lot on one slide, but you don't really have to; you just need to see the shape of the distribution. The X axes are all the same, going from negative 10 to 110. Low numbers are bad; they mean softer targets. High numbers are good, so the further to the right your distribution sits, the better off you are. The Y axes are not the same on all of these, because that made things unreadable, but that isn't really important for the story here; what we're looking for, again, is the shape. What we've called out are the 5th, 50th, and 95th percentile marks, because that lets you see how long the tails are in either direction. We particularly care about the left tail, because that's the low-hanging fruit in your environment: the longer that tail gets, the more soft targets you have lying around and the less you've done to clean them up.

Windows 10 actually does very well, which you might expect if you've seen the new safety features and improvements that went into that release. The 5th percentile mark is pretty close to the 50th, and they're really very consistent in the application armoring features they include, which is why the distribution looks so different from all the others: it's a lot more uniform in its safety practices, so you end up with a lot more binaries in the same bin. That biggest bin is the 65-to-70 bin, and there are 5,500 binaries in it. That's the most consistent off-the-shelf use of safety features we've found so far. OS X El Capitan has scores in generally the same region of the chart as Windows, but the 5th percentile mark has moved a lot further down: it has a lot of the same safety features, but they're not as consistently applied, and there's a lot more low-hanging fruit lying around. The percentiles all move down a bit; the 50th is almost the same, but the thing we care about most, the 5th percentile mark, went down by 14 points. On Linux, the 5th percentile mark moves noticeably further down the scale again, and the distribution sits a little lower as a whole. This isn't any sort of surprise to most bug hunters, because it correlates exactly with the prices for exploits in these environments. And that makes sense, because we're trying to assess the cost to newly exploit software: the harder something is to exploit, the more you get paid for it. It's nice to see that the Pwn2Own-style payouts for exploits in these environments correlate so closely with our results; it's good confirmation that we're measuring what we want to measure.
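For what it's worth, the percentile callouts on those histograms are just order statistics over the per-binary scores. A toy version, with random stand-in scores rather than real data, looks like this:

```python
# Toy illustration of the 5th/50th/95th percentile callouts; scores are random stand-ins.
import random
from statistics import quantiles

scores = [random.gauss(60, 15) for _ in range(5000)]  # pretend per-binary scores for one OS

pct = quantiles(scores, n=100)            # cut points for the 1st..99th percentiles
p5, p50, p95 = pct[4], pct[49], pct[94]
print(f"5th: {p5:.0f}   50th: {p50:.0f}   95th: {p95:.0f}")
# A long gap between the 5th and 50th marks means a long left tail: lots of soft targets.
```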
The last bit of data we have for the standard OSes is that we recently got some data on Sierra, so we can compare Sierra and El Capitan and see what improvements have been made. Sierra got a bit bigger; no surprise there, everything always gets bigger. ASLR's adoption rate on El Capitan was already pretty high, and it inched a little closer to 100 percent. Heap DEP, heap data execution prevention, one of the exploit mitigation features we look for, was the only safety feature whose use decreased. That's explained by the fact that with most modern OS X compilers, if you compile something as 64-bit, the flag for heap data execution prevention is off, and the increase in 64-bit binaries matches pretty closely with the decrease in heap DEP. This is a worrisome trend, in that any time you have a flag that's off, there's a chance that safety feature will not be used. Some people argue it doesn't matter because most modern OS X environments enable heap data execution prevention by default, but that's not always true; in some cases the system will see that you don't have the flag, decide you're not compatible, and turn it off. Nothing goes wrong if you have the flag on, so have the flag on if you can. I'm not clear on why that's the trend, but it is, and it would be nice if it got turned around sometime soon.

We also look at source fortification. The percentage of binaries that were fortified stayed about the same, although the number that were unfortified went down, so: good. You might notice these numbers don't add up to 100 percent. That's because the way we tell whether source fortification happened is that we look for the unfortified versions of the relevant functions and we look for the fortified versions, and if a binary has neither, we don't know what flags it was built with, and it doesn't really matter, because the flag made no difference there.

The last bit is looking at what percentage of binaries had functions from our different categories. Good, bad, risky, and ick all mean exactly what you think. Ick is just the couple of functions that no one should ever use in commercial code; it's only two or three functions, so you have to be special to get in there. Here again the story was pretty positive. Bad and risky both went down by about ten percent, and good was actually the biggest increase we saw across the board, going from eight percent to twenty-five percent of binaries containing good functions, which was a really pleasant surprise.
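That fortification check, by the way, amounts to looking for the fortified `_chk` variants versus the plain versions of a handful of libc functions among a binary's imports. A rough sketch of the idea for an ELF/Linux binary follows; it assumes readelf is installed, the function list is just a sample, and it's a simplification rather than CITL's actual tooling (the same idea applies to Mach-O with nm or otool).

```python
# Rough sketch of the fortification check: does the binary import __*_chk variants,
# plain variants, or neither? ELF/readelf shown here; a simplification, not CITL's tooling.
import subprocess
import sys

FORTIFIABLE = ["strcpy", "sprintf", "memcpy", "printf", "read"]  # sample list only

def fortification_status(path: str) -> str:
    syms = subprocess.run(["readelf", "--dyn-syms", "-W", path],
                          capture_output=True, text=True).stdout
    fortified   = any(f"__{name}_chk" in syms for name in FORTIFIABLE)
    unfortified = any(f" {name}@" in syms for name in FORTIFIABLE)
    if fortified and not unfortified:
        return "fortified"
    if unfortified:
        return "unfortified (at least in part)"
    return "no fortifiable calls -- the flag is irrelevant for this binary"

print(fortification_status(sys.argv[1]))
```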
Next I'm going to show a chart of the impact of corporate policies for software installation, so bear with me while I explain it. The blue bars, the ones in the middle row, are the base installation for large-ish enterprise organizations, a thousand-plus employees: the base OS X install for new employees, the clean installation with everything that's on it when they hand it to the employee. The little gray bars that you can hardly see in front of the blue bars are the executables from that clean installation that employees actually use, the ones that got executed. And the giant orange bars are the executables that get run once the company gives employees the right to install whatever software they want. So if you're looking at the live attack surface presented by your organization, the comparison is those gray bars versus those orange bars. Obviously you can't have just the tiny bars; you have to allow people to install some things. But being judicious about what you allow matters, because if you aren't, you're increasing your attack surface by a sizable amount, and the low-hanging fruit in particular increases a great deal: the negative scores for the clean install stop at negative 10, but on the dirty install they go down to negative 35. You're lengthening your lower tail by a lot. Again, that's not a surprise to anyone, but nobody had quantified it and shown somebody a chart of "here's what you're doing with this policy."

Okay, next we're going to look at a few different Linux distributions. The first one is the same Linux you saw earlier; this is what you get off the shelf with Ubuntu 16. The one below it is an example of what you get when you take that off-the-shelf install and make a modest hardening effort: no code modification, you just take out the stuff you obviously don't need and recompile with all the modern safety features enabled that won't get in the way of normal operation. This is a very reasonable thing to expect from any major vendor shipping something with a Linux environment. Obviously that's not what they do right now, because nobody makes even that modest effort if no one is going to notice. What you actually see if you take apart a smart TV and pull off the binaries is the two distributions on the bottom: a Samsung smart TV and an LG smart TV. These are worse than off-the-shelf Linux; the distributions sit lower and the scores are lower overall, because they're missing a lot of fundamental safety features. They're smaller, about 4,000 and 2,000 binaries respectively, but they're clearly not doing the things a modern Linux environment does from a safety perspective. For comparison, Amazon's Linux AMIs are the same size, both around 2,000 binaries, so it's a very comparable footprint, and this is what happens when you care about the security of your Linux environment: the distributions are much higher and the low-hanging fruit has been taken out. The scores on the other ones went well into the negative numbers, but here the minimum scores are 40 and 30 respectively. That's a nice accomplishment, and it shows it's entirely possible to ship a full Linux distribution in the footprint these IoT devices have while doing everything more correctly.

To see why those smart TV scores are as low as they are, here's some data similar to what we saw for the OS X material, showing all the safety features the smart TVs don't have. I've highlighted the bad and really bad numbers in light orange and light red respectively, very technical terms. And I'm comparing against the hardened Linux, because they should have made some effort to harden; that's the bar they should be measured against. ASLR is lower than the standard, and the standard these days is really pretty close to 100 percent. RELRO, a Linux-specific application armoring feature that's pretty common now, was almost non-existent. Stack guards and fortification were basically non-existent on the LG; they were present on the Samsung, but at much lower rates than the industry standard. And all of that is reflected in their scores.
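For context, those features mostly correspond to well-known GCC and linker options, things like -fstack-protector-strong, -D_FORTIFY_SOURCE=2, -Wl,-z,relro,-z,now, and -pie/-fPIE, and their footprints are visible in the binary. Here's a rough checksec-style sketch of those checks for an ELF binary; it assumes readelf is available and is a simplification, not CITL's actual tooling.

```python
# Rough checksec-style sketch for an ELF binary; a simplification, not CITL's tooling.
import subprocess
import sys

def readelf(option: str, path: str) -> str:
    return subprocess.run(["readelf", option, "-W", path],
                          capture_output=True, text=True).stdout

def armoring_features(path: str) -> dict:
    header   = readelf("-h", path)           # ELF header
    dyn_syms = readelf("--dyn-syms", path)    # dynamic symbol table
    segments = readelf("-l", path)            # program headers
    return {
        "ASLR/PIE":    "DYN (" in header,               # ET_DYN executables can be relocated
        "stack guard": "__stack_chk_fail" in dyn_syms,   # canary failure handler referenced
        "RELRO":       "GNU_RELRO" in segments,          # read-only relocation segment present
        "NX stack":    "GNU_STACK" in segments and "RWE" not in segments,
    }

for feature, present in armoring_features(sys.argv[1]).items():
    print(f"{feature:12} {'yes' if present else 'NO'}")
```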
So that's it for the full-environment views of operating systems. Now we're going to use those histograms as the context for the scores of individual applications. This is the same histogram we had for OS X earlier, only now it's showing where the different application scores fall, and below it we've got our hardening line, soft targets to the left, harder targets to the right, with specific applications of note called out along it.

For the OS X browsers, the Firefox score was not great, Safari was middle of the pack, and Chrome did decently; not amazing, but decently. Maybe not an earth-shattering result for some people, but again, it's nice to quantify things. For Linux, just to have a third one to throw in, we did Opera, and they're clearly a little further behind the times on some of the safety features. Chrome and Firefox presented an interesting case. First I'm showing the scores they got for the main executable alone, not taking into account the libraries they call, because viewed that way they got the same score. Granted, the one bit of static analysis we're still working on is our evaluation of sandboxing, and when that's finished Chrome will probably gain a few points and pull a little further ahead of Firefox, but right now they're neck and neck. Except when you bring libraries into account, where Firefox falls 15 points behind Chrome. And this is important to point out, because this was Firefox's home turf; this was their chance to shine. If you bring in libraries that are scoring negative 10s, that's going to be reflected on your permanent record. A lot of times, when people are figuring out which libraries to use, they think first about functionality, which is admittedly very important, and then about the license. But third place should go to its security stance. People fall into the trap of "not my code, not my problem," but it's code you're linking into your product and choosing to execute when your customer runs your product, so you should care about it. Mini rant over; on to Windows browsers.

Here we looked at Firefox, Chrome, and Edge, and again Chrome will probably gain a few points once we have the sandboxing evaluation in place. But for now, Edge, being on its home turf, got a near-perfect score; slightly better than perfect, actually, because they got bonus points for a couple of things very few people do. No huge surprise there. Firefox is again away from its home turf, so it fell down a bit. If we were going to pick one of these Firefox binaries to count as the Firefox binary, we would pick the 32-bit one, because that's what we got when we went to their main website and clicked download. It wasn't explicitly labeled as 32-bit, but when we checked, that's what it was, and it took a bit of hunting to find the 64-bit version, which your average consumer is not going to do. You get credit for what you make available. And again, the Pwn2Own payouts for exploits match up pretty closely with the rankings we'd give these applications: Microsoft Edge and Google Chrome at the top, Safari in the middle, and Firefox relatively cheap.

The last bit of static analysis data we're looking at is Microsoft Office for OS X. Last year we beat up a bit on Office 2011, and people rightly asked, okay, what about 2016? So we bought that and tested it, and it presents an interesting story. Here are those bad 2011 scores we showed last year: in particular, their auto-updater scoring a seven was pretty embarrassing, and the average across all binaries in that package was around a 16. Just not a good showing all around.
But when you look at Office 2016, they really did much better. The average binary score increased by 60 points, and the auto-updater went from a seven to a 64. When a new Office suite comes out, I know I'm not in a big rush to get it, because all the features I need are already in the version I'm using, and I know all that's going to happen is they'll have moved a bunch of buttons on me and it'll take me a week to be productive again. So it's not one of my priorities, but sometimes there are more hidden improvements, and this is an example of that. If the people responsible for purchasing decisions at a large organization had this sort of data available, they could say: okay everyone, sorry, you'll have to find the new button locations, we're upgrading.

So, that's the data we've collected. Now on to the work we've been doing with Consumer Reports, in particular the Digital Standard, which launched earlier this year; you can go to this website to see the first draft we put together. What this is: Consumer Reports, recognizing the need for software evaluation in their own product reviews, asked us to be part of a group they're forming. The other groups are focused on digital rights and privacy, one is focused more on the governance side, there's data sharing; together they're trying to take a pretty broad view of consumer digital rights. Together we put together this first draft of the Digital Standard, which is our first step toward a testing standard for how you would evaluate software or IoT products from a digital-rights perspective. We've done a couple of initial rounds of testing based on the standard, including the browsers and smart TVs you saw earlier in the talk. All the other organizations tested the same devices, but since the full report hasn't been published yet, I'm only sharing our side of it, because that's the part we're allowed to share.

If you go to the website and look at the Digital Standard, this is what you'll see. Everything gets more technical the further to the right you go. The leftmost column is the name of a particular criterion, like "vulnerability disclosure program." After that is a layperson-readable description of what that criterion means. After that is a slightly more technical description of the indicators we check for when evaluating that criterion, and after that is a brief description of the test procedure. If something has a little green check mark by it, like this one does, that means we're pretty happy with it; it's fully baked, and we feel we have a test procedure in place that scales well enough for Consumer Reports' needs or a similar kind of effort. If something has a little yellow flask, like the next one, that usually means we can test it for a single product, but we don't think it scales well enough to cover an entire product vertical; there's a test procedure there, but it needs work to become more automated. And if something has a little red exclamation point, that means none of the groups in this coalition currently has an in-house capability to test it, which is why that column is blank.
If you go to the site, all of this is linked to the GitHub page, so if you read a criterion or a description and think we didn't word it right, or that we're barking up the wrong tree, you can click through to GitHub, comment, and propose a change. And if you have some way of turning one of our orange or red icons into a little green check mark, let us know; we'd love input on how to make the overall procedure more complete. There was some question about why we include things we don't have the ability to test. It's because we think those things are important, and an evaluation of a product would be incomplete without them; the fact that we don't have the in-house capability to test them right now doesn't change that. And in this case, by "us" I mean all the groups that are part of the Digital Standard, not just CITL; we're only responsible for a couple of these criteria, thankfully. So if you're interested in the Digital Standard and want to be part of it, I encourage you to go to the website, get on GitHub, and look at the places where we're missing something and think about whether that could be a new research project. And if you have any follow-up for CITL, our contact info is at the top here. So, thank you very much. Are we doing questions? I can do ten minutes of questions.

We're a non-profit, so we're figuring out a way to support the organization while sharing as much data as we can, but public service is the main goal. Sorry, I forgot to repeat that question; he wanted to know what the business model is.

He wanted to know whether we're looking at historically impactful vulnerabilities in those binaries. We're not looking at historical data right now, because we've got more than enough present-day data, and we're using the fuzzing data to compare against. When we publish our corpus of all the static analysis data and fuzzing data, somebody else would be welcome to do that study; it's just a so-many-hours-in-the-day kind of deal. Anyone else? It's a little hard to see the audience. You. Could you say that louder?

So, are we looking at dependencies for software? If it's a library the product links to, then we look at that, and then at the libraries it links to; we go down that whole tree. If it's some more indirect dependency that we can't see through library linkage, and I don't know what that would be, then we wouldn't find it. But yes, we look at the whole tree of library dependencies.

Anyone else? Are we all good? Okay, we're all good. Oh wait, you have a question? I'm not currently on Twitter. What? Yeah, we're in between DARPA officers. Okay, so everyone's good. Then have a great DEF CON.