Hey, everyone. Thanks for coming. My talk is on estimating security risks through repository mining. My name is Tomasz Lange, I work at Intel. I maintain Xen and a bunch of tools built on top of Xen, and I usually spend my time fuzzing, pen testing, and doing secure architecture code review. So with this project I'm branching out from what I usually do.

Briefly, what we'll go through in this presentation: the motivation for why I'm doing this, the problem statement and what we are planning to test, the experimental design and tools, results, threats to validity, discussion, summary. Pretty standard stuff.

So to begin, why do we want to estimate security risk? I'm pretty sure you're all familiar with the XKCD comic. Complexity is increasing. All of the projects we look at, especially at cloud scale, have a ton of open source components, and estimating which one of those components is going to be the next Log4j is super hard, even if you have an SBOM. Even if you have a list of all the components in your project, figuring out which one is going to have an issue is really hard, and doing it manually doesn't scale. As I said, we do secure architecture code review, and it's really hard even to just look through the first layer of dependencies for some of our projects; the dependencies of dependencies usually fall off a cliff.

But to the rescue, there is this new project called OpenSSF Scorecard. How many of you have heard of it? It's pretty new. All right, I see a bunch of fans, that's awesome. You get this nice little badge on your project. It gives you a score: 9.4, awesome, you're good. It even color codes it. So it's really easy to understand, exactly what we needed. Perfect. It does this magic by factoring in a bunch of data from repositories that kind of makes sense. Do you have binary artifacts shipped with your repository? That's bad. Do you have CI tests? Is the project well maintained? Are changes reviewed? How many different contributors does the project have? It all makes sense, right? Not all of these factors are weighted the same in the final score, but all of them kind of make sense. Some might not be as relevant; whether having the CII Best Practices badge on your repository actually says anything about security risk is questionable, but that one is probably not weighted very heavily.

So the problem statement is: does this actually work? How would we know that a project with a high score can actually be considered low risk? There is pretty much no evidence, pro or contra, released with the OpenSSF Scorecard. So what can we do to figure out where we are? If you want to rely on the Scorecard by itself, there are pretty much two options. One is to wait and see: more projects will deploy it, and over time we'll see whether there is any correlation with the vulnerabilities and CVEs that pop up. Or you could check right now, with the current set of projects that have an OpenSSF score and existing CVEs, whether there is a correlation. But that's not great either, because a lot of projects don't bother to file CVEs at all. The ones that do have CVEs will probably correlate more with "does this project have a bug bounty" than with whether it's actually low or high risk. So you would have a selection bias, and we can't really rely on CVEs as an objective metric for testing this. So what else can we do?
So if we can't measure it directly, we need to find a proxy. And if we just think about it rationally: if a project is well maintained, has code reviews, has static analysis and fuzzing, we would reasonably expect it to have fewer bugs. We can measure bugs in repositories using static analysis tools; we have been doing that for C/C++ for many years, so we have pretty good tools. And while bugs don't directly equal security risk, the factors the OpenSSF Scorecard considers should correlate with both, if it actually works.

So let's figure out how to run this test. We'll find the most popular C and C++ repositories on GitHub, run the Scorecard on them (whether or not they've set it up themselves), run static analysis and find a bunch of bugs, and then perform a linear regression analysis to see if there is actually a correlation between the two. Pretty straightforward stuff.

We built some tools to automate this so we don't have to run it manually on our own laptops or systems. I've been a big fan of CI and GitHub Actions, so I figured, hey, let's just run this every month on GitHub Actions. It's not really designed for that, but we can do it, why not? So we search for repositories, run the analysis, collect the results, and publish everything on GitHub Pages at the end. The whole thing runs back to back on GitHub, and you just get the data set at the end. I spent quite a lot of time digging through the GitHub API for GitHub Actions; there are a lot of gotchas and a lot of limitations, and it's a pain in the neck, but it works pretty well now. It's reasonably easy to extend the framework to drop in new checks, so if you have your own favorite static analysis tool or want to do your own analysis, you can plug your checks in here.

What we are running right now is the OpenSSF Scorecard, and for static analysis, Clang's scan-build with the Z3 crosscheck enabled on top, which helps reduce the false positives scan-build can have. I'm also running the clang-tidy cognitive complexity analysis and a bunch of other metadata collection, like lines of code and all the different GitHub metrics you can think of: how many stars the repository has, forks, et cetera. The whole thing is open source; you can go to GitHub right now, fork it, and run your own stuff.

Some limitations on the data collection. Right now we only look at repositories with 400-plus stars, because we wanted to care only about reasonably mature projects. 400 is kind of a number off the top of my head, but it works for us. Dependencies were a pretty big roadblock for static analysis: you have to be able to build these projects, and how do you automatically build a random project off GitHub? The hardest part is finding its dependencies. So I figured, let's just grep for any apt or apt-get line in the repository, and whatever comes after it, we'll just try to install. The whole thing is in Docker, so who cares, we just try to install everything. It takes a while, but hey, it's running on GitHub, I don't care if it takes three weeks, it will finish eventually. Well, there is a six-hour job limit, but it works pretty well. And we support three build systems: Autotools, Meson, and CMake.
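To make that dependency heuristic concrete, here is a minimal sketch of the idea in Python. The function names and the package-name pattern are hypothetical (the real pipeline is in the project's repository on GitHub), but the gist is the same: grep for apt/apt-get install lines and try to install whatever follows, inside a throwaway Docker container where failures don't matter.

```python
import re
import subprocess
from pathlib import Path

# Sketch of the "grep for apt-get install" dependency heuristic (not the real pipeline).
APT_LINE = re.compile(r"\bapt(?:-get)?\s+install\s+(?:-\S+\s+)*(.+)")

def guess_packages(repo_dir: str) -> set[str]:
    """Collect anything that looks like a package name after an apt install line."""
    packages: set[str] = set()
    for path in Path(repo_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for match in APT_LINE.finditer(text):
            # Keep plausible Debian package tokens, drop shell noise like '&&'.
            packages.update(t for t in match.group(1).split()
                            if re.fullmatch(r"[a-z0-9][a-z0-9.+-]*", t))
    return packages

def install_best_effort(packages: set[str]) -> None:
    """Try to install every guessed package; inside a disposable container, failures are fine."""
    for pkg in sorted(packages):
        subprocess.run(["apt-get", "install", "-y", pkg], check=False)
```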
The GitHub API rate limit is a bottleneck. We run all these scans in parallel, but GitHub limits each token to 5,000 requests per hour, which we can hit quite easily with the Scorecard alone. Disk space is also an issue: if a project generates a lot of files, you can run out of it. It doesn't work every time, but it works pretty well.

So as I said, it runs monthly, and out of roughly 4,000 repositories that meet our criteria, we can automatically build about 2,000. That's actually not bad, right? Considering I'm just grepping for apt-get install and installing whatever comes after, being able to automatically build half of the C/C++ repositories is actually impressive; I wasn't expecting that. And you can go grab all the data sets; we publish a summary of how many repositories we built, how many bugs we found, and how many complex functions there are.

So let's look at what we set out to study: does the OpenSSF Scorecard work as a predictor for bugs? And yay, we find that for each point increase in the OpenSSF score, we see a reduction in bugs. Hey, this is fantastic, right? A point increase in the score and a reduction of seven, eight, nine bugs. Each month there's a little variance in the data set, depending on when we timed out on the GitHub API requests, so which projects we manage to build varies a bit month to month, but it's kind of consistent. So hey, awesome, ship it.

But if you look a little closer at this chart, do you actually see any pattern here at all? That red line at the bottom is our linear regression model, and it does not really seem to fit too well. So while the results were statistically significant, that's really just an artifact of the number of observations we have. You have to look at the metric called R-squared to see how well your regression model explains the data, and what we see here is that it absolutely does not explain the data we have. The closer R-squared is to one, the better the model fits the data, and we see that it does not fit at all. So practically no connection between bugs and the OpenSSF score as-is. That's not good; that's not what we expected.

So let's look at some of the sub-metrics, the ones where we would absolutely want to see a correlation, like: is the project maintained? Well, based on the data we have, we actually see an increase in bugs; the better maintained the project, two more bugs. It's statistically significant, but it absolutely does not explain the data. Those numbers are tiny, so don't take anything away from this, it's practically just noise, but it's still funny, right? Same thing for CI: do you have CI? You see an increase in bugs, versus what you would expect, but none of those results are statistically significant, so this just does not make sense at all.

I figured, well, maybe scan-build just isn't a good static analysis tool for this, so let's try some other stuff. I tried Facebook Infer. Again, we see a reduction in bugs with Infer, but it's not statistically significant and the R-squared is bad.
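For anyone who wants to reproduce these regressions on the published data sets, here is a minimal sketch of the kind of fit and R-squared check being described. It assumes a CSV export with one row per repository and hypothetical column names `score` (the Scorecard score) and `bugs` (the scan-build bug count); it is not the project's actual analysis script.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: one row per repository, Scorecard score vs. scan-build bug count.
df = pd.read_csv("monthly_scan.csv")

X = sm.add_constant(df[["score"]])   # simple linear model: bugs ~ const + score
model = sm.OLS(df["bugs"], X).fit()

print(model.params["score"])    # slope: change in bug count per one-point score increase
print(model.pvalues["score"])   # statistical significance of that slope
print(model.rsquared)           # how much of the variance in bug counts the model explains
```

The point the talk makes hinges on that last number: a tiny p-value combined with an R-squared near zero means a "significant" slope that still explains almost none of the data.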
I also tried this other, more experimental tool called BinAbsInspector, which I'm pretty sure none of you have heard of. It's a Ghidra-based reverse engineering framework that runs Z3 on the disassembled binary, so it's not even source code or compiler based; it's this crazy experimental tool, and yeah, none of it makes sense. It has a really high false positive rate, so it's no wonder the data is just bad.

So let's look at some of the other metadata we collected: can we find anything that explains or correlates with bugs? Lines of code was my go-to. All right, the more code you have... It's statistically significant, a tiny increase in bugs, but a pretty bad R-squared; it does not explain why we have bugs, even though you would expect more code to be a good explanation for bugs. Same for comments, no surprise there; it's actually more statistically significant than lines of code, which, okay, sure. Size: again statistically significant, bad R-squared. R-squared is bad all around. If you thought you could rely on the social network side of GitHub, that a project with more stars and more people looking at it is going to have fewer bugs because of that, yeah, don't rely on that. Same for watchers: how many people are watching conversations on a project has no relation to bugs, it's not helping. Forks, issues... issues actually looks pretty good, right? For the number of issues reported, an increase in bugs of 0.46: a statistically significant result, but a bad R-squared, so it's not explaining the bugs we find. None of these makes sense.

So what is going on here? Can we find anything that explains bugs? Interestingly, if you start looking at the number of functions as a metric, we start to see significant results with a relatively decent R-squared. The closer it is to one the better, and hey, 27% is not terrible compared to what we've found so far. Same for the number of cognitively complex functions. Cognitive complexity pretty much just asks: is this function readable by a human? It's actually well defined; there's a paper you can find online that explains how the cognitive complexity score is calculated. For every bad coding practice you get a point, and if you reach the threshold of 25, your function is considered cognitively complex; we only count the ones over that threshold. And it's actually a pretty good estimate: for every 10 cognitively complex functions, you get a bug. So that kind of makes sense; it's significant and it's a decent R-squared.

Interestingly though, the percent of functions that are cognitively complex gives a really good estimate: each percent increase in the share of functions in your code base that are cognitively complex adds a bug. That's a pretty awesome find, right? But on its own it has a bad R-squared. So maybe a plain linear regression model isn't good enough here. Maybe if we do a multiple linear regression, where we combine these variables into a single model, we can see whether the variables remain statistically significant and throw out the ones that don't. What we find is that the percent of functions cognitively complex remains statistically significant; the coefficient goes down a little, to 0.8 bugs for each percent, and the R-squared is the best so far. Same for the number of functions plus the number of cognitively complex functions: pretty good results there too. And if we combine all three of those, everything remains statistically significant, and we see a 0.5 increase in bugs for every percent.
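If you want to gather that cognitively complex function count for your own code base, here is a minimal sketch of counting clang-tidy cognitive complexity warnings. It assumes clang-tidy is on the PATH and a compile_commands.json sits in `build/`; the warning parsing is simplified compared to whatever the real pipeline does, and the check's default threshold of 25 is left as-is.

```python
import re
import subprocess
import sys

# Each function over the threshold produces one clang-tidy warning tagged with this check name.
CHECK = "readability-function-cognitive-complexity"  # default threshold is 25

def complex_function_count(source_file: str) -> int:
    result = subprocess.run(
        ["clang-tidy", f"-checks=-*,{CHECK}", "-p", "build", source_file],
        capture_output=True, text=True,
    )
    return len(re.findall(rf"warning: .*\[{CHECK}\]", result.stdout))

if __name__ == "__main__":
    total = sum(complex_function_count(f) for f in sys.argv[1:])
    print(f"cognitively complex functions: {total}")
```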
So yeah, the percent of functions that are cognitively complex seems to be a pretty good, easy-to-gather data point for any project; you can see where you're at. If you look at this model against the other data we gathered, with Facebook Infer for example, we see a reasonably okay R-squared: 0.11, so about 10% of the variance is explained. Not great, but at least not close to zero. Interestingly, the number of functions is no longer statistically significant there, so it's not a good variable in this model and could be thrown out. So again it's the cognitive complexity that seems to be the key to explaining the bugs Infer finds. As for BinAbsInspector, all of that data is crap, so don't rely on it.

Now here's some fun: let's look at the charts. Number of functions versus bugs, cross-referenced. Can you spot anything that looks weird? That project there with over 60,000 functions, and over a thousand of them cognitively complex... what are they smoking? Here, number of complex functions versus scan-build bugs: we have one with over 14,000 complex functions, so all right, that must be a bug or some very special project. Or the one on the top left: very few complex functions, a ton of bugs. And here's my favorite, on the right: 100% of its functions are cognitively complex. That must be some underhanded C repository that people love; it just has a ton of stars, but it's all C macros. Fantastic.

Anyway, the point is that we have a ton of outliers in this data set. Do we actually want to include those when we're trying to figure out where we are? Interestingly, there is a statistical method for filtering outliers out of your data set: it's called Cook's distance. With it, we can identify 41 repositories out of those 2,000 that have an outsized effect on our model. And if we take just those 41 repositories out, we get an R-squared of 0.37; we jumped 10 percentage points in how much of the data the model explains. That's pretty good, right? So we have some outlier repositories in the scan. It may just be that those repositories are what they are, or it might be a bug on our side or in the tools we used, who knows, but we can control for that: filter them out, throw them away, and we still get statistically significant results for all the variables we had. Interestingly, the coefficient for percent of functions cognitively complex now goes down to 0.27, which means roughly every four to five percent increase in complexity adds a bug to your project. That's still a pretty good correlation. The OpenSSF Scorecard, even on the filtered data set, still has a decent estimate and statistically significant results, but a terrible R-squared. So the Scorecard, even with the filtered data: no bueno.
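Here is a minimal sketch of that Cook's distance filtering step, again against a hypothetical CSV export with made-up column names. The talk doesn't say which cutoff was used to flag the 41 repositories, so the common 4/n rule of thumb stands in for it here.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("monthly_scan.csv")  # assumed export of the published data set
predictors = ["num_functions", "num_complex_functions", "pct_complex_functions"]

X = sm.add_constant(df[predictors])
model = sm.OLS(df["bugs"], X).fit()

# Cook's distance flags observations with an outsized effect on the fitted model.
cooks = model.get_influence().cooks_distance[0]
threshold = 4 / len(df)            # rule-of-thumb cutoff; the talk doesn't name its own
outliers = df.index[cooks > threshold]

filtered = df.drop(outliers)
refit = sm.OLS(filtered["bugs"], sm.add_constant(filtered[predictors])).fit()
print(f"dropped {len(outliers)} repositories, "
      f"R-squared {model.rsquared:.2f} -> {refit.rsquared:.2f}")
```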
So, are complex functions just more buggy in general? At this point that's a reasonable question to ask. What we see is that only 3.34% of the functions that were cognitively complex had a bug. That's a relatively low number, right? But compare it to non-complex functions, where it's less than 1%. And in terms of the total number of bugs, 44.7% of them were found in cognitively complex functions. So while only 11.8% of all functions were cognitively complex, they held almost half the bugs we found. That's a pretty good indicator that if you want to fix your repository and make it better, focus on the cognitively complex functions first: there are fewer of them, and they are likely the problematic part of your project.

Again, we may have screwed up the analysis; there are a bunch of limitations with this data set. We used scan-build, which is a very conservative static analysis tool. It's pretty good, but it doesn't detect all bugs, and it does have false positives. We might have had bugs in our own data collection and analysis. We only built 2,000 repositories, and it's only C and C++, so whether what we find here applies to other projects, who knows. And again, we only built repositories whose dependencies were available on Debian, so that might have skewed things as well.

Oh, a question: could I repeat the exact definition of cognitively complex functions? I suggest you look it up; the definition is involved, there are a bunch of factors in it. But it boils down to: is this function readable by a human? That's pretty much the gist of it. If a human were to try to read it and understand what the function is doing, would they have a good time? That's effectively the definition of cognitively complex.

Also, linear regression modeling might not have been the best tool here. The data we saw on those charts doesn't necessarily fall on a straight line, so maybe some other model would fit it better. I haven't found one that did, but there might be one.

So, some discussion points. Where are we, and what can we take away from all of this? Even if all the bugs we found were false positives: we didn't really care about the bugs, we didn't set out to find bugs, we wanted to figure out how to estimate risk. If we have tools that find a ton of bugs, and they are all false positives, and they all seem to fall into functions that are also really hard for humans to read, that just means we have code that is both hard to reason about for machines and hard to reason about for humans. Is that code going to be less risky or more risky? Make your own call. If our own model predicts a high bug count for a project and the OpenSSF Scorecard says the project is good, which one do you trust? Complexity seems to correlate with bugs, so why would security risk, however you define it, be different?

Another discussion point: scan-build is free, it has been out there for a long time, and it's very conservative. Why do we even have repositories where it finds bugs? We should find zero bugs with scan-build, yet we found a ton of them. So I actually turned part of this project into a GitHub Action: you can throw it on your project, have it run scan-build on every PR, and refuse to merge code that has bugs in it.
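As a rough illustration of that kind of PR gate (not the project's actual Action), here is a minimal sketch. scan-build's `--status-bugs` flag makes it exit non-zero whenever the analyzer reports anything, which is all a CI job needs in order to block a merge; the `make` invocation is a placeholder for whatever build command your project really uses.

```python
import subprocess
import sys

# Minimal CI gate: fail the job if scan-build reports any analyzer findings.
# '--status-bugs' makes scan-build exit non-zero when it finds potential bugs.
result = subprocess.run(["scan-build", "--status-bugs", "make"])
sys.exit(result.returncode)
```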
And yeah, in the end it's a pretty hard undertaking to correctly estimate security risk, and we absolutely do need automated tools like the OpenSSF Scorecard. But without data, how can we trust that the results are good? If you get a low score from the OpenSSF Scorecard, definitely go and check it out; you'll probably find issues with those repositories. But if you get a high score, at this point it's probably not a good idea to give the project a pass on that basis. So that's not good. And I really would like to see, when new security tools come out making claims about simplifying security for the masses, that they show their data. Please back your claims with something we can reasonably verify rather than guess at. If you think we made an error, please publish your data; everything of ours is up on GitHub, including the tools and the scripts, so please take a look and run your own scans. I hope this is just a starting point for a discussion, and just another data point that can feed into the Scorecard later. And please don't sue us. Thank you, that was my talk. Any questions?

So you explained that you selected repositories that already had 400 stars, and I understand why; there are good reasons for doing that. But I wonder if that self-selected for projects where the variance within the set is very small. If you extended it to projects with 40 stars or 10 stars or whatever, then all of those other things that were not statistically significant in your set might have much more of an impact, because then you're looking at repositories we would expect to be less well maintained, or to have more variance. So did you look at any of those kinds of projects at all? Did you do any kind of sampling?

Briefly, the answer is yes, that could absolutely have an effect on the data we collected. It could introduce a selection bias of its own, the same way that using only C and C++ projects, and only the ones whose dependencies are available on Debian, factors into the threats to validity of our findings. We started with projects that had 1,000 stars and then slowly went down to 400. That already takes three weeks to churn through, so if we wanted a cron job that kicks in on the first day of every month and does a monthly scan, 400 stars was kind of the sweet spot where the run actually finishes within the month. That's why we picked that number. If you want to test it with 40 stars, just fork it, change the number to 40, and see where you're at. It's doable, I just haven't tested that data set myself.

Yeah, I mean, rather than doing everything, you could do a random sample of some of the other stuff to see if it provides value. But thank you.

So what is your recommendation for scorecards like OpenSSF? One thing that comes to mind is whether projects enable certain compiler options that engineer away whole classes of issues; you could check whether you then find fewer bugs. If you enable a particular Clang option that doesn't allow some attack primitives or bugs to exist in the code, that should be a plus by default, I guess. The other thing is, should we ding people for using memory-unsafe languages and give a plus to projects using memory-safe languages? Is that another viable option? And of course that only covers memory safety bugs; there are logic bugs that are hard to find and probably easier to exploit as well.

Fantastic ideas. I have no answer, but I would love to see the data when you actually do that study and then make a judgment call on what the recommendation should be. It could go either way; until we see the data, I can't make that judgment. People like to say Rust is going to be the be-all solution for security, but it's still very easy to have buggy Rust code. So just because you use Rust, until you show me data that actually shows a correlation, I don't know how to make that call.
My question is, what do you consider a bug? Are you considering only memory safety issues, or logic bugs as well? How do you define it?

The list of bugs that scan-build finds is documented. The only check we specifically disabled was dead stores, where you move a value into a variable that gets discarded immediately after. scan-build finds those by default, but we consider them noise, so we turned that one off. All the other scan-build checks are in its documentation: buffer overflows, unsafe function calls like strcpy, basic things. So it's very conservative.

But these are typical memory safety issues, right? When you say "give me the data": when there is a particular option that prevents a buffer overflow from happening at all, when the compiler ensures it can't happen, the data will show that, right? I mean, you can run the numbers, and you could probably run the numbers on projects that have these compiler options enabled.

That could be an interesting signal to include in the OpenSSF Scorecard, but again, we need data on that as well, to see which compiler options actually have an observable effect.

So I think, since you've already run the data, it would be good for you to say: look, we found this compiler option enabled in this project and we find fewer bugs, and here is our data. Because people need some signal on what the right thing to do is, and there's a feedback loop that might come from you there.

Yes. So my recommendation is: definitely use the OpenSSF Scorecard, install it on your repository. Over time we will be able to see whether there is a correlation between CVEs and OpenSSF Scorecard scores, so definitely don't take away the message "don't use it." If we wanted to measure that right now, we couldn't, because it will take time to see whether the correlation shows up. We have a question behind you.

Could you please share that paper on the complex functions?

Yeah, absolutely. Actually, if you Google for clang-tidy cognitive complexity, the first result will link to that paper as well. It came from SonarSource, a static analysis company; they defined it originally, and then clang-tidy ported it. We actually found the cognitive complexity scoring itself to have bugs sometimes, when it runs into some funky C macros. So again, everything has bugs. Thank you. Yep. Yes?

Were they discarded for reasons of the toolchain, or discarded for reasons of the repositories?

Again, they were discarded by the statistical method called Cook's distance. It takes the regression model itself into account and checks which observations have an outsized effect on it: effectively it removes the observations one by one and sees how much the model changes, and based on that we can remove them. So it wasn't manually picked. I didn't go through them, and I don't know what those 41 repositories are. I relied on the statistical analysis itself to point out the outliers based on the model, so I have no idea whether those projects only look like outliers because of bugs in our tools, or whether they are genuine observations. It could be that we really do have repositories like that, and if we remove them, it actually hurts our model.
So we see an improvement in the fit that isn't actually a good thing. But based on the charts we manually looked at, it's probably reasonable to remove repositories where 100% of the functions are cognitively complex. We don't necessarily want to draw conclusions from some random repository that is probably an underhanded C contest entrant; that's not going to be applicable to anything else. Again, I'm guessing, I have no idea what those repositories are.

When I hear 100% complexity, I think of every codec package I've ever read. Those things are terrible, and every single device has a thousand of them.

Thank you, appreciate it.

I have two questions. The first is whether you considered at some point using data from OSS-Fuzz, which likely has fewer false positives than static analysis.

So OSS-Fuzz is actually included in the Scorecard as a metric, but the only thing considered is whether the project has an OSS-Fuzz harness. Whether that harness is any good is not factored into the score, and just because a project is in OSS-Fuzz does not mean it is going to be well fuzzed. It will be continuously fuzzed, but in my day job I do a lot of fuzzing, and it's really easy to find projects that are in OSS-Fuzz and still find a lot of bugs in them just by tweaking the fuzzing harness. The part of OSS-Fuzz that would be interesting to include here is probably code coverage: not just whether a project is fuzzed, but how well it is fuzzed should somehow go into the score.

Right. And I also wanted to know whether you examined any of the bugs close up, whether you found any interesting bugs from scan-build or any of the other analyzers.

Nope.

You didn't check them at all?

Nope.

Perfect. Thank you.

If you go to the website, though, we do have the repositories sorted by the highest number of bugs, so you get the top 100 repositories with the most bugs for each monthly scan. You can treat it as a kind of wall of shame; if you see your company on the wall of shame for bugs, go and remove it from there.

Any other questions? We have time. All right, if there are no other questions... there is one more? No? OK. Well, then thanks. Check out the project, and I hope to see more studies like this in the future. So thanks.