Hello everyone, I'm Jan-Simon Möller and welcome to my webinar about static analysis and how to do it on Linux using open source tools. You can easily reach me either by email or on IRC; the details are in the slides, which will be posted. So today I want to talk about static analysis: first I'll introduce what that means and why we want to use it, what the motivation is. Then I will show a couple of examples using either compilers or standalone tools, all of them open source, to give you a feeling for how you can easily make use of these in your projects. And at the end I will show how to integrate that into your Makefiles, or with Git hooks for example into your development workflow or your CI. Feel free to ask questions in the chat box anytime, or raise your hand and speak up. Let's dive in. So what is static analysis? Static analysis, or static code analysis, basically means we do the analysis while the program is not running. It is done by analyzing the source code, or by doing the same during compilation, and comparing it against a certain set of rules or patterns. The very first checkers basically looked for certain patterns in the source. Later on, compilers got that functionality as well, because a compiler already needs to interpret the code, so it can look for certain things there too. So, static versus dynamic: dynamic analysis means that we actually run the program. That's to be covered in one of the upcoming webinars, which I highly recommend. Today we deal with static analysis, which means we analyze the code while it is not running. In most cases we either scan the source code for patterns, or the process is embedded into the compiler: the compiler parses the code and produces some sort of intermediate representation (IR), and on that we can do the analysis. So that's the general intro: with static analysis we identify defects before we run the program. Where does it fit? 
Basically it fits right after you come up with your new function, your new code. We can do the static analysis before we even submit the code or do any runtime unit testing, so very early. The dynamic analysis is then usually done during unit testing, when you actually run the program, and it catches a different type of bug there. Both will find errors, but in different areas: static analysis finds bugs before the program runs, dynamic analysis finds bugs at runtime, and they are actually a nice complement to each other. It's rather easy to automate static analysis. There are either separate tools that you can run on your code base, or meanwhile the checks are even part of the compilers themselves, so this is not hard to add, and you can do it very early in your development cycle. That brings us to the motivation: why do we do this? Well, of course we want our code to be bug free, and static analysis can find bugs early. We can also find bugs that I as a reviewer might not spot, like an invalid access 30 levels deep. That's something you will likely not spot in a review when it happens, for example, in a completely different C file than the one that was touched. So that's quite a useful case, and in that regard it can easily complement the peer review. We can also use static code analysis to comply with certain standards or guidelines. There are coding guidelines like MISRA, or the ISO 26262 standard in automotive for example, which require that we actually do code analysis to show the code is bug free or meets certain standards. Especially in certain industries there is a requirement to do static code analysis, and that's true for medical, nuclear and aviation; there, the standards require static code analysis by default. So let's take a look at what open source tooling we have and how we can make use of it. In short, we have multiple types of tools available. 
The simplest and first ones use some form of string or pattern matching. That works for certain types of bugs, but it has limitations. The next type of analyzer was added to the compilers, because during compilation we already need to understand what the code does, so we can make better decisions instead of just looking for a certain pattern. Then there are tools that grew out of the Linux kernel community, specialized for the Linux kernel, and there are tools that grew out of user space. Of course there are proprietary tools as well, but that's not for today. Let's take a look at the Linux kernel first. Essentially it's a very big code base; right now we are at about the 20 million mark, 20 million lines of code, and this has always been quite demanding on the tools used. I remember when we first tried things like scan-build on the Linux kernel a couple of years ago: that just took forever, and we had quite some fun getting it to work. Originally the kernel community developed tools to help either the maintainers or the developers check their code. The simplest one is checkpatch, which is merely a tool for the maintainers to check incoming patches, basically against patterns. Then there is Sparse. We also have Coccinelle, on which there will be a talk next week by Julia Lawall, and there is Smatch, which is covered in two weeks. Now, the compilers did get support for static analysis. Clang was the first one to add this, but GCC added that capability with GCC 10. For user space there is quite a large number of tools available; I won't be able to cover all of them. Some are quite generic, like the compiler-integrated checks, while some are tailored to finding security issues, like Flawfinder. So somewhere in that continuum we have quite a few tools that can be used. Now, what's easy for us to use? If you are using GCC 10 then you can just enable -fanalyzer. 
This is a new flag added in GCC 10, and it enables 15 new warnings that do static code analysis. Clang has had the scan-build wrapper for quite some time, and there is Cppcheck, which is a standalone tool. Let's take a look at how these work. Cppcheck, as a standalone tool, you can just point at any of your source files and it will show you its findings on the command line. So this is a pretty straightforward tool: just cppcheck and your source file, done. For larger projects you might have to write your own script or adapt the Makefile, as shown later on. The -fanalyzer flag added in GCC 10 is now a very easy way to do this, and combined with -Werror it makes for a pretty straightforward workflow. You can easily try this out live on the website godbolt.org, which hosts an online compiler: pick a compiler version more recent than 10.0, add the flags, and you get this output here as well, so very straightforward. If you want to know more about GCC 10 and -fanalyzer, there is a whole blog post about its development over here. Clang has had support for this kind of analysis for quite some time. It started out with clang-tidy, which will report the issues found as well; you can even enhance that so you get fix suggestions back and so on, so it's quite a neat tool. It's in the same range as Cppcheck: clang-tidy, your C file, and there we go. There is also a wrapper for your whole project that is using LLVM and Clang, and that wrapper is called scan-build. If your project has a Makefile that uses $(CC), you can use scan-build: it will replace the compiler call with ccc-analyzer, write out its findings, and in the end you can even browse a report in your web browser. 
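The invocations described above can be sketched roughly like this (file names are placeholders, and the exact flags may vary between tool versions):

```shell
# Cppcheck: point it at a single source file (or a whole directory)
cppcheck --enable=warning main.c

# GCC 10+: enable the built-in static analyzer; -Werror turns any
# finding into a hard build failure
gcc -fanalyzer -Werror -c main.c

# clang-tidy on an individual file (the trailing -- means
# "no compilation database needed")
clang-tidy main.c --

# scan-build: wrap the whole build; it swaps $(CC) for ccc-analyzer
# and can produce an HTML report to browse afterwards
scan-build make
```

Each of these exits non-zero when it fails hard, which is what makes them easy to wire into scripts later on.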
So this is useful for larger code bases, other than our 10-line example here: when you have a plethora of source files, the view in the browser is actually very useful and can help you navigate through the issues found. Now, this requires that your Makefile uses the usual variables like $(CC), $(CFLAGS) and so on. If that's not the case, or if you have different compiler calls, then there is a successor to the scan-build tooling, and that is CodeChecker. CodeChecker is also a wrapper like scan-build, but the method is different: it will intercept and log the build calls, so we do not need to change the compiler used. Then we can analyze them and get a report out of it. You will find it on GitHub, they have Docker images, and more and more distros have it available. You also need Clang installed if you want to use it. It comes with a web UI that you can fire up easily (there is a Docker container for that), and it can be used to store your results. Here is how that works. You pull down CodeChecker, build and install it, add it to your path, and then you basically call CodeChecker log -b with the build command, so make -j something, and it will write out a file, compilation.json. It sounds more complicated than scan-build, yes, but it does not change the calls to $(CC) in your Makefiles; it basically records all compiler calls to be replayed. Then, in a second pass, we analyze the recorded calls and run clang-tidy and the Clang Static Analyzer. It certainly takes longer than a plain gcc -fanalyzer, because that is built in and this is a two-pass approach, but it gives you a little more freedom because you do not have to change the original compilation steps. In the end, when we have analyzed our code base, we can either write out the findings on the command line, render them into static HTML, or upload them to a built-in web server with a database that contains our results. 
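The two-pass CodeChecker workflow described here looks roughly like this (the build command, job count and output names are placeholders):

```shell
# Pass 1: record all compiler calls without touching the Makefile
CodeChecker log -b "make -j4" -o compilation.json

# Pass 2: replay the recorded calls through clang-tidy and the
# Clang Static Analyzer
CodeChecker analyze compilation.json -o ./reports

# Print the findings on the command line ...
CodeChecker parse ./reports

# ... or render them into static HTML to browse
CodeChecker parse ./reports -e html -o ./reports_html
```

The recorded JSON is just a list of compile commands, so the analyze step can be rerun as often as you like without rebuilding.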
Here's an example: I did a sample run on a certain code base, so here we see different runs, where I just appended the date. I just took some random code base I had at my fingertips on my desk, and you see that there are 161 findings. Now, not all of them might be real issues. This has not been fine-tuned, this is just the stock setup, stock rules as is, so not all findings might be relevant in your case. But what I want to show is that this allows us to find issues that are some levels deep, either 6 or 11, which I wouldn't spot that easily. So here we have a case where certain conditions lead to an object pointer being NULL: a certain condition happens in the code base, and then we end up somewhere down here with a pointer being NULL. I just took one example out of here. You can find the rule sets, or which scanners are used, in the CodeChecker configuration files, and you can also set up filters in here. So that is basically the continuation of the scan-build capabilities, which render out static HTML, into a full-blown web UI where you can store your results, navigate them, and identify the issues found. This is quite useful; we ran it on a couple of different code bases and we found issues, basically, 32 levels deep. Now, we did not go through every one of them and check whether that's really the case, so we have not fine-tuned the scanners used there yet. Okay, any questions so far? I wonder if you could talk about your experience with understanding the results. Yeah, from CodeChecker specifically, or either way? Either way, yeah. Good point. If you compare for example Cppcheck, well, this is a given, this is a rather easy example, it is kind of straightforward, but yes, there can be cases where it's kind of hard to understand what a finding means. 
But especially the Clang and GCC -fanalyzer messages, with the carets over here, tend to be quite spot-on nowadays in the recent versions. That being said, if you look at these reports, these runs here: we have basically 161 issues found, this is not fine-tuned, and some of them might be due to a header not being properly included in the analyze run. And I agree, some of these are pretty hard to understand in the first place, so you need to go through them and really look at each one. Yeah. Okay, thank you. Yeah, Jan-Simon, there is another question if you want. Yes: which analyzers should be combined to find more results? The easiest integration nowadays is basically either GCC 10 with -fanalyzer, which is now pretty easy to integrate and use, or Clang. If you use CodeChecker, there are already two Clang-based analyzers in use over here, and I'm sure you can extend CodeChecker to run another one. I think, given that the results between Clang and GCC's -fanalyzer are pretty equal now, you tend to pick one of them. Then the question is whether you add, for example, Flawfinder, if you have any special needs regarding finding security bugs as well. But usually either one of the compiler-based analyzers is fine, and then you pick another one on top. Yes? Oh sorry, go ahead. I have a question after we're done. Yeah, okay. Let's take a look at how we can easily integrate this into our own projects. GCC's -fanalyzer, as I said, makes it very easy: it's one of the C flags, you can add it and with that you are basically done. If your Makefile is already written to use $(CFLAGS), there you go. This is very easy, and if you want to enforce it you also set -Werror, and then the build will bail out if the analyzer finds anything. Done. The same is true for CMake and other build systems: just add it to the C flags. 
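For a Makefile that honors the usual variables, turning the analyzer on can be as simple as overriding the C flags; a minimal sketch, assuming the build respects $(CFLAGS):

```shell
# One-off: override CFLAGS for a single build
make CFLAGS="-O2 -fanalyzer -Werror"

# Or export it so every make invocation in this shell or CI job
# picks it up automatically
export CFLAGS="-O2 -fanalyzer -Werror"
make
```

With -Werror in place, any analyzer finding fails the build, which is exactly the enforcement behavior you want in CI.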
For Clang you can use either clang-tidy on an individual C file, or you use scan-build on your project, which will replace $(CC) on the fly: it will not change the C flags, but it will change the compiler call used. One step further is CodeChecker, which means you do not have to change anything in your project: it will record the compiler calls, replace them, and do the analysis with all options captured. So there are different levels of integration here; it depends on what your CI or your project setup looks like. Cppcheck can also be easily integrated into your Makefile, as a separate build target in this case, here with XML output for example, and that can be one of the default build steps. Besides Makefiles, there is also another way to integrate this. If you use Git, there is the concept of Git hooks. A Git hook is a mechanism to run, well, any code (shell script, binary, it doesn't matter) on certain events, and one of those is the pre-commit stage. The script needs to be in the .git/hooks directory and must have the corresponding name, and in our case it makes sense to have it as a pre-commit hook, so that before we commit anything, we run some action. I prepared two examples. One is a pre-commit hook using scan-build: it essentially runs make in the top-level project directory, wrapped by scan-build, and if scan-build finds any issues it will exit with 1, which triggers the exit 1 in the hook, and that means git commit is aborted. So this prohibits committing any code that does not pass scan-build. The second one is a Git pre-commit hook using Cppcheck. It's a little bit smarter, because it only checks the files changed. That's actually quite nice, because we only care about the files changed. Now, choose your weapons: either you do the full scan across all the files, or you just check the individual files changed. 
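A minimal sketch of the Cppcheck pre-commit hook described above might look like this; it checks only the staged C files and aborts the commit on any finding (the script goes into .git/hooks/pre-commit and must be executable):

```shell
#!/bin/sh
# .git/hooks/pre-commit: run Cppcheck on the staged C files only

# Collect staged files that are added, copied or modified, keep *.c
files=$(git diff --cached --name-only --diff-filter=ACM | grep '\.c$')

# Nothing relevant staged: allow the commit
[ -z "$files" ] && exit 0

# --error-exitcode=1 makes Cppcheck return non-zero on findings,
# and a non-zero hook exit aborts the commit
cppcheck --error-exitcode=1 $files || exit 1
```

The scan-build variant is the same idea with the body replaced by `scan-build make || exit 1`, trading speed for full-project coverage.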
The difference is speed: the Cppcheck hook might be a very easy addition, whereas scan-build or CodeChecker as a Git hook will take quite long to actually run and produce a result. There's a question: any idea what the false-positive ratio for GCC's -fanalyzer is? For -fanalyzer I don't have that number right now. Here, for the CodeChecker runs, we have quite a lot of findings, these 161, which we can either disregard or even call false positives. In our project we haven't fine-tuned that yet, so I don't have numbers for you right now. So what's your experience with other checkers, Cppcheck and others? In general it seems like there is a flood of information that comes out, and it's very difficult to parse. What's your take on how to do this more efficiently? Yes, all those checkers, and as I mentioned there are even more, Flawfinder and others, will report a lot of issues to you in the first place. So unless you start from scratch and add this right away: if you add it to an existing project, you will get a big number of issues found, and then you need to deal with them. Is there an easy way? Basically no, you have to address them in some way. Either you limit your scope to certain issues (for example, in CodeChecker you can specify which checker names, which specific types of issues, you want included), or you define exceptions for your project. But in general, besides filters or exceptions, basically the only way is to get through it in some form. Thank you. I mean, that's the whole point: they need to show the issues. In general the tools tend to spit out a lot of messages, which basically led to CodeChecker in the end, because with the output of Cppcheck on the terminal, or Cppcheck's XML output, you still need to parse it and make something out of it. 
So basically that led, more or less, to a system like CodeChecker, where you can browse, filter and post-process the results, because yes, in a large project there will be plenty. All right. So, to sum up: static analysis can help you improve your code base early, during the coding phase, even before you do unit testing or integration testing. It is a requirement of various standards (MISRA, ISO 26262 and others in various industries), and it's actually quite easy to automate or add to your CI loop. We started this automation for AGL, for example: there we use Yocto as the build system, and we added the bits and pieces to use CodeChecker with the Yocto builds. So that has been added on our side. What is left, and that's one point, is that we would have to fine-tune the scanners used and fine-tune the output, because yes, as Shuah mentioned, you will get a lot of output, and you need to dive through it and essentially go through each finding at some point. Okay, any questions? I put up some examples, and the slides, on GitHub here, and I have some more links and references to the various checkers presented here as well. John, yeah, speak up. Yeah, I had a question about the static analyzers. You're talking a lot about these false positives, and I'm wondering: these are probably because either you're using really bizarre semantics with your functions, or you just have a crazy coding style. Is there any kind of guideline so that people can write software that will actually not trigger these false positives? Well, I mentioned that mainly because we started scanning already existing code bases. If you add this right away and enforce it, we basically will pass, right? The whole point about adding and using this, in the case of ISO 26262 for example, is to enforce coding rules as a standard and avoid such bugs. That's why we would put it in place. 
The false positives mainly show up in code bases where we did not enforce any coding rules. Okay, but that means that theoretically all code could be without false positives? I mean, I won't come into a situation where I want to program something and it's not possible, because it will always trigger a false positive? I guess it would be nice to have a real example, like: we had to change x, y and z, and then we were able to get rid of the false positives. Yeah, we are not there yet. We are adding it to our CI workflow, but we are in the phase where we start to fine-tune the scanners and start to work on the code base. Okay, thank you. Yeah, the GitHub link, I'll paste that, no problem; the slides will also be posted on the LF page, so you will get all of that on the webinar page. Where is the chat window here... I have one more question, yeah: do you have any comparison to commercial products like Coverity etc.? Yeah, this is a hard question. I'm watching the Coverity reports for various open source projects and see what is done there. They have a little longer history, but well, I focused on the open source tooling, so I cannot speak to the commercial tools. If I may, I can add my experience with Coverity on the Linux kernel, though it is somewhat dated: it does generate a lot of false positives that you have to go look at. They do provide some knobs to disable some of the false positives; once you analyze an error you can kind of group them and say these are known things, those are not issues. However, Coverity also suffers from a lot of false positives. Keep in mind my information is dated, from probably 10-12 years ago. Okay, thanks. Okay, any other questions here? All right, well, thank you so much, Jan-Simon, for your time today, and thank you to all the attendees for joining us. 
As mentioned, the slides will be posted on the Linux Foundation webinars page, and the video will be posted on YouTube later today. We hope you join us for future mentorship sessions. Thanks so much and have a wonderful day. Thank you, bye.