Hello, my name is Marek, and in the next 20 minutes or so I will show you the main ideas behind our tool Symbiotic, and hopefully I will also show you how you can use it. Nowadays, the main method that we use to find bugs in software is testing. But you surely know that writing tests by hand is tedious work and usually it doesn't work that well, so we would like to generate the tests automatically. In the first talk today, we saw that we can use, for example, fuzzing to generate tests, but another method that you have probably already seen and that I would like to show you now is symbolic execution. Symbolic execution is in principle a quite simple method where instead of concrete inputs you use symbols. A symbol represents an arbitrary value that the input variable can have. And with these symbols, you just execute the program. As the program runs, the instructions generate expressions over these input symbols instead of numbers, and the interesting part comes when you run into branching in the program. If you had numbers, you would know which branch to take. But when you have symbols that represent a set of numbers, it is possible that you can go both ways. And if both ways are really possible, you need to fork the execution and follow both paths. This way you can explore all paths in the program, of course only if there are finitely many of them, and generate test cases for any path in the program. It is better to show this on an example: there is some function that takes three inputs. The function itself does nothing useful; it is just there to demonstrate symbolic execution. At the end it checks whether the sum of the inputs is equal to zero. So the symbolic execution assigns symbols to the input variables, let's say alpha, beta and gamma, and starts the execution.
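The exact code from the slides is not reproduced in the transcript, but it might be reconstructed along these lines (illustrative names and branch conditions; only the final check on the sum and the constraints mentioned later are taken from the talk):

```c
#include <assert.h>

/* Illustrative reconstruction of the slide example: three inputs,
   a few branches that do nothing useful, and a final check on the
   sum.  Symbolic execution forks at each if where both outcomes
   are feasible for the symbolic inputs alpha, beta, gamma. */
int example(int a, int b, int c) {
    if (a <= 0)      /* fork: alpha <= 0 vs. alpha > 0 */
        a = -a;
    if (b < 0)       /* fork: beta < 0 vs. beta >= 0 */
        b = 0;
    if (c != 2)      /* after this branch, c is always 2 */
        c = 2;
    /* the final check: can the sum be zero?  Along the path where
       alpha > 0 and beta >= 0, the constraint a + b + 2 <= 0 is
       infeasible, so only one branch survives there. */
    assert(a + b + c != 0);
    return a + b + c;
}
```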
And right at the first instruction, which is the if statement on a, where a is alpha, it tries to find out whether alpha can be less than or equal to zero, which is the one branch it can take, and whether alpha can be greater than zero, which is the other branch. And since alpha is an arbitrary value, it is possible to go both ways, so the symbolic execution branches. This way it proceeds further and further, and it collects the constraints that it finds at the branchings. So for example here it tries to execute this statement where a is alpha, b is beta, and c is 2 because it was assigned 2 earlier. So it checks whether alpha plus beta plus 2 can be less than or equal to zero, given that alpha is greater than zero and beta is greater than or equal to zero. And it finds out that there is only one way it can go, because one of the branches is infeasible in this case. And this way we can try to explore every path in the program, and if there are finitely many of them and we have enough resources and enough time, we actually succeed.

What about integer overflow? Overflow on a signed variable is undefined, but overflow on an unsigned int is possible. Okay, it depends on the theory that you use to decide these constraints. The point is that if the function has unsigned int numbers, then basically they can overflow. Yes, and what is the theory that you use to decide these constraints? Usually you use the bitvector theory, which means that you really model the numbers as bitvectors. But in mathematics we model numbers as natural numbers. Sorry? So you model numbers as... No, no, we model numbers as bitvectors. It's really the fixed-size bitvectors that you find in computers. In this case, you find the overflows. If you model them as natural numbers, then you do not find the overflows. And of course, you can still get infinite paths, even in the bitvector setting. You can find overflows this way; it just depends on what you want to find. Okay.
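The bitvector point is visible even in plain C: fixed-width unsigned arithmetic wraps around, which is exactly what a bitvector theory models and what a natural-number model would miss. A tiny illustration:

```c
#include <limits.h>

/* With fixed-width (bitvector) semantics, unsigned arithmetic wraps:
   UINT_MAX + 1 is 0.  Modelled as mathematical natural numbers, the
   sum would simply be a larger number and no wrap-around exists. */
unsigned increment(unsigned x) {
    return x + 1u;
}
```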
Okay, and this way we can basically partition the inputs into sets where each input from a given set will drive the program down a particular path, that is, execute a particular path in the program. One of the tools that we can use to perform symbolic execution is called KLEE, which is an open-source tool that runs on LLVM bitcode. And since it runs on LLVM bitcode, we first need to take the sources and compile them to LLVM, and then we can pass them to KLEE. So here comes the first short demo. I have the function from the slides, and the error function is defined as an assertion. What I need to do to use KLEE is to annotate the program to say which variables are symbolic. I do that with the klee_make_symbolic function, and then I can compile it and run it. So I can use Clang to compile it to LLVM, maybe getting some warnings, and then run KLEE on the generated bitcode. It shows me that out of the six paths it found in the program, three violate the assertion. And this is how we can use KLEE. We actually use KLEE in our tool too, but we use it after some transformations of the program. What is the problem with this approach? First, you need to annotate the program. And second, symbolic execution is computationally very demanding. A program can have many paths, a huge number of paths, even an infinite number of paths. And there are some techniques that can alleviate this problem a bit. What we do is that we try to pre-process the code before passing it to KLEE, and we try to pre-process it such that the symbolic execution is faster. One of the quite obvious things that we use is code optimizations. That's the first step, nothing surprising. The next step is that we use program slicing. Program slicing is a technique that can remove from the program the parts, the instructions, that are not of interest to us. How do we do that?
We need to compute dependencies between instructions, where instruction A depends on instruction B if A consumes some value that B generates, or when B controls the execution of instruction A. Once we have computed the dependencies between instructions, we can slice the program, which basically means that we keep only the instructions that can somehow influence the values at the error location or the reachability of the error location. And yes, just a small example. This is some short code with a cycle that zeroes a buffer, and if we are interested in this assertion that checks that there is no out-of-bounds error, then we can remove the buffer completely, and once we have the dependencies in the code computed, we can do it automatically. Okay, we removed basically one or two instructions, but the main point is that, for example, this instruction accesses memory and it is in a cycle, so removing it can have quite a good effect. And of course, in a bigger program you can remove more. Okay, so this is program slicing, and it works well when you have assertions in the code and you are interested in those assertions. But we also look for other types of errors, like dangling pointer dereferences and so on. And there is a problem that in this case we do not know the targets of the slicing, the so-called slicing criteria. We could take every pointer dereference and set it as a slicing criterion, but then the slicing wouldn't help much because we wouldn't slice anything away. So what we do is that we run a fast static analysis that tries to find possible errors in the program. It just looks at the program, processes it, and divides the instructions into two types: the safe instructions that cannot lead to the error we are interested in, and the instructions that may possibly exhibit the error.
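A minimal version of the buffer example could look like this (my reconstruction, not the code from the slides). The assertion depends only on the loop index, so the buffer write is not in its slice:

```c
#include <assert.h>

#define N 16

/* Reconstruction of the slicing example: the cycle zeroes a buffer,
   and the assertion checks only that the index stays in bounds.
   The buffer write does not influence the assertion, so once the
   data and control dependencies are computed, a slicer can remove
   the write (and with it the whole buffer), eliminating a memory
   access inside the cycle. */
int zero_buffer(void) {
    char buf[N];
    for (int i = 0; i < N; ++i) {
        assert(i >= 0 && i < N); /* no out-of-bounds access */
        buf[i] = 0;              /* sliced away: never read later */
    }
    return N;
}
```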
This is an over-approximation, so it is possible that instructions that only may exhibit the error are actually safe during runtime. But we do not know, so we take them as potentially unsafe and set them as slicing criteria. Now we know which instructions are interesting for us with respect to the error we are looking for, and we can slice with respect to these instructions. After these three steps, we can again perform code optimizations, because the program slicing and the first round of optimizations, mainly the program slicing, can enable some more aggressive optimizations, and the code may change again. Okay, and apart from the transformations I just mentioned, Symbiotic automatically marks the input memory as symbolic, so we do not need to manually annotate the program with klee_make_symbolic, and it replaces undefined functions with symbolic stubs. That is because we do not want the user to have to change the program before passing it to Symbiotic. What our limits are now: we do not support C++, mainly because of exceptions, and we do not run on parallel programs yet. And of course there are problems with scaling, because the program transformations cannot solve the scalability problem of symbolic execution; they can just help in some cases. So, a short demo. Okay, just to see that this is the same code as before, only we omitted the annotation with klee_make_symbolic. In this case, the default mode of Symbiotic is to take uninitialized variables as symbolic. It's a bit problematic because reading them is actually undefined behavior, but the user usually wants something like this. Okay, so we can run Symbiotic on this program. The default mode is to look for assertion violations, and we get some messages about starting slicing, blah, blah, blah. And it found an error here in main, with these values of the variables. Okay, that's the same example as before. Or I can show you this example. There is some singly linked list.
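The list demo is along these lines (a reconstruction, not the exact file from the talk):

```c
#include <stdlib.h>
#include <assert.h>

struct node { int value; struct node *next; };

/* prepend a value and return the new head */
static struct node *push(struct node *head, int v) {
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);          /* keep the sketch memsafe */
    n->value = v;
    n->next = head;
    return n;
}

/* build the two-element list from the demo and tear it down again;
   forgetting one of the free calls is exactly the memory leak that
   the memsafety check reports */
int list_demo(void) {
    struct node *head = push(push(NULL, 1), 2);
    int sum = head->value + head->next->value;  /* 2 + 1 */
    while (head) {
        struct node *next = head->next;
        free(head);                             /* free every element */
        head = next;
    }
    return sum;
}
```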
It creates two elements in the list, and there is some assertion. Okay, we can try to run Symbiotic on this code, and it says that no error was found, which means that the assertion is not violated. That was the default mode of Symbiotic that looks for assertions, but if we want to look for memory safety errors, we need to tell that to Symbiotic. So we tell it to look for the property memsafety. In this case, when we run it, it finds that there is a memory leak in the program. And the memory leak is that we haven't freed the middle element, so we can fix that, try it once more, and now it works. So no error there. For the last example, I have here the code of Wayland, which is an inter-process communication library. I guess you know it. And at the end of the file, I filled in this main, where there is some structure that represents a circular buffer. This structure has some data, a head, and a tail. And there is some function, buffer put, that just copies data into the buffer. So if we create a buffer, copy this hello world string into the buffer, and assert that the buffer size is the same as the length of the string, we would expect that to be true. So we can run Symbiotic on that. And it reports a compilation error, that it cannot find the FFI header file, which is quite a usual problem with compilation. So we can fix it, I hope. I have it somewhere. So I will just put it there now. Okay, and we see that it works now. It optimized the code, sliced it, and after some time it found that the assertion is violated. What's the problem? The problem is that the buffer was uninitialized. The model that Symbiotic found for this case was that the buffer data was initialized to zeros, the head variable was some garbage, and the tail variable was again zeros. So okay, we can fix it. Okay, let's see whether it works now. Yes, now it works. And just to show how it would look without slicing and optimizations.
So now it took like seven seconds. Alright, now without slicing and optimizations: first we see that many more parts of the code need to be somehow modelled by Symbiotic, and now it takes like 18 seconds. It's not that big a difference, but on bigger programs it can be more significant. Okay, so that's basically how you can use Symbiotic. In the future we would like to somehow solve the issues with scalability; in some cases there can also be a problem in the slicing itself, so we would like to employ faster analyses to slice the code faster. There is the inherent problem of symbolic execution, which can be fought by, for example, abstraction, or we could, instead of symbolic execution, use some different tool, something like the Clang Static Analyzer, which actually cannot be used for this exact purpose, but something like that, which is less precise, so it can report some errors that are not real, but it is faster and is able to run on bigger programs. Then we would like to model mainly the POSIX environment better. I mean, it works somehow now, but it's not perfect. And of course we would like to add support for C++ and threads, because once Symbiotic supports C++, we can run Symbiotic on Symbiotic, since it is written in C++. To conclude: Symbiotic is a tool for finding, or faster finding, bugs in C programs, and it does it by combining static analysis with program slicing and symbolic execution. At present it runs only on sequential C code and still has some scalability issues. That is all from me, thank you for your attention. Of course, I will be happy to answer your questions.

I mean, the condition is like calling a library function, and you don't really want to add the whole library to the analysis, right? Can you somehow define what's possible there? Yes, a good question. I actually have that here, I think, yes. This call, it's treated the same, like a black box.
Yeah, if you don't limit the output of this thing in any way, then basically everything is possible, and you can find some kinds of bugs which don't really exist. And if you look at the function, it's documented that it can return only 0 or -1, for example, something like that, but here it could return, say, INT_MAX. Yes, you can do that, and it can be integrated into Symbiotic quite easily, because you can just add a model in the C language, or in the LLVM code, that somehow tells how the function behaves. So instead of the whole function you would just write a short stub, with something like: assume that if the input is such and such, then the output is going to be such and such, and assume that this cannot happen, and so on. So I think you could write this model of the function and add it into an appropriate directory; Symbiotic would look it up and try to link it to the program once it finds that the function is used in the program. You can do that; at present we do not do that, and as you said, we assume that anything is possible.

Can you actually scale horizontally by adding more servers? Can you do it, for instance, on Amazon, run it on many servers, and then analyze the whole thing? In principle yes, because during the slicing we can generate different slices that represent different parts of the program, and then you can distribute the work among computers, but we don't do that. But yes, in principle yes. So the algorithm allows that? Yes. Okay, I'm out of time, so thank you for your attention.
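As a footnote to the question about library models: such a stub could look roughly like this in plain C. The names are hypothetical, and the fallback nondet is a fixed stand-in so the sketch runs as ordinary C; under a symbolic executor it would be a fresh symbolic value constrained by an assume.

```c
#include <assert.h>

/* Hypothetical nondeterministic input: under a symbolic executor this
   would be a symbolic value; here it is a fixed stand-in so the
   sketch runs as ordinary C. */
static int nondet_int(void) { return -1; }

/* Model of an external library call that is documented to return only
   0 or -1.  Instead of linking the whole library, the verifier links
   this short stub and explores just the documented behaviours. */
int lib_call_model(void) {
    int r = nondet_int();
    if (r != 0 && r != -1)   /* with a real assume() this path is pruned */
        r = -1;
    return r;
}
```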