So good afternoon everyone, welcome to the talk on Coccinelle, a program matching and transformation tool. Can you all hear me clearly? I am Himangi Saraogi from India; I am currently an undergraduate computer science student at the International Institute of Information Technology, Hyderabad. I worked as a Linux kernel intern with the FOSS Outreach Program for Women last summer, under the guidance of my mentor, Julia Lawall from Inria. My project was titled Coccinelle. How many of you have heard the word Coccinelle before? A quick show of hands. Okay. Coccinelle is a French word for ladybug, a bug that eats other bugs. So what we are trying to do here is to create semantic patches that generate ordinary patches fixing bugs.

Coming to my work on Coccinelle: my project during the summer was to add and harden Coccinelle scripts to integrate into the kernel. The first thing I had to do was identify bugs that could be solved using Coccinelle. We have a large repository of Coccinelle scripts called Coccinellery, and I went through the scripts to identify what kinds of bugs can be solved by Coccinelle. Subsequently, I found cases of such bugs and sent in patches fixing them, to learn whether each was an issue of actual concern. Coccinelle scripts were then developed to fix those bugs, and the results produced by running them across the kernel were analyzed to ensure that there were not many false positives and also that the results covered all the possible cases. Then patches were sent to include the Coccinelle scripts themselves, and some of them have found their way into the Linux kernel.

Coming to why we need Coccinelle: bugs are unfortunately everywhere, and systems code like Linux is often huge, rapidly changing, and written in C. Linux, being highly critical software, needs to be less buggy; it is used everywhere: on servers, desktops, embedded systems. So fixing these bugs is definitely a plus.
Coming to Linux, we have a large number of developers with varying levels of experience: maintainers, developers of proprietary software, and so on. There might be people who are new to the code base and not aware of previously existing and possible bugs, so new bugs might be introduced.

Coming to some programming problems. Programmers sometimes don't really understand how C works. An example is !e1 & e2, where the precedence of ! is higher than that of &, while the intended meaning might be the negation of e1 & e2. A second example is having similar APIs, where a common API exists but another specific function is created for a specific use; having a uniform function for the same purpose is always beneficial. The third case is where a function call might return an error, but we do not go ahead and test whether the return value is an error before using it for further processing. So there is a strong need for pervasive code changes.

This is an example of the bad bit-and case, where we have a ! anded with a constant. Here the constant, the DMA start bit, has its lowest bit as zero, and since !x evaluates to 0 or 1, the result is always zero. In the next example, pci_map_single is used instead of dma_map_single; the only difference between them is that the first argument of pci_map_single is a pci_dev, while for dma_map_single it is a pointer to the dev field. Here is an example of a missing error check: kmalloc returns NULL when there is insufficient memory, and if we do not check whether the return value of kmalloc is NULL, we face a crash at the subsequent dereference.

This is an example of collateral evolutions. We have a number of libraries that provide APIs to be used by clients, but these libraries themselves can change. For example, we have a function foo which takes one integer argument, and it is now replaced by a function bar which takes two arguments, where the first one is the same.
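To make the precedence problem concrete, here is a small stand-alone C illustration; the function names and flag values are invented for this sketch, not taken from the talk:

```c
/* The two interpretations of "!e1 & e2". The helper names are invented
 * for illustration; the bug pattern itself is the one from the talk. */

/* What the compiler actually does: '!' binds tighter than '&',
 * so this computes (!e1) & e2 -- a 0 or 1 ANDed with e2. */
int buggy_check(int e1, int e2)
{
    return !e1 & e2;
}

/* What the programmer usually meant: negate the whole bitwise AND. */
int intended_check(int e1, int e2)
{
    return !(e1 & e2);
}
```

With e1 = 1 and e2 = 2, the buggy form yields 0 while the intended form yields 1, exactly the kind of silent divergence a semantic patch can hunt for.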
All the clients that have been calling the old function need to be modified to use the function bar, where the first argument remains the same and the second argument needs to be figured out. So why are collateral evolutions significant? We have a number of libraries with a lot of clients, and a lot of driver support libraries, which could be one per device type or one per bus, examples being the PCI library, the sound library, and so on. There is a lot of device-specific code calling these libraries, and it makes up more than 50% of Linux. Many evolutions and collateral evolutions keep occurring; some examples are adding arguments, splitting data structures, adding or removing getter and setter functions, and so on.

So there is a requirement for automation. Some of the major requirements are: the ability to abstract over irrelevant information (in this case we do not need to know the name of the expression, so DMA_CNTRL needs to be abstracted); the ability to match scattered code fragments (as we saw in the kmalloc case, the dereference was far from where kmalloc was used to initialize); and the ability to transform code fragments, like replacing pci_map_single by dma_map_single and vice versa.

Some of our goals with Coccinelle. The first major thing is bug finding and fixing: we automatically match code to find patterns of a particular bug and then automatically transform it to fix those bugs. For collateral evolutions, whenever a library changes, we need to find patterns of interaction with the library and then systematically transform the interaction code.

What can Coccinelle do? Coccinelle does static analysis to find patterns in C code, and transformation of the code is done to fix those bugs. We can run Coccinelle patches in four modes to get different results. The first mode is the patch mode, which finds the cases and makes the transformations to generate patches.
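Returning for a moment to the missing-error-check pattern: in plain C it looks like the sketch below, with malloc standing in for kmalloc and dup_name an invented helper (neither name is from the talk):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper illustrating the missing-error-check pattern;
 * malloc stands in for the kernel's kmalloc in this sketch. */
char *dup_name(const char *src)
{
    char *buf = malloc(strlen(src) + 1);
    if (buf == NULL)       /* without this check, the strcpy below */
        return NULL;       /* would crash when allocation fails    */
    strcpy(buf, src);      /* the dereference may sit far from the */
    return buf;            /* allocation, which hides the bug      */
}
```

Dropping the NULL check leaves a latent crash that only fires when the allocator fails, which is why the distance between allocation and dereference makes this bug easy to miss by eye.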
The next is the context mode, which can be used to mark out positions of interest and places of bugs. The next is the org mode, which lists results in an emacs org todo format with the exact line number and position of a bug, like an error or a warning. Report mode logs a custom message where the error or warning occurs.

Coming to the Coccinelle tool: it is used for program matching and transformation on unpreprocessed C code. We can have scripts that are run every time a change is made to a file, to ensure those specific bugs are not being introduced. A single small semantic patch can modify hundreds of files at thousands of code sites.

What is the semantic patch language? It is based on the syntax of patches; a semantic patch basically abstracts and generalizes patches. It is a declarative approach to transformation: it does a high-level search and abstracts away from irrelevant details. Among the irrelevant details, we do not need to worry about spacing, indentation, and comments. We can give names to statements, expressions, constants, and so on; these names are called metavariables. We can skip irrelevant code using a triple dot, "...", which is the control-flow operator. We have variations in coding styles, and these are abstracted by isomorphisms: !y, y == NULL, and NULL == y are all the same, and mentioning any one of them in the patch will match all three. So it is a patch-like notation, with minus and plus for expressing transformations.

How does Coccinelle work? Coccinelle parses the C code into a control-flow graph, and it parses the semantic patch, after expanding the isomorphisms, into a formula in computational tree logic (CTL). The control-flow graph and the CTL formula are then matched using a model-checking algorithm. Any match found is modified, and finally we unparse to get the transformed C code. Now some examples.
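Putting those pieces together, a minimal context-mode sketch of a semantic patch might look as follows; this is an illustrative sketch, not a quoted script, and the kmalloc NULL-check pattern is just a convenient example:

```smpl
@@
expression E;        // metavariable: matches any C expression
statement S;
@@
* E = kmalloc(...);  // '*' marks matched positions (context mode)
  ... when != E      // control-flow dots: skip code not using E
* if (E == NULL) S   // isomorphisms: also matches !E and NULL == E
```

The two starred lines are reported wherever the allocation and the check are connected along some control-flow path, however far apart they sit in the file.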
So, finding and fixing the !x & C case: combining a Boolean expression with a constant using bitwise AND is usually meaningless. In particular, if the rightmost bit of the constant is zero, the result is always zero. The solution here is to add parentheses. Here is an example where we have a function call anded with a constant, the CCI PMU overflow flag; here we need to negate the expression obtained after the AND. This is a semantic patch that does that. We have metavariables for the expression and the constant, and as you can see in the second part, we remove any occurrence of !E & C and replace it with the corresponding statement with added parentheses. Here we are dealing with a very special case where the right-hand side of !x & y is a constant; we have a disjunction because an expression of this form with a non-constant y is likely to be intentional.

Coming to the second example, we have inconsistent API usage. We do not actually need a function pci_map_single that just calls dma_map_single again. Instead, we would like to make a transformation removing pci_map_single and replacing it with dma_map_single. But for that we will have to make the first argument a pointer to the dev field, and also replace constants of the PCI type with the corresponding DMA constants. This is the patch that does that. Here we have four metavariables of type expression, which are the four arguments of pci_map_single. pci_map_single is deleted, and we add the function dma_map_single, which takes as its first argument the pointer to the dev field. After making this transformation, we have dma_map_single calls instead of pci_map_single calls everywhere. Then we match the fourth argument, the PCI constants, and replace them with the corresponding DMA constants.

The third example is dereferencing a possibly NULL value. As you can see, sk is assigned the sk field of tun, and only after that do we check whether tun is NULL.
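Before moving on, the pci_map_single replacement just described can be sketched as a semantic patch roughly like the following; the &e1->dev spelling and the pairing of constants are assumptions based on the description, not a quote of the actual script:

```smpl
@@
expression e1, e2, e3, e4;          // the four arguments
@@
- pci_map_single(e1, e2, e3, e4)
+ dma_map_single(&e1->dev, e2, e3, e4)

@@
@@
(
- PCI_DMA_TODEVICE
+ DMA_TO_DEVICE
|
- PCI_DMA_FROMDEVICE
+ DMA_FROM_DEVICE
)
```

The real script would restrict the second rule to the fourth argument of the calls rewritten by the first rule, as described above.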
So tun may be dereferenced before the NULL check. This is a semantic patch that does the transformation to move such an assignment after the NULL check. We have some metavariables, and correspondingly we try to find an expression that is dereferenced and assigned to some variable, with a NULL check on that expression afterwards. What we are doing here is: if we have a statement of the type T i = E->fld, we skip some lines, and after that we check whether the expression E is NULL, then we move the assignment to the statement after the NULL check. But in that case we need to ensure that the skipped code does not use E or i anywhere, and also that E is not being passed to a function by address. Another important point to note is that E == NULL will match !E and NULL == E as well, due to isomorphisms.

Coming to devm functions. We have a number of functions to allocate resources, and these resources need to be freed after being allocated; some examples are kzalloc, kmalloc, and so on. We do have corresponding devm versions, which are the managed versions, where we do not need to worry about freeing the resources. If we do not have a free, we get memory leaks, so using a managed interface is always a plus. In the probe function, we try to transform any memory allocated using a function like kzalloc to the corresponding managed version, and remove any kfree calls that occur in the remove function of the driver.

Here is a semantic patch that does that. The first rule is the platform rule, which tries to find the probe function and the remove function and binds their names to the identifiers probe_fn and remove_fn. The second rule, called prb, inherits the name of the probe function from the first rule. It tries to find that probe function's definition, and it matches only if the first argument is a platform_device structure.
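In sketch form, the platform rule just described might look like this (the identifier names are assumptions following the description, not quoted from the script):

```smpl
@platform@
identifier p, probe_fn, remove_fn;
@@
struct platform_driver p = {
        .probe  = probe_fn,     // bind the probe function's name
        .remove = remove_fn,    // bind the remove function's name
};
```

Later rules can then inherit probe_fn and remove_fn by writing platform.probe_fn and platform.remove_fn.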
Subsequently, in the body, it checks whether we have any allocations using kzalloc. If we do, these are transformed to devm_kzalloc, and the first argument is changed to a pointer to the dev field of the platform device. The third rule depends on prb, which means it will run only if we did make a transformation from a kzalloc to a devm_kzalloc. In this rule we inherit the remove_fn name from the platform rule and remove any kfree calls that occur in the body of the remove function.

This is an example of collateral evolution. The SCSI host get and put functions were dropped from the SCSI library. This required collateral evolution in all the proc_info callback functions, where we had to add a new parameter, remove any declarations of the Scsi_Host variable that were initialized using calls to the get and put functions, and, since the new parameter is passed in as an argument, we no longer need to check whether it is NULL in the function body. Here is the semantic patch that helps to do this. We have a metavariable for the proc_info function and identifiers for the host variable and the new parameter. We do away with all the declarations of the Scsi_Host variable, the calls to get and put, and the NULL check in the function body, and add the Scsi_Host pointer argument to the function. Running this patch gives us one of these transformations: in the 53c700 driver's proc_info function, instead of having just one parameter, limit, we have two parameters, limit and the Scsi_Host pointer. In the function body we no longer have the Scsi_Host declaration or the calls to the get and put functions, and we also skip over the NULL check. The remaining code stays as it is.

What do scripts look like? As mentioned, we have four modes. Until now we were running in the patch mode itself, but we can have patches that run in all four modes. So we declare that we have the four modes: patch, context, org, and report. Subsequently we need rules for all four modes. This script basically removes any statement of the type if (e) BUG(); and replaces it with BUG_ON(e);.
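The second and third rules of the devm conversion described above might, again as a hedged sketch with assumed rule and identifier names, look like:

```smpl
@prb@
identifier platform.probe_fn, pdev;
expression e;
@@
probe_fn(struct platform_device *pdev, ...)
{
        ...
-       e = kzalloc(
+       e = devm_kzalloc(&pdev->dev,
                        ...);
        ...
}

@depends on prb@
identifier platform.remove_fn;
expression e;
@@
remove_fn(...)
{
        ...
-       kfree(e);
        ...
}
```

The depends-on clause is what makes the kfree removal fire only when an allocation was actually converted to the managed version.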
BUG_ON is basically a macro, and it is more uniform and better to use it. The first rule depends on context and will run if we are running the patch in the context mode; it marks out any position where we have an if (e) BUG(); statement, using the * operator for that. Then we have the normal patch-mode rule, which deletes lines with if (e) BUG(); and adds a BUG_ON(e);. Next are the org and report modes. These need to find the position where the error or warning, the position of interest, occurs, so we use a position metavariable, position p. Writing the if statement with BUG annotated with @p binds the position of that particular statement to the variable p. Subsequently we can write a Python script that gets the value of the position p as an array of structures; those structures have a file name, line number, column number, et cetera. After getting the value of the position as an array of structures, we print the positions in a todo format for org mode, and with a custom message in the report mode.

Things to remember when using Coccinelle. Semantic patches might have multiple rules, and these rules are always applied in order, file by file. The * operator just marks where the changes are likely to occur, without actually making any transformations, in the context mode, as we saw. Positions can be marked, and any relevant information, such as line numbers or variable names, can be printed as messages. Another use is not just to make transformations but to make searches as well: as we saw with the probe function and remove function, if we need to find a function which is a probe function, we can list all the probe functions that occur, using just that one rule, running over the entire drivers directory. We can check whether the syntax of a script is right using the parse-cocci option.

There are some imperfections in Coccinelle.
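As a sketch, the four-mode script described above might be laid out like this; it is modeled on, but not quoted from, the kernel's BUG_ON semantic patch:

```smpl
virtual patch
virtual context
virtual org
virtual report

@depends on context@
expression e;
@@
* if (e) BUG();

@depends on patch@
expression e;
@@
- if (e) BUG();
+ BUG_ON(e);

@r depends on org || report@
expression e;
position p;
@@
 if (e) BUG@p ();

@script:python depends on org@
p << r.p;
@@
coccilib.org.print_todo(p[0], "use BUG_ON instead")

@script:python depends on report@
p << r.p;
@@
coccilib.report.print_report(p[0], "use BUG_ON instead")
```

coccilib is Coccinelle's helper module for Python scripting rules; p arrives as a list of position structures carrying the file name, line, and column mentioned in the talk.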
Whenever we parse a C file, we may need to expand all the header files that are included, which might be interleaved and lead to a huge amount of code to parse. But if we do not need any macro expansions or anything else included from the header files, we can skip the expansion of the header files using the no-includes option. Another issue is pretty-printing: the transformations that are made do not always follow the 80-character rule, and there might be some extra spaces, et cetera, that need to be fixed by hand. And if there is a small error in the script, it is not very easy to debug, as checking the syntax or trying to make the transformations gives you an error message that is not very informative. Here, for example, we get an error message pointing at line 6, column 2; that position might have the bug, but it is not always accurate.

To conclude, Coccinelle is a program matching and transformation tool. Over 450 patches made using Coccinelle have gone into the Linux kernel. Many Coccinelle semantic patches are in the Linux kernel itself, in the scripts/coccinelle directory, and the coccicheck target runs them on the whole kernel, a subdirectory, or files with uncommitted changes. Coccinelle semantic patches look like normal patches and fit very comfortably with the habits of systems programmers; they are very easy to read and are widely accepted by the Linux community. Probable bugs have also been found in other open-source repositories using C; examples include GCC, VLC, Wine, et cetera. Thank you, any questions?

[Session chair] At this point, would anyone like to ask Himangi a question? I'll call for any questions out there.

[Audience] Are the patches reversible?

[Himangi] Yes, we have an option controlling whether to make the changes. We can add in-place to actually make the changes in the files, or, if we do not use in-place, we can have all the patches written out instead: the code remains as it is and we just get the patches.
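For reference, typical spatch invocations for the options mentioned in the talk look roughly like this; this is a command sketch with invented file and directory names, not output from a real session:

```shell
# syntax-check the semantic patch only
spatch --parse-cocci fix.cocci

# generate a patch on stdout without touching the files
spatch --sp-file fix.cocci --dir drivers/net > fix.patch

# transform in place, skipping header-file expansion
spatch --sp-file fix.cocci --in-place --no-includes drivers/net/foo.c
```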
[Audience] Okay, my second question: early on you had a replacement of a constant. Go back a bit further, I think; it was right at the beginning, I forget which one it was. You pattern-matched a fair bit and then said this constant should be replaced with that constant, the PCI DMA ones. I assume one of the initial constants had to be there. What happens if it's not there?

[Himangi] If we do not replace the PCI DMA constants with the DMA constants, actually nothing would happen, but since we are using the DMA function call, having a uniform use of constants is also desirable.

[Session chair] Any other questions?

[Audience] I was just going to ask if you have your slides up on your website. I came in right at the end and I wanted to see them. Is Coccinelle able to, say, transform variable names, in a pattern-like way rather than explicitly? For instance, extract and remove a prefix from a class of variable names?

[Himangi] Yes, we can do that. We can find a particular variable name using a particular rule, and we can mark all positions of those variables. Once we get the positions, we can replace those positions with the new variable name, so that is possible. Since we can replace a function name, we can obviously replace a variable name as well.

[Session chair] Last chance, a call for any questions? We've got a little bit of time left. No? Okay. Well, on that note I would like to thank Himangi for her talk, and on behalf of the LCA team, please take this gift. And could you give a big thank-you to Himangi for her talk, please.