So I should start by saying I don't know terribly much about security, so I'm just going to present things from my point of view. And please feel welcome to ask questions at any time during the talk; if we diverge, that's sort of okay. This talk is in three parts. First I'm going to give an overview of Coccinelle: the history, what it is, and so on. Then I'm going to give some more complicated examples that I have done myself using Coccinelle, and then I'm going to conclude with some lessons that I've learned from those examples and in general. I would like to somehow convey the kinds of things that one can do with Coccinelle, and encourage people to try it out and see if it can help solve relevant problems.

So what is Coccinelle? The idea behind Coccinelle is sort of find once, fix everywhere. You're looking through your code, you see something mysterious, something looks wrong, something you just don't understand, and you want to either find more examples of the situation so that maybe you can understand it better, or you might realize it's completely incorrect, and you could say to yourself: if somebody has made this mistake in this particular place, then, in the Linux kernel's 13 million lines of code, surely someone else has had the same issue as well. Our approach to this problem is the tool Coccinelle. It's an open source tool that's available in many Linux distributions as well as on our website. It provides two things, basically: static analysis to find patterns in C code, and automatic transformations to perform evolutions and bug fixes on the patterns that it finds.
The really novel part about it is that it's user scriptable and it's based on the patch notation. Basically you can take the fragment of code that you don't like or don't understand, generalize it a little bit so that it might apply to other places in the kernel, and then run it on the entire kernel and get a bunch of results. So ideally there should not be too much new stuff you have to learn; you should just be able to focus on the code that you have and what you want to do with it. It's based on the patch notation, with things we call semantic patches, because we raise the level of abstraction a little bit. Our goal is, again, to be accessible to C developers. We're not targeting sophisticated people in programming languages or logic and so on; we want to target the people who know the code and know where the problems are.

So here's a very simple example. This is a patch that was submitted a number of years ago now, and the subject of the patch says "the !x & y bug strikes again". So already you can see that this is a problem that happens over and over again, and we would like to be able to automate finding the problem and fixing it. And even though the problem is very simple, it's just that we're missing some parentheses in a certain place, it's actually not very easy to find. Because what would you do if you wanted to find it? You might search for the exclamation point, but there are lots of exclamation points in the kernel, and most of them are well behaved and perfectly fine. You might search for the ampersand, but again there are many of those in the kernel that are perfectly well behaved and well defined. You might search for one and then grep for the other, but sometimes these might not even be on the same line, and anyway that combination can also be okay: the exclamation point is a Boolean operator, and there's also the double ampersand, which is another Boolean operator, so !x && y is perfectly fine.
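In Coccinelle's semantic patch notation, the fix for this problem can be written in just a few lines (a minimal sketch, along the lines of the slide being described):

```smpl
@@
expression E;
constant C;
@@
// !E & C applies ! to E alone; almost always !(E & C) was intended
- !E & C
+ !(E & C)
```

Here E matches any expression and C any constant, so this one pattern covers every syntactic variant of the bug, regardless of line breaks or spacing.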
So the problem is easy to introduce, it typically results in some error condition that's never checked for, and it's actually hard to find in practice. In Coccinelle, what we can do is take this pattern of code, which is right here, and just abstract over it a little bit. The basic issue is that we have an exclamation point, followed by some expression, followed by an ampersand, followed by some constant, typically a mask that we want to bitwise-and with the expression. So that's what we look for: E is some arbitrary expression, C is some arbitrary constant, and we want to remove the thing that looks like this and add in the thing that looks like this. That's all there is to it. And then you can run spatch foo.cocci on your kernel, and it will go around and fix the problem everywhere.

By default, Coccinelle doesn't actually change your code. There's an option that will cause it to change your code, but the default is just that it's going to generate a patch. You can look through the patch, you can see if you like it or not, you can remove the parts you don't like, and then you can apply it to your code. So here's an example of a use case. This is a real example of kernel code, with exactly the same spacing and newlines and so on. We run this specification on it, and we get this out the other end. You could object that we have changed the alignment. This happens because I have removed this expression and added it back, and when you do that, Coccinelle has to figure out how to pretty print it, and it does it in this way. I'm not sure that's really a negative point in this particular example. Any questions at this point?

So now I'll give a little history. This project was started in 2004, when I was on sabbatical at the École des Mines de Nantes in France. This was about the time when Linux 2.6 had recently been released, I believe.
And the idea was that there were perhaps many drivers still targeting Linux 2.4, and we would like to port them to Linux 2.6. So that was the original goal. I spent a certain amount of time studying all of the changes that happened in Linux 2.5, and it quickly became apparent that porting drivers from Linux 2.4 to Linux 2.6 was going to be kind of unrealistic, because a lot of things had happened, and they were not all obviously automatable. But there seemed to be a certain kind of change, which we called collateral evolutions, which is when some library function changes and all of the clients of the library have to be updated in some way. So maybe a function needs a new argument. If the new argument is just always NULL, then that's quite easy to do, but sometimes the choice of the argument depends on some information in the context. The classical example was usb_submit_urb, which got an argument that is either GFP_ATOMIC or GFP_KERNEL, and you have to look around in the context to see whether a lock is being held or not. So you could find the call sites using grep, but there's no easy way to automate the fix, because it needs to look at context information.

So then we developed it further over the next few years, at the time I was at the University of Copenhagen. We had four people working on it: myself in programming languages, someone else in systems, someone in security actually, and a postdoc who later moved on to Facebook and made a similar tool for PHP. We submitted our first patch to the Linux kernel in 2007. This addressed the problem where you have a kmalloc followed by a memset, and that can be collapsed into kzalloc. Again, this is a simple looking change, but it's not so trivial, because we don't want to turn all the kmallocs into kzallocs, and we don't want to turn all the memsets into kzallocs; it's only when you have the combination of them together.
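A simplified version of that kmalloc-plus-memset semantic patch might look like this (a sketch; the version actually shipped with the kernel has extra `when` constraints to handle more corner cases):

```smpl
@@
expression x, E1, E2;
@@
// only rewrite when the allocation and the zeroing go together
- x = kmalloc(E1, E2);
+ x = kzalloc(E1, E2);
  ... when != x
- memset(x, 0, E1);
```

The `... when != x` constraint means that x must not be touched between the allocation and the memset, which is what makes it safe to combine the two calls.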
So that's what Coccinelle is good for: when you have some information in the local context that helps you decide what you should be doing. We published a paper on the language at EuroSys in 2008. And then we published another paper which is fairly widely visible, the Faults in Linux paper, at ASPLOS in 2011. I'll show some graphs from that paper. The idea there was that we took Coccinelle and used it to re-implement a bunch of rules which had previously been implemented by the group of Dawson Engler, who was one of the people behind Coverity. What they had done is they had found a bunch of different typical kinds of bugs in Linux code, such as a dereference before a null test, double locks, forgetting to free things, and things like that. And they had applied their checkers to the very earliest versions of Linux. They found some trends, such as that drivers were much less reliable than other parts of the code, and so on.

So we replicated the same study over Linux 2.6. Here we have the average line, and here we have device drivers. We can see that at the beginning of 2.6 device drivers were above average, and at the end of 2.6 they were below average, in terms of the number of faults per opportunity for the fault. So one could say that perhaps the quality of device drivers is improving. Of course this is only with respect to the kinds of faults that we were looking for; there may be many other kinds of faults that are relevant as well. And this lists the kinds of faults we have. You can see the purple one and the red one are null related faults, so those seem like the most problematic ones. There are some others that occur much less often.

Okay, so the current status. I am now at Inria, which is a research lab in France. The project is led by myself and my colleague Gilles Muller; we're both full time researchers, so we have a certain amount of time to devote to this project.
We have an engineer who's provided by Inria as well, and at the moment we have another young engineer who kind of does things on the side, though his position will be ending soon. So that's sort of the manpower behind the project. We occasionally have external contributors. Coccinelle is entirely implemented in OCaml, which may somewhat reduce our potential contributor base, but it means fewer patches that I have to review.

So, impact on the Linux kernel. As I mentioned, our first patches were in 2007; that was the kmalloc plus memset going to kzalloc change. At the moment, over 4,500 patches in the Linux kernel mention Coccinelle in some way, and over 3,000 are from other people, so not just from our research group. These are not necessarily all people who have used Coccinelle themselves to write something. There are also over 50 semantic patches inside the Linux kernel, and people can just run them on their own code. The 0-day build testing service also runs many of them on the commits that are integrated into the Linux kernel. If you want more examples, the ones in the Linux kernel are intended to be of high quality, but we have a whole bunch of other examples at coccinellery.org, and those are just completely random: for any patch that I send, the associated semantic patch is supposed to end up there. It was good enough for my use; it's not necessarily good enough for anyone else. But it does serve as a bunch of examples that people can use to get an idea of what to do.

So, some examples of the semantic patches that are in the Linux kernel. There are some for generic C errors, for example checking for an unsigned variable being compared to be less than zero, and a bunch of things related to null and so on. There are also generic errors that are specific to particular Linux functions, such as double locks, and things related to iterators.
There are also more API-specific Linux errors, such as things related to the use of the devm functions. So those are the bug finding rules. There are also things for modernizing code: various APIs that people might not be aware of and that older code might not be using, to bring it up to date. And contributions are welcome. You can either submit a semantic patch yourself that you would like to have in the kernel, if you find some problem that other people may have, or you can just contact us and say, this looks like it could be useful for Coccinelle, and then we can try to work with you and put something together.

And some people actually use Coccinelle on their own, write their semantic patch, make their changes, and then helpfully include the semantic patch in the commit. So these are some things that other people have done. The first one is pretty simple: we have a macro, we have its arguments, and we don't want to use the macro anymore, so we just rearrange things in some way. Then here's another one, and this one looks even simpler than the previous one, but actually it's not so simple at all. This is changing the dev field of a certain structure, renaming it to parent. But dev is probably the most common field name in the entire Linux kernel; there are maybe hundreds of different structures that have a field with this name. So here we're actually using the feature of Coccinelle that it will try to go around and find type information, and it will only change the dev field into parent when we have a structure of the appropriate type. This actually reduces the work of the person a lot, because they don't have to look at thousands of files that are completely irrelevant.

So this is sort of the status of what people are doing these days. I hope that through this talk you will see some other things you can do, and maybe feel ambitious and try them out by yourself. So now, some more complex applications.
These are things that I have done; other people have certainly done some more complex applications as well. I'm going to go through three such applications. The first is devm-ification: the devm functions provide a form of limited memory management for the Linux kernel, so we can both simplify the code and remove memory leaks, because the memory freeing happens automatically. The second one relates to blocking functions: if you have a function that blocks and you call it under a spinlock, then you might get a deadlock, so we'll try to find that situation. The last one is constification: you have a structure that has some function pointers, or in general you have a structure whose fields are never modified. It's nicer for documentation to indicate that it's never modified by putting const, and it also protects the contents of the structure, by causing the compiler to put it in memory that no one can modify at run time.

So, devm-ification. The idea is that basically the lifetime of a device driver goes from calling the probe function, which sets things up, to calling the remove function, which makes the device inaccessible. In the setup process there are many resources that are allocated that have a lifetime all the way from the probe function to the remove function. So the idea is that we can use the device infrastructure to remember what these resources are, and to remember to free them; we don't have to put that in the individual device driver. So here we have a probe function. I'm just considering the case of kzalloc; there are many other resources that one can consider. We allocate some memory, things go on. If we return from the function in the normal way, then this memory is stored in some structure; it's going to be used on and on later in the lifetime of the device driver. But if we fail, then it's going to be freed.
And then in the remove function we need to free the memory as well, because it's not useful anymore. So this is one way to write it. If we write it with the devm functions, then we change the allocation to devm_kzalloc, and we also add an extra argument, which is going to be a device argument. This argument is going to be used to store whatever region of memory we've allocated, so that the device library can manage its lifetime afterwards. So we made that change up there: we got the pdev from the argument of the probe function, and we just changed the name of the called function. And if you look down here, we remove the kfree on the same value, and we remove the corresponding kfree in the remove function. This is very important, because if you just change kzalloc into devm_kzalloc, then you've made things much worse: now you've introduced a double free, because the device library is not going to know that the free happened already.

Okay, so we can see how we can do this with Coccinelle. There are going to be three steps. The first one is to find the probe and remove functions. One thing we could do is just use a regular expression: we could assume that all probe functions end in the word probe, and all remove functions end in the word remove. It's possible that they actually do; I don't think I've ever seen one that didn't. But it seems very unappealing somehow. So what we're going to do instead is to realize that probe functions are functions that are stored in a particular field of a particular kind of structure. Here I've worked on platform drivers, which are very common, but there are some other kinds of drivers; you could make some more rules to collect their names as well. And the remove functions are the ones that are stored in the remove field.
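A first rule that just collects these function names might look like this (a sketch of what I understand the slide to show; a fuller version would have additional rules for other driver types):

```smpl
@platform@
identifier p, probefn, removefn;
@@
// match platform_driver definitions and record which functions
// are stored in the probe and remove fields
struct platform_driver p = {
  .probe = probefn,
  .remove = removefn,
};
```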
Typically these structures would have more fields in the structure-field initialization. Coccinelle lets you put just the relevant fields, in whatever order you want, because matching of the other fields is implicit and there's no constraint about what the order should be. So now we've made a rule. This rule, unlike the rule that we saw before, doesn't actually add anything or remove anything; it just collects some information. We have our metavariables probefn and removefn that contain the information we want, and the rule has a name, so we can refer to these variables from other rules.

Okay, so in the second rule, what we want to do is go and look for the definition of the probe function. It's going to be here: we can see it's an identifier, and its name is inherited from the previous rule. I haven't thought terribly much about what the set of parameters of this function would be. I'm just showing that the first one has to be a platform device structure, which is determined by the fact that the probe function is being stored in the platform driver structure by the original initialization. Maybe there are some other parameters; I don't really care. Actually, I think in this case there are no other parameters, but it's not very important, so I can just put dot dot dot. And then here is where we're making our transformation: we are turning the call to kzalloc into a call to devm_kzalloc, with an appropriate new first argument. I don't really know where in this function this call is going to occur; it might be on the first line, it might be later, it's not really clear. So I put these little brackets around it, and this means: wherever this code occurs in this function, make this change. So one thing we wanted to do was to change the kzalloc into devm_kzalloc. The other thing we want to do is the free. Okay, so some more notation. The free is going to happen sometime after the kzalloc.
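The transformation rules being described, for the probe function and then for the remove function, might look roughly like this (a sketch reconstructed from the description, assuming a first rule named platform that bound probefn and removefn; details of the real semantic patch may differ):

```smpl
@prb@
identifier platform.probefn, pdev;
expression e, e1, e2;
@@
probefn(struct platform_device *pdev, ...)
{
  <+...
// switch to the managed allocator, tied to the device
- e = kzalloc(e1, e2)
+ e = devm_kzalloc(&pdev->dev, e1, e2)
  ...
// the error-path kfree, if present, must go (it is now automatic)
?- kfree(e);
  ...+>
}

@rem depends on prb@
identifier platform.removefn;
expression prb.e;
@@
removefn(...)
{
  <...
// drop the kfree of the same expression in the remove function
- kfree(e);
  ...>
}
```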
So that's what the dot dot dot means here. We want the kfree that is freeing the thing that we allocated, so we're going to refer to the same metavariable here. There might be some other kfrees in between; they don't make any difference, because they're not on the structure that we're allocating. And then we have this little question mark here; the question mark means it's optional. So when we have this dot dot dot, it means we start here and we look at everywhere we can go in the code after that, because there might be ifs, there might be loops, there might be all different kinds of things in between, and we look at all the control flow paths. Some of them might have a kfree, some of them might not. One reason for not having a kfree is that on the normal execution path there is no kfree, because we're going to want to use this memory later on. Another reason for not having a kfree is maybe the person forgot; maybe the code has a memory leak. So it's not that uncommon that this thing is missing for the wrong reasons.

So that takes care of the probe function: we've changed our function call name and we've removed our kfree. And now we have our remove function, and in the remove function we want to do the same thing, but we just want to remove the kfree. Again, I say wherever the kfree occurs, just get rid of it. kfree has an argument e, and this argument has to be the same expression as the one that was found by the probe function. Any questions?

So I worked on this in 2012, and I submitted around 40 patches. Other people have worked on this problem as well; there are hundreds of patches that relate to this issue. But still, the semantic patch that I've just presented to you finds over 170 opportunities, so there's still more work to do. One could say: now you have this automatic rule, it will just do everything, so why can you not just solve all 171 of them at once and just send all the patches?
There's a certain obligation to actually look at the code that you've generated, think about it, and be sure it's actually okay in every way. So it still takes a certain amount of time. What we're doing is removing the problem of searching for the opportunities, and removing the kinds of errors that people can make whenever they change some code by hand, but it still takes some thought to actually generate patches for things. And this runs in under 30 seconds on an old eight-core machine that I have. We use indexing: there are some tools in the Coccinelle distribution for indexing your software, and then it will find, for example, all the files that mention kzalloc, all the files that mention the different kfrees and so on, and Coccinelle will only work on those files. That makes a huge improvement in performance.

Okay, so the second example is related to locks and blocking functions. I'm just going to focus on the functions that interact with user level. These functions can cause page faults, and that could cause a block, so they shouldn't be called under locking functions like spin_lock and so on. I want to write a semantic patch that will detect this problem. This is a fairly complicated problem to solve: it might be that the code could just be moved so that it's not under the lock, but it might be that the locking is necessary in some way, and then something much more complex has to be done. So in this case, we're just going to be finding the problem, and not actually having the goal of automatically fixing it.

So here's my rule. Just so it fits on the slide, I've only considered the spin lock functions. At the beginning there's just a list; this is called a disjunction. It just makes a list of all of the different ways that you could take a lock. And here we have the dot dot dot, which means again that we go out and search: we start at this point and we search along all possible execution paths.
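The rule looks roughly like this (a simplified sketch with a single lock function and a single blocking function; the real rule uses a disjunction over the various spin lock variants, and covers the other user-level access functions too):

```smpl
@@
expression l;
@@
// report a path that takes a lock ...
* spin_lock(l);
// ... and reaches a blocking function with no unlock in between
  ... when != spin_unlock(l)
* copy_from_user(...)
```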
And here I'm saying that there are some things that should not appear in those execution paths. If you have a lock, but then there's an unlock after it, then there's not going to be a problem; the blocking function is fine. The problem comes when we have a lock, and then there is no unlock, and then we reach the blocking function. Here I've put a star. In the other cases, we had minus and plus for the things that we wanted to change; star is just for things that we want to be informed about. And with star, the semantics is a little bit different than what I explained. When you have minus and plus, it actually searches along all execution paths and tries to be sure that everything is consistent. When you have star, it's more oriented towards bug finding, since you don't have a fix, and it just searches for whether there exists a path that gets you from a spin_lock to a copy_from_user without releasing the lock.

This is a problem that can occur within a single function, and Coccinelle only works on a single function at a time, but it's also a problem that becomes even more complex in an interprocedural way. You might have one function that takes a lock and calls another function, and in that other function you have your blocking function. Or, in the worst case, you have a function in one file that takes a lock, and then the blocking function is in a function in some other file that it calls. It's possible to do that with a feature called iteration; I'm not going to talk about it, but if you have questions I can discuss it later.

So just for the interprocedural case, we found a few issues. The first one was related to copy_from_user, and that was a confirmed bug; there was a lot of discussion about how it could best be fixed. Another case is surely a bug, but the person has already put a FIXME comment in the code, so we can perhaps assume that they're going to get to that someday. I don't understand your lack of confidence.
Another case is a false positive. In this case, somebody takes a lock, and then they go on and call some function, and it's the other function that releases the lock; there's actually a comment in the code that indicates that. So this is the kind of situation where you have to ask yourself: do you want to massively complicate your Coccinelle rule to address this issue, or do you want to just deal with it by yourself? In this kind of situation, I would suggest just dealing with it by yourself. We're not searching for perfection; we're just trying to find things that are of interest, and trying to help make the changes when those changes are relevant. And there are some other cases related to get_user and put_user, and I don't know whether they are actual bugs or not.

So the third example is constification. We have a declaration of a structure, and we just want to add the word const in this place, and then none of this stuff can be modified. This looks like an extremely simple change, but it's actually not so easy to do, because you have to be sure that all of the uses of the structure are also able to have the const modifier; that is, they should not actually be updating the fields. If you just go around and modify all the structures that look like this, all the places where you have the initializers, you'll probably end up with many things that don't work out, and end up confused and frustrated and wasting a lot of time.

So we propose a multi-step approach instead. We can use Coccinelle to find structures that only contain functions. I don't mean to say that it's only structures containing functions that should have const; what I'm doing is taking a pragmatic approach.
Structures that only contain functions are often completely initialized at the beginning, so there's a good chance that things will work out. And it's also a case where it's quite important to put const, because we don't want people overwriting those structures with other functions that do malicious things. So we use Coccinelle to find a bunch of candidate structures. Coccinelle is going to have to go and look at the type of the structure to ensure that all of the fields are of function type, and then it's going to report to the user the places where those structures are initialized. Then the user can look at the output and manually choose something that looks promising. And then we have another Coccinelle rule, which is extremely simple, that just updates all of the occurrences of that type with const. And then you can compile the result, and let GCC figure out for you whether things are going well or not.

So this is the rule that adds the const. It's actually split into two rules. With Coccinelle we are good at matching things in a positive way; it's maybe a little less well suited to matching things that aren't there, in a negative way. So what we do here is: the first rule checks for the type references that already have the word const; those are the ones that we don't want to change. And those are marked with what's called a position variable: we have a metavariable which is just going to store the position of that thing. So those are positions that we don't want to change. The second rule also has a position variable, which says that this position should be different from any of the positions that we matched in the previous rule, and in that case we do want to add const.

So I've submitted over 100 patches based on this issue. I never tell you how many patches have been accepted, because I actually just don't pay any attention to that. I submit the patches.
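To recap, for one structure type chosen in the first step, those two rules might look like this (a sketch; the structure name foo_ops is a hypothetical stand-in, since the transcript doesn't name the types involved):

```smpl
// rule r: remember the positions of declarations that are already const
@r disable optional_qualifier@
identifier i;
position p;
@@
static const struct foo_ops i@p = { ... };

// at every other position, add const
@disable optional_qualifier@
identifier i;
position p != r.p;
@@
static
+ const
  struct foo_ops i@p = { ... };
```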
If you don't pick them up, then the problem is perhaps on your side. None of them have actually been turned down, to my recollection, but I don't know if they've all been picked up. So we have two Coccinelle processes here. Detecting structures with only function fields is a bit slow; that's something you would do once. You could run it overnight perhaps, and then you can look at the results and pick out the structures you want to work on. Constifying each one of them is extremely fast.

Okay, so now I'm going to go on to lessons learned. These are some strategies that I take when I have a problem and I want to think about how to solve it using Coccinelle, and I'll illustrate them using the previous examples. The most important lesson, really the thing I want you to take away from this talk, is to start simple. Coccinelle is supposed to be easy to use; it's not supposed to be intimidating. The idea is: you see a problem in your code, and you should be able to write something very similar to the problem in your code and get some useful results. If you look at the semantic patches in the Linux kernel, they might look much more complicated. Those are things where we've really hardened them so that they won't get terribly many false positives, so that people will enjoy using them. But they're not representative of the things that you might want to start out by writing. So basically, just start with a semantic patch that matches the common case. You might end up with some false positives. I encourage everyone, of course, to look at their results and see if they're okay before actually applying them to their code, or at least before committing them. You might find some problems. Maybe you have a hundred results and two of them are problems; you just ignore those two, and you can move on.
If you find a hundred results, and out of the first 10 you find that eight of them are bad and two of them are good, then you might say: okay, I would like to revise my semantic patch so that it won't give those bad answers, so that I won't have so much stuff to look at afterwards. But there's a trade-off, and it's always better to start simple.

So we can look at the case of devm-ification, which we saw previously. As I mentioned, first we have our kzalloc, it turns into devm_kzalloc, and then we need to remove kfrees from both the probe function and the remove function. In the probe function it's usually pretty easy: we have x = kzalloc(...), and so we're going to want to remove the kfree of x. But in the remove function it's not really clear. The remove function just has some local variable; the developer might have called it x, they might have called it q, they might have used x for something else. We don't really know. And these are callback functions, so it seems extremely hard to trace the data flow of that value x from the probe function back out to the remove function. That seems kind of hopeless. We could try to come up with some other strategy for figuring out how they match up with each other, but as I showed before, the strategy that I took was just: we have some expression in the probe function, and we search for a kfree of the same expression in the remove function, because Linux developers typically use the same words or letters or whatever to refer to the same kind of thing. This is completely unsafe and so on; it's just something that works fairly well. And so this is what I had written: again we have the expression probe.e, inheriting the expression that we got up there, and searching for the same expression down here.
So you could wonder, maybe, okay, there are two things that could happen. We could have false positives: it could be removing frees that it should not be removing. Then you look through your results and you see, oh, this is removing e, but now e means something else, so I don't like this case. But somehow the worst case is what we call false negatives, which is when it should be removing something but it's not, because the thing has a different name. So we can again use Coccinelle to explore the situation a bit. We can take this rule that we have here, focus on this line right here, and turn `expression probe.e` into just `expression e`. So now it will just remove any kfrees whatsoever. Okay, this is certainly not safe. We might have many things being allocated, and they might not all actually fit the devm model in some way, but I'm just using this for explanation, to see what happens. And it turns out that the number of files that are affected is actually exactly the same. Each file typically only has one probe function and one remove function, so it seems probable that with this rule we're getting all the cases, and now we can go through and check them as we feel motivated to do so. But again, it's not sound, it's not complete; it was just meant to be helpful. So the next step from this idea of starting simple is incremental development. You start with something simple, but you might find that the results are unsatisfactory in some way. As I noted, the devmification semantic patch returns 171 files. It might take some time to actually study all those 171 files and be sure that everything is okay, that the generated code looks nice, and so on. So you might say to yourself, I don't want to look at 171 files, that's just too many.
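The exploration step is just dropping the inheritance, so the kfree rule stands alone. Again a sketch with illustrative names, deliberately unsafe, only meant for comparing the set of affected files:

```smpl
// Instead of inheriting e from the probe rule, declare a fresh
// expression metavariable: now any kfree in a remove-like function
// matches, whatever the developer called the variable.  The point is
// not to apply this, but to see whether it touches more files than
// the inherited version -- if the counts agree, the naming heuristic
// is probably not missing cases.
@any_remove@
identifier fn, pdev;
expression e;
@@
fn(struct platform_device *pdev)
{
  <...
- kfree(e);
  ...>
}
```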
I would rather look at fewer files that are somehow known to be better behaved. And there's actually a subtle issue with devmification, which is that, with respect to releasing, devmification takes the kfree that you have in your code and moves it to after the end of the probe or remove function. So previously the developer decided at what time the kfree happened, and now the devm library is going to make that decision. It does it in a last-in, first-out sort of way. And so in some very subtle cases it might actually be wrong. So we would much prefer, perhaps, to look at good things that we can just fix quickly, rather than things that are a bit complicated and murky. Or maybe we would rather look at the complicated and murky things all at once; sort of have things classified into the easy cases and the hard cases, so we know where we should devote some effort. So what we can do is rewrite the semantic patch a bit, make it a bit more complicated, to avoid cases where there's any risk of the ordering being changed. Okay, so this is a bit complicated, but the important parts are here. Here we have the kzalloc. Before, I allowed the kzalloc to appear anywhere in the function. Here I'm saying there should be no function call before it, and here, for the kfree, I'm saying there should be no function call after it. So now we're going to end up with a lot of cases where the kfree is right at the end of the probe function, and when it's right at the end of the probe function, then moving it past the end of the probe function won't change the ordering in which things happen. So previously we had 171 results, which was maybe a lot to look at, and now we have only 51. So it's a bit more manageable. The results are a bit more uniform, and so the whole process will be a bit easier.
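The "no function call before the allocation / after the free" restriction can be expressed with `when` constraints on `...`. A sketch, with the same illustrative names as before; a `when != f(...)` with a fresh identifier metavariable is the usual idiom for "no function call at all on this path":

```smpl
// The kzalloc must be the first call in the function: no call f(...)
// may occur on the path from function entry to the allocation.
@probe@
identifier fn, pdev, f;
expression e, size, flags;
@@
fn(struct platform_device *pdev)
{
  ... when != f(...)
- e = kzalloc(size, flags);
+ e = devm_kzalloc(&pdev->dev, size, flags);
  ...
}

// The kfree must not be followed by any further call g(...), so it is
// effectively the last thing the function does -- moving it past the
// end of the function then cannot change the ordering of releases.
@remove@
identifier fn2, pdev2, g;
expression probe.e;
@@
fn2(struct platform_device *pdev2)
{
  ...
- kfree(e);
  ... when != g(...)
}
```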
Another way we might want to refine things: as I mentioned, when we end up with a bunch of false positives, or positives that we for some reason just don't want to look at, it can be useful to refine the semantic patch to somehow eliminate them. So this is the case of calling blocking functions while holding locks. Coccinelle has some issues with conditionals. In general it's looking through control-flow paths for things, but it's not aware of the values of expressions. It's just looking at the structure of the code, and sometimes the values of expressions imply that certain execution paths can't actually happen. That can cause false positives, and we may like to remove those cases from our output so we don't have to look at them, at least in the beginning. So here's an example of a false positive. We have basically two parts to this function. In the beginning we have a lock and an unlock, and then later on we have the copy_from_user. But the lock and the unlock are under conditionals, and so Coccinelle thinks we can go into this branch and then skip over that branch, or skip the first branch and go into this one, and so on. And so it will think that we can come down through here, do the spin_lock, and then jump down here and do the copy_from_user, which of course is not possible. So if we would rather not see this report, then we can think about how we can refine our semantic patch so that it will disappear. There are various ways one could do it. One could look for this special case where we have a test of some expression and then a test of the same expression. That might be a bit of a complicated way to do it. The way I'm going to do it is just to make two rules.
So in the first rule, this is sort of like the thing I did with const before, where I have one rule that finds cases that I don't like and don't want to see anymore, and then I have another rule that searches for things at a different position, and that's the one that actually reports the results I'm going to see. So here in the first rule, what I'm saying is that if we have a spin_unlock that reaches our function call, then we're going to assume that everything is okay. We can get to our copy_from_user in two different ways, so if there is one way that goes through an unlock, then I'll say I don't want to see that one, and I will only see the results where the only way to get to the thing is via a lock. Okay, so that's going to eliminate our false positive here, because here we have the unlock, which reaches directly to the copy_from_user. Of course, it might also eliminate some real bugs. These locks might be different from each other. The tests might be different from each other. It might be possible to get from here down to here. You still have the semantic patch you had before; you can explore the situation in different ways. Again, it's nice to look at a uniform set of results, do something with them, and then generalize things and see more issues. Okay, so the last lesson that I've learned: in general, with Coccinelle you write some patterns, it looks at the code, matches those patterns, and reports things to you. But you can also mix the pattern-matching language with scripts that are written in either Python or OCaml, and then you can do all sorts of other things that you want. You can collect some statistics about the things that you find with the pattern, and you can also look for external information outside of the code. So I'm doing this in the case of constification.
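The two-rule scheme relies on position metavariables: one rule marks the positions it wants to suppress, and the reporting rule excludes them. A sketch; the lock functions and the printed message are illustrative:

```smpl
// Rule 1: mark calls to copy_from_user that some spin_unlock can
// reach; those are assumed to be fine.
@ok exists@
position p;
expression l;
@@
spin_unlock(l)
...
copy_from_user@p(...)

// Rule 2: report only positions NOT matched by rule 1, i.e. calls
// that are only reachable with the lock held.
@bad exists@
position p != ok.p;
expression l;
@@
spin_lock(l)
...
copy_from_user@p(...)

// A script rule turns the remaining matches into reports.
@script:python@
p << bad.p;
@@
print("%s:%s: copy_from_user may be reached with a spin_lock held"
      % (p[0].file, p[0].line))
```

As the talk notes, this trades false positives for false negatives: if the lock and unlock expressions differ, rule 1 may still suppress a real bug.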
With constification, making the change is easy; checking that the change is okay is kind of tedious and time consuming. You have to compile the code. Your change might affect a certain number of files, and it takes a certain amount of time to be sure that everything is okay. So what we would like is for Coccinelle to warn the user about what the potentially difficult cases are, in general just to give the user some idea of the situation. What's the likelihood that this is actually going to be something that we want to constify? So there are some things that we might like to know. Are all the fields function pointers? Okay, that's the case by definition the way we wrote the semantic patch, but you might want to be more general. You might want to find structures where there exists a function pointer and there are other fields that might have other types, and you might then want to know whether all the fields are function pointers or not. Another useful piece of information: we might have many instances of a structure, some of them might be constified already, some of them might not. If other instances are already constified, then it might be a good suggestion that all of the instances of the type can be constified. And the last thing we might like to know is what compiler options or configuration options are necessary to actually be able to compile the files that we're going to be touching if we make this change. If we want to rely on the compiler for checking that everything is okay, then when we run make, we have to actually be compiling the files that we changed. If we're compiling for x86 and our files are only relevant to ARM, then we're not going to get the information that we want. So we would like some feedback from Coccinelle about these different points, to help us decide what things we would like to change and what things we might not.
And so we can collect this information, as I mentioned, using Coccinelle's Python or OCaml interfaces. So this is the output that the first phase of the constification process produces. It tells us a file in which a structure that we might like to constify occurs, gives the type that we might like to make const everywhere, and gives the number of instances of that type that already have const and that don't have const. In all these cases, there's actually only one structure of the given type which is initialized. And so it's just one or zero: either it's already constified, so it won't be reported, or it's not constified, so we get a non-const count of one. But there are some other structures where there are, say, four instances that have already got the const annotation, and now there's just one that someone added afterwards and forgot to put it, or something like that. And the last piece of information is that, given the current configuration of my kernel, all of the files that define these structures are being compiled. So basically I just have a kernel on the side; I've run make allyesconfig and compiled it, and I get a bunch of .o files. And so Coccinelle is just going to check, for this file, is there a .o file? It can do that using the Python scripting interface, just making a system call. So we see here different examples. In the first case, this is something we might want to work on, because when we compile-test the code, we're actually going to be compiling the file that we changed. This might be something we don't want to work on, immediately at least, because in our current configuration of Linux we're not going to be able to compile-test it. We could of course do some cross compilation, but that will take more time than just going through all of the x86 versions. The last one is something where all of the relevant files are compiled.
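The compiled-file check can be done from a Python script rule. A sketch along these lines: the matching rule, the heuristic, and the message format are mine, and it assumes the tree was already built with make allyesconfig so that .o files sit next to the compiled sources:

```smpl
// Hypothetical rule: record the position of each non-const structure
// initialization of a type we might want to constify.
@r@
identifier s, x;
position p;
@@
struct s x@p = { ... };

// For each match, check whether the file it occurs in was compiled in
// the current configuration, by testing for the corresponding .o file.
@script:python@
p << r.p;
@@
import os.path
src = p[0].file
if src.endswith(".c") and os.path.exists(src[:-2] + ".o"):
    print("%s: compiled; change can be compile-tested" % src)
else:
    print("%s: NOT compiled in the current configuration" % src)
```

The same script hook is where the const/non-const counts and the "already submitted" bookkeeping mentioned below can be accumulated.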
But this is actually — so I've done this constification work over a certain amount of time and built up a collection of patches that I've worked on. And so Coccinelle is telling me, actually, I've already submitted a patch based on this one. Someone has not applied it yet, but I would probably not want to invest in working on that one again. So then I get a big file with all of these different instances, and I can decide which are the most promising ones to work on at a time. Okay, so in conclusion: Coccinelle provides a patch-like language for searching and for doing transformations in an entire code base. It eases detection of project-specific issues. The idea is: you know your code, you know what functions are relevant, you know what things are important, and so you can easily write rules that are specific to your particular problem. Some usage strategies: start simple and refine as needed, and don't hesitate to think about whether there's other information that Coccinelle could be collecting for you that would be helpful in deciding whether a change is correct or not. We compromise on both soundness and completeness. There's nothing guaranteed about any of the results that we give. It also just depends on the quality of the rule that you write: you can write a rule that's very simple and error prone, or something that's very targeted and very reliable. But it still seems to be useful in practice, and this is where you can get more information. So thank you. Are there any questions? Yes? Do you deal with the situation where, for example, the kzalloc is inside of a macro? I guess you can probably detect that if you build the AST. Do you modify the macro itself to remove the kfree? Yeah, so the question is about the C preprocessor and to what extent we use it and interact with it. We pretty much ignore the C preprocessor, for many reasons. One of them is that it's quite hard in general to find the definitions of macros. They're not necessarily in the same file.
We don't want to parse the makefiles to figure out where they're located, and so on. And in general, if we were to do that, then we would have to process a huge amount of code that's very unlikely to be relevant. And also, with macros, we want people to be able to work on the code as they see it. If there's the word false in their code, we want them to work on false and not on zero, something like that. So there are many reasons for ignoring the macros. On the other hand, you highlighted a reason why we should be taking macros into account. It's possible that, in the devm case for example, people might put their kzalloc inside a macro. So we are kind of relying on the kernel developers to not do that. I think there is an option to Coccinelle that will expand all the macros first before doing any processing, but that's not going to give you the result either, because then the code that you get out of the transformation will have all the macros expanded. Coccinelle does look a little bit at macro definitions, to know where there are returns, so it can better understand the control flow. But that's all. It hasn't been much of a problem in practice, but it's the kind of thing that we're missing, so I don't know how much we're missing. So there was another question. Okay, so there's the zero-day build testing service organized by Intel, and they run a certain number of the semantic patches that are in the Linux kernel. Some of the results are reported to the developer immediately, so this runs on every single commit on a whole huge number of trees. I think it's also starting to be run on things submitted to mailing lists, so sometimes the people who receive these patches have no idea where they come from and what they apply to. Some of them come to me: if a rule is somehow considered to be less reliable, the report comes to me, I look at it, and then I forward it on to the person who's concerned. Yeah.
But it uses the ones that are incorporated into the Linux kernel, so that's sort of a reduced number. That's why I hope for more contributions, because there's much more that we could be doing. Other questions? Yes? Is there any way of excluding reports that you've already inspected, so that the same false positive isn't reported again? So there's nothing specific for doing that inside Coccinelle, because your code evolves over time, the line numbers change, it's kind of a mess. I mean, you can make a specific rule for doing that, but we don't. There are other tools out there that will help you with this. There's something called [inaudible], and the person who made that asked us to generate reports. With Coccinelle, you can make the changes, or you can print out a report, and that person asked us to print out reports in a specific format so he could track the line numbers and not re-report on the same line number. And I think the zero-day build testing people do something similar as well, so that it only generates new reports. But those are external tools; Coccinelle doesn't do anything in that direction. There was another question? Yes? So it works a little bit on C++, so if you want to do things on the C-like part of your C++ code, you're welcome to try it. We found some bugs in Firefox, for example. So it understands new and delete, or whatever. I don't know anything about C++. If it sees XYZ and then angle brackets and then some words, it treats that as a complete identifier. So, templates. Right, yes, that's the idea. Coccinelle also parses things one code unit at a time, so if it doesn't like the first function, it'll just move on to the next one. So you'll probably still be able to do something if you stay with the C-like part of the code. Any more questions? Yes? I don't know if I want to single out a particular individual. Maybe some people in this room know who I'm talking about.
Apart from one person who is sort of tone deaf to the opinions of others, no, we haven't had that problem. Okay, thank you.