So welcome everyone, thanks for your interest in Coccinelle. This is going to be a talk, but there will also be an opportunity for you to participate. We have a virtual machine running Coccinelle that you can use, and there are going to be some exercises so you can try out the system. The virtual machine takes a little time to start up, so what I suggest is to take a few seconds and click on the link right now, and then it will be ready when we get to the exercises. You should have the link in the chat window.

Okay, so the topic of our work: what we're interested in is the fact that there are many common problems that arise in programming, in particular in programming C code and in programming the Linux kernel. The Linux kernel is a really huge code base, and many different kinds of issues can arise in developing such a huge code base.

One of these issues is that programmers don't always think about how C actually works. So here we have an example. (Oops, sorry. Can everyone see my mouse when I move it around? Kristen? No. I can see it. You can see it? Okay.) So here is an example, something I found when looking at some recently proposed kernel patches. Somebody had written if unlikely and then some expression greater than zero. unlikely is a construct that gives a branch prediction hint: unlikely means the developer expects the expression to be very unlikely to be true, so branch prediction should choose the false branch as the most likely code path. But in this particular case, if one looks carefully, one can see that the argument of unlikely is not actually the entire condition; it's just the difference between two pointers. The difference between two pointers is not true or false, it's just some value, so the unlikely doesn't make any sense at all. This is not going to cause the code to crash, but on the one hand it's not useful for anything, and on the other hand it's misleading to developers who might expect it to be working in the normal way. And this kind of problem is quite hard to find, because what can we grep for? We can grep for unlikely, but there are 19,000 of them in the Linux kernel today, and looking through all 19,000 to find the ones where the parenthesis closes on a subexpression instead of on the full expression is going to be very complicated.

The next issue is the case of APIs. The Linux kernel code base is very big and it's evolving all the time, so it often happens that there are several functions with essentially the same behavior. Often this arises because some developer has had an idea for a better way to do something, so they introduce a new function that does it in this better way, but not all of the old uses get updated. And this problem only gets worse over time, because other developers copy those old uses, so over time we can actually get more uses of the undesirable code. Sometimes the new, clever API doesn't even take off in the way one would hope. So we end up with a code base that uses different functions in different places for doing very similar things, which can be very confusing.
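Since the slide itself isn't reproduced in this transcript, here is a rough C sketch of that first problem; the variable and function names are made up for illustration:

/* As written: unlikely() wraps only the pointer difference, which is just
 * a number, so the branch-prediction hint says nothing about the branch. */
if (unlikely(ring->next - ring->base) > 0)
        resubmit_work(ring);

/* Presumably intended: the hint covers the whole condition. */
if (unlikely(ring->next - ring->base > 0))
        resubmit_work(ring);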
The final issue is that many functions can fail. For example, if you try to allocate memory, there might not be enough memory available, in which case the allocation fails. Some of these failures are extremely rare, so it's hard to find them in testing. Again, it's something that's very difficult to find just by searching through the code base: if you search for an allocation function, you'll just find the lines that contain a use of the allocation function. The relationship we're interested in is that we have an allocation function and then we don't actually test for failure before using the allocated value. That involves multiple lines of code, which are not necessarily next to each other, and so on, which makes it very difficult to find.

So all of these issues, which are very common when working on a code base the size of the Linux kernel, involve the need for pervasive code changes. Here we can look at our examples in more detail. The first one I've mostly explained in sufficient detail already. This is the patch that was performed recently: we are just changing this unlikely, which is on a subexpression of the test, so that it takes into account the entire expression. Here the developer actually did two things at the same time; the change from greater than zero to not-equals is not really the point I want to make. The point is just that unlikely is supposed to take the entire if expression, so that it can indicate a choice between the two branches.

Here's an example of inconsistent API usage. There are two functions for setting up DMA memory, that is, memory that can be accessed directly both by devices and by the CPU. There's one function called pci_map_single, which sets up a region of memory that is usable in this way, and we have another function called dma_map_single, which also sets up this kind of interaction. These functions basically do the same thing; pci_map_single is a wrapper for dma_map_single. And here we have dma_map_single in what seems like a PCI-related file. So it's just kind of chaotic: some parts of the code use one function, some parts use the other, and it's not really clear why. It would be better to use only one function everywhere, so that people don't have to learn about so many different things, and so that one can understand the relationships between different parts of the code.

The next example is allocation and error-checking code. Here we have a call to kzalloc, which is one of the normal ways of allocating memory in the Linux kernel, and then a bunch of lines later we have a dereference of the allocated memory. But there is no error test; there's no NULL test along the way to see if this function has failed. Maybe almost all of the time the function will not fail, but someday one will be in a low-memory situation and then there could be a problem.

So basically we want to consider all of these kinds of issues. We want an approach, a tool. Something that is common to all of these issues is that they're very hard to find. You can see them easily when you're just looking at the example I've shown on the screen, but if you're confronted with 19 million lines of code, then it's much harder to find all of the places where these problems occur. So what we'd like is a tool that will help to automatically find code that contains these kinds of bugs or defects, or what we call collateral evolutions. Collateral evolutions are cases where some function has evolved.
Maybe the function takes some new arguments, or should be used in a new context, and then all of the uses of that function have to evolve in the same way. So we want to find the places in a large code base that require addressing these issues, and then ideally we would like to actually fix the issues. One option is just to report to the user: you should be looking at this code, try to see what should be done here. But a better option is if we can actually automate the fixing, because in general, whenever you let humans touch some code, something can go wrong. And finally, we would like to provide a tool that is accessible to software developers, something that anyone can use (you'll have the chance to use it later in this talk), and something that's easy to relate to the experience of software developers. If you think about Linux kernel developers, in order to work on the Linux kernel you have to have a great deal of knowledge about the kernel itself: all of the invariants that are respected by different functions, how you're supposed to use different functions, in what places and in what ways, and so on. What we would like is a tool that makes it easy for developers to express this kind of knowledge, in terms of how to find code containing problems and how to fix it.

Before we introduce the tool, let's think about the requirements for this kind of automation. One requirement is the ability to abstract over irrelevant information. If we think about the unlikely case, for example, there are some parts of this code that are important with respect to the problem and some parts that are not. Clearly the word unlikely is important, and the fact that the unlikely is nested inside some other expression is also important. On the other hand, the argument of the unlikely is not important: the variable aligned, or the field pointer, or the fact that we're doing a subtraction, none of this matters. The only thing that matters is that we have an unlikely, it has some argument (we don't care what it is), and it's used with greater-than or equals or not-equals or all kinds of other things, as opposed to being used directly as the test expression. So if we want to find some problems and fix them, there are some kinds of code fragments that we want to search for and some kinds of code fragments that we want to abstract over, because we don't care about them. That's the first requirement.

The second requirement is the ability to match scattered code fragments. If you think back to the kzalloc example, we had a kzalloc and then we wanted to search down through the code to see if there was a dereference of the allocated thing. The dereference might come on the next line; it's actually quite common to allocate something and then start using it directly. But it might come 10 lines later, or 100 lines later; we don't really know how far away it's going to be. What we want to express in general is that first we have a kzalloc, its result is assigned to some variable, and then we use that variable somewhere afterwards in the execution. And the third requirement is the ability to transform code fragments.
So for example, in the pci_map_single case, we want to change all the pci_map_single calls to dma_map_single calls, and we'll see that there's a very systematic way to do that. We can make a rule that describes how to turn every call to pci_map_single into a call to dma_map_single. Then in principle we can just run the rule on the code base and get rid of the problem all at once. I say in principle because in practice you have to verify that the code you have generated is actually correct; you might want to be sure that you can compile all of the code you have changed; if you would like to submit this so it can be in the mainline, you have to create your patches, and so on. There's a lot of other work involved in actually getting changes into the Linux kernel, but at least we are going to automate the first part, which is actually making the change.

So these are the kinds of issues that Coccinelle is targeting. What is Coccinelle? Coccinelle provides program matching and transformation for unpreprocessed C code. I would like to highlight this, which is a bit of a technical issue. In C, you know that we have things like macros and ifdefs; you run the C preprocessor, and the code gets expanded and modified in all kinds of strange ways. What Coccinelle offers is that we don't run the preprocessor: the code that the developer sees on the screen in front of them is exactly the same as the code that is going to be processed by the rules. So you can reason about the code exactly in the way you see it, and the code that is generated is going to look exactly like the code you started out with, except in the places that you actually wanted to have changed.

Our goal was to fit with the existing habits of C programmers, and our idea for addressing this was to focus on the idea of a patch. A patch is a way of expressing changes. Basically, a patch is what you see when you run diff, or what you see when you run git log in your version control system. It is basically fragments of code, and developers are certainly familiar with fragments of code. And then, since it describes a transformation that should happen, it indicates that some lines should be removed, with a minus at the beginning of the line, and that some lines should be added, with a plus at the beginning of the line. Since we have ordinary C code with these annotations, it's very easy for a developer to understand what's going on and to relate it to the actual code that is going to be affected by the transformation.

The problem with patches is that they're very specific: to a particular file, to particular lines, to the code after the change, and so on. What Coccinelle offers is what we call the semantic patch language, or SmPL. This language takes the patch notation and abstracts over it a little bit. In particular, we have metavariables for abstracting over subterms. Going back to the requirements I listed before, I said there were some things that are important and some things that are not important. The things that are important you just write in the normal way, like in a normal patch; for the things that are not important, you can indicate some information about them using metavariables. We also have dot dot dot for abstracting over code sequences.
This goes back to the kzalloc and dereference example. If you were going to explain the problem to someone else, perhaps on a piece of scratch paper, or just talking to someone informally, you might say: first we have x = kzalloc(...), and then we have some stuff, so dot dot dot, and then x->y or whatever. So the idea of dot dot dot is very familiar to everyone as a way of just saying "some stuff", although in this language it actually has a very precise meaning, which is that the things before the dots should be executed, and then execution should proceed across the dots to the things that come afterwards. And finally, we have exactly the same patch-like notation, with minus to remove things and plus to add things, to express transformations.

So here's a concrete example in SmPL, the semantic patch language (pronounced "simple"). This is a semantic patch for the unlikely problem. The important aspects of our problem are that we have an unlikely, then we have an expression, and this expression is not directly the test of an if statement; it is part of another comparison. When we have it as part of this other subexpression, we want to enlarge the scope of the unlikely so that we test unlikely of the whole thing. As I noted before, we don't really care what the argument to unlikely is, or what value we're comparing with, and in the semantic patch language you can replace those things by metavariables. The metavariables are declared at the top. In a normal patch, a hunk has the form @@, then some line-number information, then another @@, and then the transformation below it. In a semantic patch we have a similar form: instead of the line numbers we have the metavariable declarations, and below the second @@ we have the transformation specification. This is one rule; you can take this rule and apply it to the entire Linux kernel, and it will perhaps find something. A semantic patch can contain multiple rules if you have different things to do.

Just a little more detail about the metavariables and the transformation. As I mentioned, the metavariables are declared at the beginning, between the @@ signs, in the place that is used for the line numbers in a normal patch. You can say that a metavariable has to match any expression, or any statement, or any type; there's a whole bunch of different kinds of metavariables. An expression is something like 3 + 4, something that returns a value to its context. A statement is something like an if or a break; it doesn't return a value. A type would be something like int or float or whatever other kind of type. You can also mention a specific type from the source program: you can say a metavariable always has to represent an integer, for example, or always has to represent a pointer. And there are some other, more specific metavariable types, such as iterator, declarer name, and typedef. They're not going to be relevant for this talk, but if you look back at these slides for reference, you can find them. So basically there are a bunch of terms in the pattern that you're interested in, and you're not interested in giving the specific details about them, but you do know what kind of term they should be.
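The slide itself isn't reproduced in this transcript, but the rule being described looks roughly like this (the exact comparison operator on the slide may differ; here I use greater-than, as in the motivating example):

@@
expression E, E1;
@@

- unlikely(E) > E1
+ unlikely(E > E1)

Here E and E1 are metavariables that match any expressions, and the minus and plus lines rewrite the comparison so that the whole condition ends up inside the unlikely. This is also the rule that needs the --iso-file empty.iso option mentioned later, so that Coccinelle's default isomorphisms don't rewrite the unlikely away.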
And so then you can declare that as a metavariable, and it will match any occurrence in the code base. Then we have the transformation specification. In the transformation, one option is that you can either remove things or add things. If you want to remove something, then just like in a standard patch, you put a minus in the leftmost column; if you want to add something, you put a plus in the leftmost column. The notation is designed to look like a standard patch, but in the way it's actually interpreted there are some differences. When you put a minus in the leftmost column, for example, it means that all of the tokens on that line are going to get removed, but those tokens, when they match the C code, can match things on different lines. So the whole matching process is not sensitive to newlines and spaces and so on, the way a real patch is. Here, putting the minus on the left-hand side is just an abbreviation for saying: I want to remove all of the tokens that appear on this line.

Besides removing and adding, there's also the possibility of just highlighting things. Sometimes you want to search for something but you don't actually know what you want to do when you find it; maybe you have a general problem, but the way it should be fixed in each case might be different. In that case, instead of saying "I want to remove this" or "I want to add this", you can just put a star, and that will just highlight something. So if you want to find, for example, all the calls to kzalloc in a file or in a code base, you can write a pattern with kzalloc, put a star next to it, and it will produce output that indicates where all of those calls are. And as I mentioned, the spaces and the newlines are irrelevant; Coccinelle is going to keep all of the spacing and newlines as they were in the file. So in the generated file, you might ask for some changes, say for a 3 to be changed to a 15, but the only thing that's going to happen is that the 3 gets changed to a 15. All the comments, spaces and newlines will be preserved.

Okay, so now you have seen what may seem like a very short overview, but these are actually all of the main ideas of Coccinelle, and so it is time to do an exercise. I have several exercises here and I will go over them. Given this somewhat impersonal video format, I'm not going to give a lot of time for the exercises, but I hope you will at least get to do the first one; the second one is a bonus for those who have time. The first exercise is the unlikely one; this time we're looking for not-equals instead of less-than. Basically the idea is just to take this semantic patch, type it in, and I've given you the instructions to run it here. In the virtual machine you should have spatch available and you should have a Linux kernel available, so you should be able to run these commands directly. So the first exercise is to do that.

The second exercise is to make a rewriting. Here we have one kind of function call, dev_set_drvdata with some arguments, and we want to turn it into another kind of function call with some arguments. You can see that the arguments in the generated function call are derived from the arguments in the original one. Here I've suggested how you could start your semantic patch, and then you can just fill in the transformation (a generic sketch of this kind of rewrite is shown below).
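The exercise slide with the actual target function isn't visible in this transcript, so the names below are purely hypothetical; the point is just the general shape of a rule that renames a call and reuses its arguments in a new form:

@@
expression dev, data;
@@

- old_set_data(dev, data)
+ new_set_data(&dev->core, data)

The two expression metavariables capture the original arguments, and the plus line reuses them, rearranged as needed, in the new call.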
And then I've suggested a place where you can run it so you can see what's going on. Coccinelle has a lot of command-line options; these are some of the more interesting ones. To run Coccinelle, the command is called spatch. You have your semantic patch, which ends in .cocci, and you can run spatch --parse-cocci just to be sure that your semantic patch is syntactically correct. You can run a semantic patch on a single file, or you can run it on a complete directory; if you run it on a complete directory, it will work on the individual files one at a time. If you want to understand why your semantic patch didn't work (if you make some mistake, you'll probably just get a parse error), it is sometimes useful to give the --debug option, and that will give some other information about the processing of your semantic patch. And the last issue here, which is not very relevant for this talk: Coccinelle allows you to configure how many header files are actually included in the processing. Including more header files gives more information, for example about type definitions and other function definitions, but it also takes more time.

Normally spatch just generates a patch; it's not actually going to change your file. It generates a patch on the standard output and prints a bunch of other information on the standard error. So when you try it, you might like to redirect the standard output into a file and then study it afterwards. You can also use the --very-quiet option, and then it will eliminate a lot of the information it prints along the way. And if you have a multi-core machine, you can run spatch in parallel and it will treat multiple files at the same time. In the context of this talk, with the virtual machine, each person has only one core, so this is not a relevant option.

So what I suggest now is to take about five or so minutes and try to do the first exercise, and I can answer any questions that you have. If you have any problems with the exercise, just raise your hand or type in the chat or do whatever is convenient.

Julia, it looks like you have a question in the chat. Any questions now? Sure. How do I see the chat? If you click on the chat bubble on the bottom... I found it. Okay, there you go. So somebody asked a very good question: what is the --iso-file parameter, what is it doing, is it an ISO image? It has nothing to do with ISO images. It's a thing that I hoped you would just type in happily without any discussion, but we'll get to see it later; let's just say it that way. It's a thing called isomorphisms. It has to do with rewriting the semantic patch in a certain way, and if we don't have this option, then the unlikely will disappear from the semantic patch. Somebody asked, I think, whether a metavariable is also a function with sub-functions. Jayanka Johnson, if you would, could you unmute and ask your question, because I'm not sure I understand what you mean. Hello? I'm sorry, I can't hear you very well. Okay, that doesn't work very well; maybe you can write a little more information about your question in the chat, because I can't hear you well enough. I'm sorry.
So I have a small problem with the chat, which is that it disappears all the time; I can't see it. Okay, that's fine, I'll read you anything it says. Okay, no problem, I can do that for you.

So I'll try to answer the question. A metavariable is just a little tiny term that stands in for some bigger term. So it's not actually a function; it's just a thing that replaces some other things. We have an example here: unlikely(E1), and so it matches unlikely of any possible E1. But maybe the question was asking whether unlikely here, as a function, could also be a metavariable, and yes, anything you want can be a metavariable. You can actually generalize this rule further: there is a binary operator kind of metavariable, so instead of writing the not-equals explicitly you could declare, say, binary operator b, and then write unlikely(E1) b E2, and then you can maybe find some more issues.

Okay, Julia, we have a few questions in the Q&A box that I can read to you if you'd like. Okay, maybe I can... I see two... oh, okay, things are working better: it seems that if I turn on Q&A, then I get both the chat and the Q&A. Okay, so many people have questions about many things. Why are so many files skipped? What does skipping mean? What does handling mean? What's going on? Those are good questions. I suggested that you run the semantic patch on some directory, and there are some files that are relevant and some files that are not. Whenever Coccinelle, that is spatch, interacts with a file, in order to do something with it, it needs to parse the file; actually it also produces a control-flow graph, and that is a lot of work. If you have three files or ten files, you might not notice that, but in the case of the Linux kernel we have 19 million lines of code, maybe 40,000 files, and it's not very interesting to parse all of them and build control-flow graphs for all of them. So Coccinelle takes your semantic patch, analyzes it a bit, and tries to find the really important words in it; unlikely, for example, is an important word. Then basically what it does is just a grep over your code base to see which files contain unlikely. It is handling the files that contain unlikely, and it's skipping the files that don't. But again, just because there's an unlikely in a file doesn't mean that there's actually anything to do, so you won't necessarily get output for every file it's handling; the ones it's skipping it just ignores completely.

There's another question about whether spatch requires root, because the container has a root prompt. We just put the container together very quickly; spatch absolutely does not require root, so that's not a problem. Somebody asks why so many files are skipped, and that goes back to what I just said: unlikely probably doesn't occur in that many files, so most of them are skipped. And somebody asks: is there a repository of bug patterns that I should run over my own code base? If your code base is the Linux kernel, then inside the Linux kernel there are about 60 semantic patches that you can run.
And they are all set up so that there is a make target in the Linux kernel called make coccicheck, and then you can automatically run all of the semantic patches in the Linux kernel over, for example, the file that you have changed, or a subdirectory of the kernel, or the entire kernel, and so on. But there is also the Coccinelle website; the link is at the end of my talk, or you can just search for Coccinelle, and you'll get a bunch of handbags (it's also a handbag brand), but you can find the right one with appropriate search terms. There is a coccicheck tool, and on the webpage there are some scripts that are maybe of a bit more generic interest. There's also a site called Coccinellery. Yes, thank you, Karthik, there you go. There you can find many semantic patches. They are just all of the semantic patches that were ever written, or were ever somehow useful, for the Linux kernel up to some date; there's no real guarantee of quality, we just wanted to get a bunch of examples out to people. So you can also take all of those examples and run them on your code base. Some of them are very specific to the Linux kernel, and some are much more generic C patterns.

So let me just check; there are more messages. Okay, somebody is asking about C++. We have a --c++ option and it will do something for C++; I don't promise it will do everything. In general Coccinelle has a parser that tries to be friendly: we have a big file, and maybe some functions are difficult to parse and some are not, and when it fails on some functions, it just moves along to the next ones. This is advantageous for C++ support, because maybe some parts of the C++ code can't be parsed, but it will still keep going and try to do things in other parts of the code. We are somewhat interested in better supporting C++, so if you have some particular needs, you can write to the mailing list and we can try to address them.

Somebody has a concern about make coccicheck with MODE=patch, which fails sometimes. It's not supposed to do that; it's supposed to move on, and I think we made an effort to improve that situation. Does it fail completely, or does it fail and move on to the next rule? I see. Okay, I'll try to look into it again. So Kristen, I need your help again: the last comment I saw was from Emil about make coccicheck. Are there any more concerns after that? There's nothing else. Okay, great, I'm all caught up.

Okay, so I hope everyone had a chance to at least work on the first exercise and to get a good idea of how Coccinelle works in practice. Coccinelle is available in lots of Linux distributions, and it's also available for Mac and probably in some way on Windows, so you can just install it in whatever way you're used to installing things, and it will work in exactly the same way as it works for you in the virtual machine. Okay, so now we're going to look at our second example. Julia, we have one more question, sorry. Please go ahead. When it fails, does it fail with an error report, or fail with a logical error? I'm not sure I understand the question.
Coccinelle might fail because your semantic patch has a syntax error in it, in which case it will fail. On the other hand, your semantic patch might process some files and not actually do anything, but that's not considered a failure. Your semantic patch might not do anything on any files, and that's also not considered a failure. So it doesn't fail very much. I don't know if that answers the question. I think we're good there. Thank you, Julia. Okay.

So the next example we're going to look at is inconsistent API usage, going back to the pci_map_single and dma_map_single example. This is the definition of pci_map_single. If you look at it, it takes a bunch of arguments: a hwdev pointer, a pointer, a size and a direction. And if we look at the dma_map_single call inside it, then ignoring the first argument, we have the pointer and the size and then basically the direction; it's cast to something else, but that doesn't actually change the value. So the two functions take the same number of arguments, and almost the same arguments. The only difference is in the first argument: the wrapper is being careful about NULL, and if the value is not NULL, it accesses one of its fields instead of passing in the whole value. So the difference between the functions is very, very small. If we forget about the possibility that hwdev might be NULL, we can see that maybe there's actually no real point to having this wrapper function.

So that's what we want to do. We want to take code that looks like this, pci_map_single with some arbitrary arguments, and turn it into a call to dma_map_single. The main change we need to make is to take whatever the first argument is and wrap an ampersand and an ->dev around it. And then there's this interesting bit at the end: we have these constants. pci_map_single often takes constants with names like PCI_DMA_TODEVICE, PCI_DMA_FROMDEVICE and so on, and dma_map_single has another set of constants. What's kind of funny is that the ones for DMA have the underscore in a different place, so it can even be hard to remember: are we using the constants without the extra underscore or with it? It's easy to see that we could somehow do better by just getting rid of all of these variations.

So, I'm sorry about the black box on the slide. Basically there are a few things we need to do: we need to change the function name, we need to add a field access on the first argument, and we need to rename the fourth argument. To see how we can create our semantic patch, one general strategy is to take an existing patch and try to make it more general, so it will apply more widely across the code base. This is a change that a real person made at some time in the past. What's kind of amusing about this change is that the person changed the function name properly and changed the first argument properly, but they forgot about the last part, and so they left the wrong constant here. Of course, this is only a problem from the point of view of reading the code; the actual constants have exactly the same values, but as I mentioned, they are not spelled the same way, because there are some extra underscores in the DMA case. So basically we want to write a rule so that we can make this change everywhere in the code base. We can start like this.
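The slide isn't reproduced here, but the starting point being described, the original patch hunk with an empty metavariable header pasted on top, looks roughly like this; the variable names are illustrative, since the transcript only mentions adapter->pdev:

@@
@@

- buf = pci_map_single(adapter->pdev, skb->data, len,
-                      PCI_DMA_FROMDEVICE);
+ buf = dma_map_single(&adapter->pdev->dev, skb->data, len,
+                      PCI_DMA_FROMDEVICE);

Note that, as in the original human-written patch, the fourth argument is still left as the PCI constant here.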
Basically the idea of what I have done here is to turn it into an outline of a semantic patch: I have just put the @@ part at the beginning and then copied the patch as it is. This is actually a valid semantic patch; you could run spatch with this file across your code base. Probably not very much would happen, though, because it's over-specific: it has some variable names that are not very important for the rule and that we want to make more general.

So the first step is just to drop the code that we don't care about at all. Actually, we only care about calls to this function and its arguments; we don't care about the part over here on the left-hand side. This pci_map_single can be called on the right-hand side of some assignment, it can be called in an if, in a return, in many different contexts. We don't really care about the context; we just care about changing the function name and adjusting the arguments accordingly. So the first step is to drop the assignment part, and now we're just focused on the function call.

The next step is to look around, like we did with unlikely, to see which expressions are important and which are not. You can imagine, for example, that this expression here is not important: it's the same in the old code and the new code, it's not changing, it's referring to some very specific variables, and so on. So we can certainly make a metavariable here, and we can make another metavariable here. In this case it looks like we could also just make a metavariable here, but that's because of the error that the person made, so we won't do that yet. And then there's another place where we can make a metavariable: even though this is in the transformed code, adapter->pdev is the first argument originally, and adapter->pdev is used in exactly the same way, just nested inside some other code, so we can also have a metavariable in this case.

And so we get this. We have three metavariables, e1, e2, e3, each one some expression. Our change is going to be: we replace pci_map_single by dma_map_single; we adjust the first argument so that it has the ampersand and the ->dev around it; we leave the second and third arguments the same; and then we actually need to do something about the fourth argument. We don't want to just leave it as PCI_DMA_FROMDEVICE, so we remove PCI_DMA_FROMDEVICE and we add DMA_FROM_DEVICE, as it's supposed to be. So now we have a rule that is quite generic: it's not specific, for example, to what the first argument actually is, it's not specific to the usage context, and it can apply to any call to pci_map_single. So we can take this rule, run spatch with it on the entire code base, and we find that it makes 17 changes. 17 changes is pretty good; that's a lot of work you don't have to do. But unfortunately we also find that 43 other calls remain, so we didn't get very far. Actually, a little bit of thought shows that this is not surprising at all: since we have these specific constants, we need a special rule for each of these constants. Here we're illustrating PCI_DMA_FROMDEVICE.
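Reconstructed from the description, the generalized rule for the PCI_DMA_FROMDEVICE case looks roughly like this (the slide splits the minus and plus more finely, but the effect is the same):

@@
expression e1, e2, e3;
@@

- pci_map_single(e1, e2, e3, PCI_DMA_FROMDEVICE)
+ dma_map_single(&e1->dev, e2, e3, DMA_FROM_DEVICE)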
We'll need another rule for PCI_DMA_TODEVICE, PCI_DMA_BIDIRECTIONAL, and PCI_DMA_NONE, or whatever it was called. So you can just take the one rule that we developed, where we had to do some thought, and make four copies of it, saying: replace this constant by this constant, that constant by that constant, and so on. If you think about ordinary programming, you might feel a bit uneasy: we don't want to copy code and make little adjustments to it. In general that's not very good practice, because you'd like to maintain your code over time, you'd like to read your code over time, and if you just copy-paste, you're not really sure what's the same and what's different. Maybe that applies to ordinary programming, but in the case of a Coccinelle semantic patch it may not be such a serious consideration; it depends on what your goals are. We have a finite number of calls to pci_map_single, and once you have run your rules over the code, and at least once you have gone through all the subsequent work to submit your changes to the Linux kernel, then you're never going to want this rule again. The idea is that rules should be very easy to write, and many of them you might just like to throw away immediately: you use them, they give you some information, and you move on. So in some sense this copy-paste, this code duplication, is maybe not all that big a problem. But we will look at some other ways to write the rule that have different properties.

So this is another option. If we want to avoid the code duplication, we have a way of giving a bunch of different transformation options; we use something called a disjunction. The idea is to be like a regular expression: we have one option, another option, a bunch of different options, and they are separated by these vertical bars, and then we use parentheses to indicate the beginning and the end of the set of options. The very important thing is that, just as minus and plus must always appear in the leftmost column, the markers of a disjunction must also appear in the leftmost column. That makes for a bit of a problem, because parentheses are probably the most common symbol in any programming language, and we are using parentheses. So you have to be very careful with your parentheses: parentheses that are part of your code need to be indented, at least a little bit. Open parentheses after function calls naturally come right after the function name, so that's not something you have to think about, but you can see here that the closing parenthesis is indented a little bit, because this closing parenthesis matches up with the call's opening parenthesis higher up. The ones here in column zero are the parentheses of the disjunction. Basically, the idea is that we write whatever the common part is, and then for the fourth argument, where there are a bunch of different options, we use a disjunction to enumerate the different changes. So this gives us just one rule, and it takes care of all four situations. But then you might feel that maybe this is not really an ideal rule, because the part that was very important is now kind of hidden: we have this huge disjunction with all these different options, and it might feel a bit hard to understand.
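For reference, a single rule using such a disjunction might look roughly like this; this is again a reconstruction, with the disjunction parentheses and bars in column zero and the call's closing parenthesis slightly indented, as described:

@@
expression e1, e2, e3;
@@

- pci_map_single(e1, e2, e3,
+ dma_map_single(&e1->dev, e2, e3,
(
- PCI_DMA_FROMDEVICE
+ DMA_FROM_DEVICE
|
- PCI_DMA_TODEVICE
+ DMA_TO_DEVICE
|
- PCI_DMA_BIDIRECTIONAL
+ DMA_BIDIRECTIONAL
|
- PCI_DMA_NONE
+ DMA_NONE
)
  )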
So we can have another option, which is to separate the two changes into separate rules. Here we have a single file, and it has two semantic patch rules in it. The first one changes pci_map_single: it does all the common part, basically, changing pci_map_single to dma_map_single, and it changes the first argument to wrap the ampersand and ->dev around it, and here I just leave the other arguments the way they are. And then, once we have changed all our pci_map_singles to dma_map_singles, we can work on the fourth argument. So here we are just matching dma_map_single again, and then we focus on the fourth argument and change it from the PCI_DMA constant to the corresponding DMA constant. Basically, when you have multiple rules, Coccinelle takes each of your files, and for a given file it applies the first rule to every function in the file; then it moves on to the second rule and applies the second rule to every function in the file, taking into account whatever changes were made by the first rule. So this second rule is going to modify all calls to dma_map_single where we still have a PCI constant in the fourth argument: it will modify the ones that we created with the first rule, and if the code randomly has some other dma_map_single calls that are using the wrong fourth argument, it will modify those as well. So it's a somewhat different functionality from the one I proposed before. We could actually have a fifth attempt, which I don't show, where the first rule changes pci_map_single to dma_map_single and adds the field reference, and then another rule, which is only the second part, just gets rid of all of the PCI_DMA constants in the entire system. But that might be doing more than you actually want at a particular time, so it seems somehow safer to just focus on a particular call and its particular arguments.

Julia, we do have one question in the Q&A box. Would you like to...? I will try to see it. Okay. While I'm trying to figure out the Q&A box, you can start working on exercise three. I can read it out to you, Julia. Yeah, I think you'll have to. "Don't the changes need to check that unneeded headers are replaced by any new ones?" Okay, the question is about header files. I think the concern is that pci_map_single might have its prototype in one header file and dma_map_single might have its prototype in another header file, and the code that we generate would then not compile, because we have changed the function name. I have no recollection of whether that's an issue here or not. But with Coccinelle you can also specify changes on includes. So maybe there's a header file pci.h, and maybe you decide you don't want that one anymore: you can remove it and replace it by dma.h. Maybe you want to keep pci.h but you want to just add dma.h. Maybe you want to check whether dma.h already exists, and add it only if it doesn't. It's possible to do that: you can have one rule which depends on the success or failure of another rule, and there's a notation for that, roughly sketched below.
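A rough sketch of the idiom being described; the header names here are just the illustrative ones from the talk, the first rule only records whether the include is already present, and the second rule adds it when it is not:

@inc@
@@

#include <linux/dma.h>

@depends on !inc@
@@

  #include <linux/pci.h>
+ #include <linux/dma.h>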
Basically, our rules have these @@ signs at the beginning; we'll see a little bit of this, and you can see more in the examples and, I think, in the documentation. In between the @ signs there is some information you can put to describe, in general, how the rule should be applied. One thing you can do is give a rule a name. So you can have a rule that matches the presence of, say, #include <linux/dma.h>, and you can have another rule that says it depends on the failure of the first rule: if the first rule was called inc, you can say depends on exclamation point inc, not inc. That second rule will only apply if the include you wanted was not in the file; remember, we work on one file at a time. And in that way you can add the header file if it's needed. I know that in the Linux kernel, for example, there are some conventions about the order in which header files are included, and other software might have other conventions, and Coccinelle unfortunately doesn't really know about that; it can't really help you with it. In general, in Coccinelle, when you want to add something you have to attach it to something that exists. So if you think about wanting your header files in alphabetical order, or in reverse Christmas tree order, or something like that: Coccinelle doesn't know that. You have to specify some place where you want to put the new thing, and you can put your new thing either before that thing or after that thing. So particularly in the case of header files, if you want the includes organized in a particular way, you may need to go and modify the code afterwards.

So I think in the interest of time I'm going to go on. You can continue to try the exercises while I'm talking, or try them on your own installation of Coccinelle afterwards. Julia, there seemed to be one thing in the chat, not a question but more like a comment; I was wondering if you want to answer that: make coccicheck on Linux 5.8-rc3 seems to be broken somehow, people are seeing errors. Okay, for questions about make coccicheck I would suggest that people contact me afterwards. I really appreciate the feedback about it, but it's hard for me to answer questions about it in real time.

Okay, so the third aspect of what I called SmPL is dots. This addresses the following case: in our previous examples we were just looking for a single expression that we wanted to transform in some way, but sometimes you want to search for multiple fragments of code that are somehow connected to each other. Basically, the idea is that you have some code where some part of it is executed, then the execution continues, and then some other part of it is executed, and you want to match those separate code fragments. So what we would like to do is specify patterns consisting of fragments of code that are connected by execution paths. And we will also see at the end how to specify constraints on those execution paths: first we execute a certain piece of code, and then maybe you want to say that certain other things should not happen until we come to some other thing of interest. Our example is again going to be the kzalloc example. kzalloc is the Linux kernel way of allocating memory, and it returns NULL if there is not enough memory.
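Since the slides aren't visible in this transcript, the two situations described next look roughly like this in C (the names are illustrative):

/* Good: the result of kzalloc() is tested before it is used. */
state = kzalloc(sizeof(*state), GFP_KERNEL);
if (!state)
        return -ENOMEM;

/* Bad: the result is dereferenced with no NULL test in between, so an
 * allocation failure turns into a NULL pointer dereference. */
dd = kzalloc(sizeof(*dd), GFP_KERNEL);
dd->count = 0;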
So basically it's a rare occurrence, but sometimes it can happen, and in general the policy of the Linux kernel has been to check for these kinds of issues. Here is some good code: we say check_state = kzalloc(...) with a bunch of arguments, and then immediately afterwards we say: if check_state is NULL, then we just return, and the problem will be dealt with in some way. And here is some bad code: we have ddip = kzalloc(...), again with some arguments, and then we immediately dereference ddip; so if the result of this kzalloc is NULL, then here we will have a dereference of NULL, and that will be a problem.

Like the other situations we've been looking at, this is not very easy to find. It's sort of possible to find these two cases with grep: you can grep for kzalloc and then look at the next line, and in some cases at least, the next line will be either the if or some kind of dereference, and then you can find the bad cases and the good cases. But it's not very reliable: sometimes the test could be many lines later, and sometimes, as we saw in the earlier example, the actual first reference to the allocated thing could also be much later. So just looking at a fixed number of lines is not going to be good enough. What we want to find is the case where we have an execution of a kzalloc, then execution moves forward, we never come across a NULL test, but eventually we do come across a dereference, and so this is going to be a problem. So this is what dots are for.

So now we're going to see how to make a semantic patch to find this problem. Again, we can start with a typical example of the code; I took the more complex case, and in general that's the best one. The first thing we do is highlight the code that we are interested in, or the line of code that we're interested in. Here we are interested in this dereference, so what I've done is put a star on the left-hand side. This is saying that we don't actually know what change we want to make. The problem with adding error-handling code is that it's obvious what kind of error you want to check for, but what you should do in each case is completely dependent on the file: usually when something fails, you're going to want to do some kind of cleanup based on what has been done in the function before, and that's completely specific to each particular usage context. But what we can do is find the problem, and then a person can go look at the code and try to figure out what to do about it. So we have a star here to indicate that this is what we're interested in.

Then we do the same steps that we have done before. The first one is removing the unimportant stuff: we just remove all of these statements in between the kzalloc and the dereference, because that's not the part we're interested in, and here I've put dot dot dot. So we have two separate statements, but we want to say that they should be connected in the flow of execution: first we're going to execute the kzalloc, and then sometime later in the execution we're going to be doing the dereference. That's what the dot dot dot means. Then we move on and we can abstract over the irrelevant subterms. So actually we don't care at all about the arguments to kzalloc, and we don't care about this si_data variable.
We don't care about its specific name, but we do care that this si_data is the same as the si_data that is dereferenced down here; this should be the same as that, and we don't care about what the field is. And I should also have gotten rid of this eth_broadcast_addr call; that's completely irrelevant and should have been dropped in the previous step. So we end up with this. The abstract version is: we have some arbitrary expression and we have some arbitrary identifier; the expression is assigned the result of the kzalloc, then some execution happens, and then afterwards we have a dereference. The dereference is of the same expression that was assigned the result of the kzalloc, then we have the arrow, and then we have the f; so we don't care which field we are actually accessing. And so now we have a complete semantic patch, and we can try it out on our examples.

So we try it on this example, and it works. Okay, so this dot dot dot here means that we have this statement, then we have an execution that covers zero or more other statements, and then we have this dereference. Actually, in this case, the first line is going to match the kzalloc pattern, then we have zero statements in between, and then we come to something that has a dereference. The star that we have here causes Coccinelle to produce something that looks like a patch: it looks like a suggestion to remove this line, but when your semantic patch contains stars, the minus is just highlighting for you that this is one of the lines you wanted to see. It looks like a patch, but you should not apply it as a patch.

Unfortunately, this rule matches lots of other things; in particular, it matches our good example, because you can see that on the first line we have a call to kzalloc, then we have some stuff that happens, and then we have a dereference. We didn't say anything about NULL checks in our rule, so Coccinelle has no idea that NULL checks are something we're interested in; our rule just says look for kzallocs and look for dereferences, and that finds this one here. So our rule works, but it's somehow not really doing exactly what we wanted it to do. The advantage is that you can make a rule, and then you can test it and see what it does. It does some good things, it does some bad things, and if it does a lot of bad things, then you can think about the rule (since you wrote the rule yourself, you know what it does) and adjust it in some way, so that it ideally gets rid of the bad results and keeps all the good results.

So basically our problem here is that we find a kzalloc, and then with the dot dot dot we're going to look forward in the execution. When we're looking forward in the execution, we might find a NULL test, and if we find a NULL test then we don't need to look any more, because if you find a NULL test then everything is okay; or we might find a dereference, and if a dereference is the first thing that we find, then we want to report it, because that's going to be a problem. So basically we have several different options, and we've seen already how to do that: we want a disjunction, the thing with the bars in column zero and the parentheses around the different options. So here we have our pattern, E = kzalloc(...), which is just like it was before.
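Reconstructed from the description (the slide itself isn't shown in this transcript), the rule being walked through next looks roughly like this:

@@
expression E;
identifier f;
@@

E = kzalloc(...)
...
(
E == NULL
|
E != NULL
|
* E->f
)

Only the dereference branch carries a star, so only that case shows up in the output.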
We have our dot dot dot: we want to move forward through the possible executions of the code. And as we do this searching, sometimes we might find a NULL test, sometimes we might find a not-NULL test, and sometimes we might find a dereference. If we find a dereference, then we want to complain about it, so we keep the star on that case. In the other cases, if we find for example a kzalloc and a NULL test, we will actually get a match, but there are no stars on the matched code, so we don't actually report anything. The disjunction is tried from top to bottom, so if it matches the first option it will stop; if it matches the second it will stop; otherwise it will match the third.

You might also notice that here I've written E == NULL, whereas in our actual code fragment there is an exclamation point and then the variable. Unfortunately, in C there are many ways to write NULL tests. I will talk about this a little more later if time is available; this is the isomorphism thing that we discussed at the beginning. Isomorphisms are ways of rewriting the semantic patch so that it will match different things. So actually, when you write E == NULL, unless you use that --iso-file empty.iso option that we had at the beginning, if you just use Coccinelle in the normal way, it will rewrite the semantic patch so that it also considers NULL == E and !E.

Okay, so now we can take this rule and apply it to our test cases. We apply it to our first test case: we find the kzalloc, we search forward through the possible executions, we don't find any NULL test, and we find a dereference. So we stop, but there was a star on that dereference, so it's going to show up in the results, indicated with a minus. On the other hand, if we apply it to this code, we first match the kzalloc, then we search forward along the execution paths and immediately run into a NULL test, so it's just going to stop. Even though there's a dereference afterwards, it's never going to find it; it only matches as much as is shown in the pattern.

But then we can still find some problems. If we look at this example, it looks almost exactly like our first example: we have a kzalloc, and then, moving forward along the execution path, we have a dereference, so the rule is going to complain about that. But if you study the argument list, and if you know something about how kzalloc works, you can see that one of the arguments is __GFP_NOFAIL, and that means the call can never fail, so there's no point in having a NULL test if NULL can never be returned as the value. So this is a case where you might make a choice. We have our semantic patch, it finds some things, and some of them contain this __GFP_NOFAIL, but it's pretty easy to see which ones those are, so you could just be satisfied: if it's good enough, you can ignore the bad results and act on the good results as you want. Or you can extend the rule to get rid of this problem. Actually there are quite a number of these __GFP_NOFAIL cases, so it could be useful to extend the rule to handle them specifically. So basically our situation is that we have kzalloc calls and, again, two choices: one choice is that there is a __GFP_NOFAIL in the argument list, and if there's a __GFP_NOFAIL then there's nothing else to do; the other choice is that there is no __GFP_NOFAIL, and in that case we want to search forward for a NULL test.
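Again as a rough reconstruction of the rule being described next (the slide may differ in its exact details):

@@
expression E, x;
identifier f;
@@

(
E = kzalloc(..., x | __GFP_NOFAIL)
|
E = kzalloc(...)
...
(
E == NULL
|
E != NULL
|
* E->f
)
)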
So I've made this in a bit of a specific way, observing that the __GFP_NOFAIL always appears as x | __GFP_NOFAIL. I've done it that way just for simplicity; there's also a way to say that __GFP_NOFAIL should appear anywhere in the argument. And since we had two choices, we again have a disjunction: in one case we have the __GFP_NOFAIL situation, and in the other case we just start at our kzalloc, move forward, and search for a null test or a dereference in the same way as before. You'll notice there are two closing parentheses here this time: this one matches up with this one, and this one matches up with this one. They're all in column zero because they're all involved in disjunctions.

So we can take this rule and apply it to the Linux kernel. We get 30 results. It takes a few minutes, maybe 10 minutes or so, to look through them. We find that we have 16 bugs and 14 false positives, that is, 14 cases where the rule matches but there's actually no problem. The main reasons for these are what are called aliases: basically, we take the result of the allocation, store it in some other variable, and then test that other variable for being null. The rule doesn't detect that situation, because it only looks for tests of the same variable that received the result of kzalloc. The other situation is alternate tests: the test is not done against NULL but is done using some specialized function. So you could say, okay, the false positive rate is forty-something percent, which might be considered fairly high, but on the other hand we could evaluate the results fairly quickly, so it's not really a problem. Or you could say, oh, I found a lot of problems with aliases, so maybe I'll try to do something about that case.

So we have one more variant, for the alias case. An alias, in general, is when you say x = kzalloc(...) and then you say y = x; that is, you rename the variable that received the result of kzalloc. And that's something that can happen somewhere between our kzalloc call and our dereference, which are the two points we're interested in: we have some execution path between them, and along the way something undesirable can happen. So one more extension compared to the C language is this "when" construct: where you have dot dot dot, you can say "when" and describe things that you would not like to happen along the path. Here, we don't want the creation of any aliases; and then, kind of as a sanity check, you might also want to flip things around and say that you don't want to redefine whatever variable is holding the result of kzalloc, because if it's redefined to be something else, then it might be quite all right to have a dereference. You can have as many of these "when" lines as you want; they can be expressions, they can be statements, they can be anything you want, but they have to be complete expressions or complete statements or complete whatever. And if we make this change, in this case we remove four of our false positives and keep all of our actual bugs, so it was a positive improvement.

Okay, so I think we are actually almost out of time, right? Yep, we just have about five minutes left, Julia. If you want to finish and then Shua can jump in with kind of the wrap up, that'd be great. Okay, so I'll just skip very quickly through the rest and then we can do the wrap up after that.
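For reference, a rough and unverified sketch that combines the __GFP_NOFAIL case and the "when" constraints described above might look something like the following; the exact spelling of the when lines and of the __GFP_NOFAIL pattern are assumptions, not the precise rule from the slides:

```
// ignore allocations that cannot fail; otherwise forbid aliasing and
// redefinition of E along the path before the dereference
@@
expression E, flags, a, e;
identifier f;
@@

(
  E = kzalloc(..., flags | __GFP_NOFAIL);
|
  E = kzalloc(...);
  ... when != a = E
      when != E = e
(
  E == NULL
|
  E != NULL
|
* E->f
)
)
```

The two closing parentheses at the end, both in column zero, close the inner and the outer disjunction respectively, as described in the talk.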
So here's the exercise; you can try these on your own. Basically, we've seen a summary. I've talked a little bit about isomorphisms: if you have x == NULL in your semantic patch, it will deal with all the other kinds of null tests you would expect. Typed metavariables: if you want to constrain your metavariable to have a particular type, like a struct platform_device, then you can do that; any kind of expression can have a particular type. Rule names and ordering: I mentioned this with respect to the include files; you can give a rule a name and then refer to that name from other rules, which can be very useful for sharing information. And then finally, we have positions and Python. What I've shown in all of this talk have been rules that match against the code, but you can also freely intersperse things where you do things with Python scripting, so you can do any kind of computation that you want: you can count, for example, how many occurrences there are of kmalloc in your code. An interesting thing you can do is note the position of certain things. Here I'm saying that position is a certain kind of metavariable, that the kmalloc is at a particular position, and then we can print out a message in Python saying that it's in a particular file and at a particular line (a rough sketch of this appears a bit further below).

So in conclusion, Coccinelle provides a declarative language for program matching and transformation. You don't have to know anything about abstract syntax trees, parsing, control flow graphs, or type inference; all of that power is integrated into the tool. Coccinelle semantic patches look like patches, so hopefully they fit with Linux developers' habits and hopefully the language is easy to learn; it seems like many kernel developers have tried to use it in some way. And we have actually made a bunch of other tools that build on Coccinelle in some way: inferring semantic patches from examples, using semantic patches to search in Git histories, and another tool for identifying bug-fixing patches that uses Coccinelle internally. And here is the website of the project: you can find a lot of documentation, you can find downloads, and you can also find the latest version of the source code on GitHub; just search for Coccinelle GitHub. And we have a mailing list. We'd be very happy to hear from you. So thank you very much.

Thanks, Julia. Shua, would you like to jump in and just kind of share a few of the resources? Yes. So thanks again for joining us today for this session. We hope it will be helpful for your continued learning. The reason we are hosting this series is so that you have self-study resources to learn more about open source, the Coccinelle tools, and so on. We also have other ways you can learn through these programs. The LF mentorship program is designed to help new developers gain the necessary skills and resources to experiment, learn, and contribute effectively to open source communities, so you can take advantage of that. We also have Outreachy. Outreachy is a community-driven internship program: it offers remote internships, just like the LFX remote opportunities, and supports diversity in open source and free software. At the Linux Foundation we also have training opportunities: Linux Foundation Training offers a wide range of free courses. Please take advantage of those as well.
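To make the positions-and-Python idea from the summary above a bit more concrete, here is a minimal sketch, assuming a rule named r and kmalloc as the function of interest (the rule name and the message format are illustrative, not taken from the slides):

```
// bind the position of each kmalloc call, then report it from Python
@r@
position p;
@@

kmalloc@p(...)

@script:python@
p << r.p;
@@

print("kmalloc call at %s:%s" % (p[0].file, p[0].line))
```

The position metavariable p records where each match occurred, and the script:python rule receives that binding and can do any computation it likes with it.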
You can go to these links and find out more about these programs. Then lastly, Linux Foundation events provide educational content across a wide range of skill levels; people can come to the conferences, collaborate, and learn more about the various topics that are covered at these events. So it's a great place to meet others in the community, to collaborate, exchange ideas, expand job opportunities, and more. You can find out all about events at events.linuxfoundation.org. And thanks again for joining us. Please reach out if you have any questions. Thanks everyone. Have a great day. Thank you. Bye.