I warned people that I might start a little early, and I'm going to hold true to my word, because I've got a lot I want to say, so let me get through some of the basics already. This is the third piece of presentation I'm doing here at the conference, on a whole bunch of container-and-isolation-type topics. I did cgroups yesterday, and there were at least a few people I saw yesterday who are here today, so that's good. I'm doing seccomp now, and on Wednesday I'm going to talk about user namespaces. So if you want to find out about the bits and pieces that are used to build sandboxes and containers, I'm going to cover most of them. Just a little bit about myself: I'm the maintainer of the Linux man-pages project, I wrote a book, and I do training courses on stuff like this. So, seccomp. What's the fundamental idea here? The kernel provides, I don't know, 400 or so different system calls. The way most of us think about system calls is that they're a way of asking the kernel to get something done. The way an attacker thinks of those system calls is: here are 400 different ways I can try to subvert the system. Now, there are 400 different system calls, but most programs only make a small subset of them. It would be quite typical for many common programs, in their lifetime, to make at most, say, 40 different system calls. And the idea here is: suppose we have a program we expect to make a certain set of system calls, but for some reason it makes one of the other system calls. A possible reason is that the program has been subverted and tricked into executing malicious code. So, if it tries to execute a system call that we don't expect, we want to stop it from doing that, because something is wrong. And that's what seccomp is about.
Making sure that if a program tries to do things we don't expect, in terms of system calls, it's prevented from doing so. So what we're doing, effectively, is reducing the attack surface of the kernel. If a program gets subverted, and we expected the program to make, let's say, 40 different system calls, then with seccomp it can still only make those 40 system calls, which limits the possibilities for the attacker to subvert the system. Alright, so seccomp has actually been around for a long time. It was first implemented back in 2005, but in a much more limited form. The way you set up seccomp back then for a particular process was to write a 1 to a certain /proc/PID file. When you did that, the process was in strict seccomp mode, and in strict mode the process was only allowed to make four system calls: read() and write() on already-open file descriptors, _exit() to terminate, and sigreturn(). Now, sigreturn() is a system call you'd never normally make directly from an application; it's used under the covers to implement signal handlers. So with strict mode, the only things you can do are read and write files that are already open, terminate, or catch signals. If the process tries to make any other system call, it gets killed with SIGKILL. And it's dead straight away, of course. The original idea, when strict seccomp mode was implemented, was to create a marketplace in CPU cycles. The kernel developer concerned, Andrea Arcangeli, had this idea that you could sell your CPU time to someone else.
They would provide you with perhaps some byte code, or perhaps some native code, that did some compute-bound task, and you could run it on your system safe in the knowledge that the only things it could do, in terms of interacting with your kernel, were reading and writing on file descriptors you'd given to that program, because one of the system calls that isn't allowed is open(), to open a new file. So it can only read and write files that have already been opened for it, or terminate, or catch signals. Well, this CPU-marketplace idea never really took off, and so for a long time seccomp was around without being much used. But things changed seven years later, in 2012, in Linux 3.5, when seccomp filtering was added. This was such a big change to how seccomp worked that it's sometimes called seccomp 2. The key point now is that you can load a filter program into the kernel, and you can choose which system calls are going to be allowed and disallowed. You no longer have just the strict mode with its fixed set of four permitted system calls. You can say: this is the set of system calls I'm going to allow, and everything else is disallowed. Or you can say: I want to disallow certain system calls and allow everything else. The way you did this was with prctl(), which is one of these horrible multiplexing system calls that does dozens of different things to a process, and one of the things it can do is set a seccomp filter for the process. So now you can choose which system calls are allowed and disallowed. Now, people wanted this sort of feature for a very long time before seccomp 2 was invented. People had been wanting this ability to filter system calls, in some way, for many years.
And various mechanisms were proposed over the years, but they were all rejected by the kernel developers because they were perhaps over-complicated, or difficult to maintain, or somehow unsuitable for being merged into the mainline kernel. And when this feature finally did get merged, people started using it pretty quickly. By now it's being used in a lot of tools, and these are just a few examples: web browsers are using it, the container frameworks are using it, tools like Flatpak and Firejail are using it, and systemd is using it. systemd uses everything, which is a good thing. The work on seccomp is still ongoing; there's interesting new stuff getting added right at the moment. Sorry, I've jumped ahead a little bit. Next, in Linux 3.17, a new system call was added, called seccomp(). This is the more modern way of establishing a seccomp filter, and it provides more options than the prctl() system call. And as I said, more work is ongoing: a lot of features were added in kernel 4.14, and a whole lot of other features are being worked on at the moment that are likely to land not in the next kernel release, but in the one or two releases after that, I would estimate. Okay. So, what's going on here? The fundamental idea is that we can write filter programs that make decisions about system calls, and those filter programs can make decisions based on the system call number and the system call arguments. And when I say arguments, I mean the register values. So the filter program can say: I like this system call number, and I like the values in these registers, or I don't. What I'm trying to get across is that the filter program can look at the registers that contain the arguments.
Some of those arguments might be pointers, and the filter program can't dereference the pointers. Obviously, that would be an interesting thing to do, especially if the pointer pointed to, let's say, a pathname, but filter programs can't do that sort of thing at the moment. There is someone working on adding that sort of functionality: there's a new LSM that someone I know has been working on for two or three years, called Landlock, which is intended, among other things, to bring that sort of functionality to seccomp, but it doesn't exist at this time. So, in order to use seccomp, our user-space program goes through a few steps. The user-space program builds a filter program, a kind of binary blob, which is interpreted by a virtual machine. It installs that filter program into the kernel, which has the virtual-machine implementation. And then the program executes untrusted code; in other words, some arbitrary third-party code that we don't necessarily trust, or perhaps some code that we feel could be compromised. The way that code gets executed is that either we exec a new program, or perhaps we've dynamically loaded a shared library, in other words a plugin, and we're going to execute functions from that plugin. And from this point onwards, from the point where the filter is installed, every system call gets checked: is it a permitted system call or not? Okay. Once the process has installed that filter for itself, the filter can't be removed. This makes sense. A filter is a kind of declaration: we're about to execute some code that we don't necessarily trust, and we don't want that code to be able to remove the filter. So, once a filter is established for a process, it's permanent. Okay. These seccomp filter programs are expressed using the BPF language, Berkeley Packet Filter.
Now, it's quite possible that many of you have heard of BPF already, because it's used with tcpdump, and tcpdump has been around for, I don't know, more than 25 years now. The way BPF is used with tcpdump: tcpdump, of course, is monitoring network traffic, and a notable characteristic of network traffic is that there's a lot of it. Mostly you're interested in a small piece of the conversation going on between two endpoints; you don't want to see everything else that is getting sent across the network link. So what you want tcpdump to do for you is filter the information, so that you see only selected network packets. And that's what tcpdump does. Now, that filtering could theoretically happen in user space. In other words, tcpdump could put the network device into promiscuous mode, get every packet into user space, and inspect each packet. That's possible. The problem with that approach is that the sheer volume of data that would need to be transferred across the kernel-user-space boundary would put a big load on the system. Just transferring that data is expensive, because there's so much of it. So the idea with BPF and tcpdump is that tcpdump can install a BPF program in the kernel, and that BPF program does a check on the network packet header and decides: is this an interesting packet or not? Only if it's interesting does the packet get transferred across the kernel-user-space boundary. And the brilliance of seccomp was to realize that this virtual machine being used for inspecting network packet headers, which of course are just a bunch of bytes, could be generalized to inspecting system call numbers and their arguments, which are also just a bunch of bytes. And that's why, when the idea of seccomp filtering was proposed, getting from the point where it was initially proposed to the point where it was actually merged into the kernel took only about a year, which is pretty good for a major kernel change.
Part of what assisted there was reusing this existing technology. Okay, so BPF defines a virtual machine that's interpreted by the kernel, and this virtual machine has a few characteristics. It's got a very simple instruction set, a small set of instructions, and all the instructions are the same size. This means the kernel can implement the virtual machine easily, and in a way that is efficient and fast. The kernel can also do things like verifying that programs are valid and safe. One of the things about BPF programs: there are jump instructions, but you can only jump forward in a program. So programs are directed acyclic graphs, which means the kernel knows that every BPF program will complete. Because, of course, if you had a BPF program executing inside the kernel that could loop, then you could conduct a denial-of-service attack against the kernel. Well, that's not possible, because you can only go forward in a program. The instruction set is very simple, so the kernel can verify that the opcodes are valid and the arguments are valid. The kernel can even do things like detecting dead code: a piece of the BPF program where there's a jump over the code but no jump into the code, so that the code could never be executed. If you try to load a BPF filter program like that into the kernel, the kernel rejects it, because it's got dead code. The kernel can also verify that every BPF program completes with a return instruction. The return instruction is basically information from the BPF program to the kernel, saying: do we like this system call or not? BPF programs are limited in size: they're capped at 4096 instructions, which seems to be enough for most people's needs. So let's look at this virtual machine a bit more closely. It's got an accumulator register, a 32-bit register, and there's a data area, which is the information the program can operate on.
This data area is information about the system call: for example, the system call number and the register values, the argument values. The instructions are 64 bits in size and are expressed as a C structure, defined in header files. The 64-bit instruction looks like this: it begins with a 16-bit opcode; at the end there's a 32-bit operand that the opcode uses; and for some instructions there are two other byte-sized fields, used by conditional jump instructions as instruction offsets saying how far to jump. Conditional jump instructions in BPF are a little unusual: there are two targets for every jump, a jump-true target and a jump-false target. So you can jump in either of two directions, or either of two distances, I should say, depending on whether the conditional test is true or false. It's a kind of pseudo-assembler-like language. It's a virtual machine, so it's a basic assembler-type language. We've got load instructions and store instructions. As well as the data area that you can operate on, there's some working memory that you can use to store information you've calculated and want to save temporarily. There are jump instructions, and the usual kinds of arithmetic and logic instructions: add, multiply, left shift, XOR, and so on. And there are these return instructions, which say to the kernel: do we like this system call or not, should the system call be allowed to execute? Okay, so we've got conditional and unconditional jump instructions. The conditional jump instructions consist of the usual pieces: an opcode saying what kind of condition we're testing, a value that we're going to test against in the operand, and then a jump-true offset and a jump-false offset.
In terms of the conditional jump instructions, we've got the sort of things you might expect: an equality test, a greater-than test, a greater-than-or-equal test, and a bit-set test, which is what the JSET is. If you look at that list, and that's the complete list, by the way, you might say there seem to be some things missing: where are jump-not-equal, or jump-less-than-or-equal, and so on? Well, those other alternatives are just the false branches of the existing instructions. The false branch of jump-equal is jump-not-equal. Okay, and the targets for these jumps are expressed as relative offsets, a certain number of instructions to jump. Zero means don't jump at all, in other words, execute the very next instruction. Otherwise, you can jump up to 255 instructions forward. Now, if you want to jump further than that, there's another, unconditional, jump instruction, jump-always, where the offset is expressed in the operand, which is 32 bits, which is way more than you need to cover the 4096 instructions possible in a BPF program. Okay, so what the kernel does is this: for every system call, the BPF program gets executed, testing whether the system call is an allowed system call or not, and the kernel provides a read-only buffer of data that describes the system call. There's a header file here that shows us what that data area looks like. This is the data that's being provided by the kernel to the BPF program, and we've got various pieces of information here. First, we've got the system call number: which system call are we executing? Down at the end, we've got the arguments. On Linux, the maximum number of arguments a system call can have is six, so there's space in the data area to allow for up to six arguments. Now, obviously, the number of arguments actually used depends on the system call; some system calls use no arguments.
But of course, you know that when you write your filter program, so you write it to access the right number of arguments. Now, there are a couple of other fields in there as well. An architecture field: this tells us which architecture we're currently executing on. Is it ARM? Is it x86? Is it MIPS, or whatever? I'll come back to why that's important soon. And then one other field, the instruction pointer. This tells us from where in the process's virtual address space the system call was made. This is the real virtual address, in the actual process itself, where the system call was made. When I first came across it, I thought: what's the use case? Why would you want to use that information? And I invented fantastical use cases, like a filter program where, if the system call was made from a certain shared library in a certain range of the address space, we'd want to stop the system call being made. When I finally got to ask the seccomp developers why that field exists, the answer was: because we could. Okay. Now, if you're feeling very 1950s-ish or 1940s-ish, you could code up your BPF programs in binary by hand. But there are certain productivity tools that make the job easier. At the very least, there are some macros defined in header files that make your life easier if you're going to do this by hand. There are some other, better productivity tools that I'll mention briefly later on as well. So there are some macros, for example, in the header files, that are used to construct BPF statements and BPF jump statements. All these macros do is take values and use them to build an initializer for that 64-bit structure that contains an opcode, a jump-true, a jump-false, and an operand.
So, BPF_STMT just takes an opcode and an operand. The operand, for some reason, is called k; there's some history there, I don't know the details. It takes those two values and builds the 64-bit initializer; the jump-true and jump-false fields are zero, because this isn't a jump instruction. And there's another macro, BPF_JUMP, where you give it an opcode, an operand, a jump-true, and a jump-false argument, and again it constructs a 64-bit initializer. So, here's an example. This instruction is a load instruction; that's what the BPF_LD tells us. Now, the first argument is constructed by ORing together various bit fields. These values are just bit masks that are defined in header files. The first argument here is defined using three bit fields ORed together: it's a load instruction; then the BPF_W says what size we're loading, a word, in other words 32 bits; and the last part, BPF_ABS, says where we're loading from, meaning load from the data area, in other words, the area that describes the system call. So we're loading a four-byte word into the accumulator in preparation for doing something. Then the question is: which word are we loading? Well, we're loading the word at offsetof(struct seccomp_data, arch). Now, who's come across the offsetof() macro before? Usually I find relatively few people have seen it. It's a handy little macro: you give it the name of a structure and the name of a field inside that structure, and it gives you back the byte offset of that field. So, going back to that structure definition: offsetof(struct seccomp_data, arch), give me the offset of the arch field, what's it going to tell me? What's the return value from offsetof()? It's going to be 4, because the arch field sits after a four-byte integer. So what we're saying there is: load the architecture field into the accumulator. Alright, another example.
This is a conditional jump instruction. We're saying: do a jump, and the kind of jump we're doing is an equality test. Is the value in the accumulator equal to something? Then the question is: what's "something"? Well, the BPF_K says the something is in the operand of this instruction. And what's in the operand? The value AUDIT_ARCH_X86_64. Now, this is just a magic value in one of the kernel header files that corresponds to the x86-64 architecture. Every architecture has a unique value for the architecture type, as seen by things like the audit subsystem and seccomp. Now, if the value in the accumulator is equal to this particular value, then we jump forward one instruction; in other words, skip the very next instruction in the BPF program. Otherwise, we jump forward zero instructions; in other words, execute the very next instruction. Okay, another example: a return statement. This is information that the seccomp program is giving back to the kernel. It's saying: we're terminating execution of the BPF program now, and we're going to tell the kernel what we think about this system call. BPF_RET: terminate execution of the program, passing back the value in the operand; that's what the BPF_K says. And the value in the operand is SECCOMP_RET_KILL_PROCESS: we don't like this system call, and we don't like it so much that the kernel should kill the process. Okay. So I mentioned this architecture field that appears in the data area, and I just want to say a little bit more about that. The point is that every BPF program, as we'll see when we look at a complete program soon, should check the architecture on which it's running. There are a few different reasons for this. The first thing to realize is that system call numbers are different on different architectures: the system call numbers on ARM are different from the system call numbers on x86.
On x86 alone, the system call numbers on 32-bit x86 are different from the system call numbers on x86-64, and so on. So we are building a BPF program that checks system call numbers, and those system call numbers depend on the architecture. So we'd better make sure we're executing on the architecture we really thought we were executing on, because if we're not, then we're making the wrong assumptions about system call numbers. Now, how could things go wrong like this? Perhaps we constructed a BPF blob on one system, and then, using some sort of configuration-management system, we installed that BPF blob on another system, where a program loaded it and installed that BPF filter for itself. And it happened by accident that the blob was built on one architecture but got loaded on another architecture. That's one possibility. There are other possibilities, though. Modern x86 architectures actually support multiple system call ABIs. On modern x86 we've got the x86-64 ABI, we've got the old i386 ABI, and we've also got the x32 ABI, which has 32-bit longs and pointers but a 64-bit address space. And each one of those three ABIs has, in some cases at least, different numbers for the same system calls. Now, I said that when you install a seccomp filter, it stays permanently installed for the process. But the thing is, a process can exec different programs during its lifetime. It might start off executing a 64-bit x86 program, but then that program execs an i386 binary. Now, that binary is going to use different system call numbers, and those system call numbers won't necessarily be judged correctly when tested by the same seccomp BPF filter program. So it's imperative that the filter program always begins by checking the architecture on which it's running.
Okay, now, once a filter is installed, every system call that the process makes gets tested against the filter. The filter returns a value saying: do we like this system call or not? This is done with one of those BPF_RET instructions, and the return value that the filter gives back to the kernel consists of 32 bits. The topmost 16 bits are some kind of action saying what we generally think about this system call. Is it good, is it bad? There are various choices you can make. And the bottom 16 bits are some kind of data that goes with the action. So this is the information the BPF filter program returns to the kernel, to say: do something in response to our decision about this system call. And what sort of return values can the BPF program give back to the kernel? SECCOMP_RET_ALLOW: yeah, the system call's fine, let the system call proceed and the process continue. Or SECCOMP_RET_KILL_PROCESS: we don't like this system call, we don't like it so much that you should kill the process straight away, and all the threads in the process get killed. The process gets killed as though it was killed with a signal. Now, I say "as though" because there is actually no signal involved; the process is killed outright. But to another process, for instance a parent process, or a ptrace-ing process that was observing this process, it would look like the process was killed with a SIGSYS signal. And the reason that decision was made is that SIGSYS is the traditional signal meaning a process tried to execute an invalid system call. Okay, another possibility: the filter program could return SECCOMP_RET_KILL_THREAD to the kernel. We don't like this system call, kill the thread that made it. If it's a multi-threaded process, the rest of the threads continue, which is kind of odd, I find, but it's possible. Okay, another possibility is SECCOMP_RET_ERRNO.
What this says to the kernel is: don't execute the system call; make it look like the system call has failed. And the errno value that's returned to the user-space program will be whatever value we specified in the bottom 16 bits of the return value, in other words, the SECCOMP_RET_DATA field. Going back there, remember the return value is two pieces: a 16-bit action and 16 bits of data. Well, the data says what errno value should be returned to the user-space program. So from the point of view of the user-space program, it looks like the system call failed, but the program can carry on doing whatever it wants to do in response to the failure. There are a few other actions as well, but I'm not going to try to talk about them; I'll just mention that they exist, and you can read about them in the manual page. Okay. So, in order to install a BPF program, a process uses either the prctl() system call or the seccomp() system call to say: we want to install a BPF filter. And we've got a series of arguments there. For seccomp(), the last argument is a pointer to the BPF program that we want to install. I'll ignore the flags; you can read about them in the manual page. That pointer to the filter program is a pointer to a structure of type struct sock_fprog, which tells us about the networking origins of BPF: a socket filter program. Okay, and what's in that structure? There's a pointer to the filter program, an array of struct sock_filter, again, networking origins, which is our actual BPF program. And up here, the len field is the size of that program. Okay, now this next bit is slightly complicated to explain, but I hope I get it across. If you're going to use seccomp, the process either has to be privileged, or it has to set something called the no_new_privs bit. So if the process is unprivileged, it has to set this bit.
Now, this is a process attribute, again set with the good old multiplexing prctl() system call. And what does this bit mean? It means that if this process now executes a set-UID or a set-GID program, or a program with capabilities, those bits, the set-UID bit, the set-GID bit, the file capabilities, are ignored. Now, why is that? Let's suppose you're an attacker, an attacker who knows about seccomp, and you know that people who write set-UID programs aren't always tidy in what they do. They sometimes make mistakes. Perhaps they don't check the return value from some system call that could fail. And you think: hey, the developer of this set-UID-root program didn't check the return value of that system call, because they assumed the system call would always succeed. But I can use seccomp to make that system call fail. In other words, this is what the attacker hopes: I could use seccomp to make the set-UID-root program do something unexpected. That's the theory, okay? Attackers love that sort of thing. Making a privileged program do something unexpected is the first step in privilege escalation. Now, that's actually not possible, though, because if you're an unprivileged user and you want to use seccomp, you have to set this no_new_privs bit first, and that says that set-UID bits, set-GID bits, and file capabilities no longer have an effect. So it's to avoid that kind of theoretical attack that I mentioned; it's not possible because of this. Okay, and if you try to install a seccomp filter when you're not privileged and you haven't set that bit, then the kernel says: no, sorry. Alright, let's look at a real example. We've got a program here. What it's going to do is, first of all, set the no_new_privs bit using prctl(). Then it's got some code that installs a filter, and then we're going to execute some system call: open().
Okay, and there's a little clue there in the next line: we won't get that far in the code. And that's because the seccomp filter is going to kill the process when it tries to make the open() system call. So let's look at this BPF filter. What have we got for instructions? Okay, there's my install_filter() function. The very first thing it does is define an array of struct sock_filter structures, with an instruction in each of these array elements. The first instruction says: load a word into the accumulator from the data area, and the word that we're loading is the word at the offset of the architecture field. In other words, load the architecture into the accumulator. Then do a conditional jump: is the value in the accumulator equal to the value in the operand, AUDIT_ARCH_X86_64? In other words, we're asking: is the architecture we're executing on x86-64? If it is, we're going to jump forward one instruction; in other words, we're going to skip over the next instruction. If it's not equal, then we're going to jump forward zero instructions, to this return statement that says kill the process. We're not on the architecture we expected, the system call numbers are not going to be what we expect, so let's get out of here, terminate the process. Otherwise, we go further forward. We load into the accumulator a word from the data area, and the word we're loading is the word at the offset of the system call number. In other words, load the system call number into the accumulator. And then we do an equality test: is the word in the accumulator equal to the value in the operand, __NR_open? Now, __NR_open is just a value defined in one of the header files; it's the number of the open() system call on this architecture. So if this is an open() system call, jump forward two instructions, zero, one, two, and land down here.
And that is a return to the control of the kernel, telling the kernel: kill the process. Then, if it's not equal to open, we jump forward zero instructions, and we do another equality test: is the word in the accumulator equal to the openat() system call number? Now openat() is a variation on open(), so we're checking for both kinds of system call that do an open. There are actually a few other system calls that open files as well, but this is my simple example. So if it is equal to the openat() system call number, jump forward one instruction: zero, one. Kill the process. Otherwise, jump forward zero instructions, and we land here: every other system call is allowed. Okay. Here's the rest of my install_filter() function. There's my struct sock_fprog structure. One field of the structure points to my actual filter program; the size of that filter program I put here into len. It's the size of the filter array divided by the size of the first element; in other words, the number of instructions in the filter. Okay. So when I run the program, what happens is that I see this printed out by the shell: "Bad system call". What I don't see is the message from the program, that printf() string saying "you won't see this". And that's because the process tried to call open(), the BPF filter denied it, and the kernel killed the process, making it look like the process terminated with a SIGSYS signal. The shell saw that the process it started was apparently killed by a SIGSYS signal, and printed out the standard text corresponding to SIGSYS: "Bad system call". Okay. I knew I wasn't gonna have enough time; I think I'm supposed to finish now. Is that true? I think it is. Okay, I'm gonna race through. I'm not gonna go through another example. I'll just quickly say something about prctl() and seccomp.
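Pulling the whole walkthrough together, here's a compilable sketch along those lines, built only from the raw Linux headers. One deliberate change from the slide, so the effect can be observed without the process dying: the final return uses SECCOMP_RET_ERRNO (fail the call with EPERM) instead of SECCOMP_RET_KILL. That substitution is mine, not the original example's.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Deny open()/openat() with EPERM; allow everything else. x86-64 only. */
int install_filter(void)
{
    struct sock_filter filter[] = {
        /* Load the architecture and bail out if it isn't x86-64:
         * syscall numbers differ between architectures. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

        /* Load the syscall number and test for open()/openat(). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        /* The slide used SECCOMP_RET_KILL here; RET_ERRNO makes the
         * denied call fail with the chosen errno instead. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    };
    struct sock_fprog prog = {
        .len    = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

With the filter installed, open("/dev/null", O_RDONLY) returns -1 with errno set to EPERM instead of the process being killed with SIGSYS.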
The filters that you install might themselves allow the prctl() and seccomp() system calls. If that's the case, then you can add more filters to the process. So a process might have multiple filters in effect, and all of those filters get executed. That's possible; there are some other details there that I'm not gonna try and get through. If your filters allow fork() or clone() to create child processes, the child inherits the filters. In other words, fork() and clone() can't be used to escape filtering. Same thing with execve(): if the filters allow you to exec a new program, the filters stay in place, so executing a new program isn't a way of escaping a filter either. Now, once you install a BPF filter, it gets executed for every single system call. There is a performance cost, but in these days of Spectre and Meltdown, it matters less and less. So for example, take that filter I just showed you, with its six BPF instructions, and apply it instead to a program that just calls the getppid() system call, which returns the parent process ID; that's, of course, a very cheap system call. When I ran that on an old system, it increased the execution time of the process by about 25%. Now, two things I wanna say about that. It was 25% with the JIT compiler not enabled; there is a JIT compiler for BPF, which makes BPF go quite a lot faster. And it was 25% on a kernel that didn't have the Spectre and Meltdown mitigations installed, because nowadays getppid() takes 300% more time to execute than it used to. So those mitigations reduce the relative overhead from 25% to about 6%, and then the JIT compiler reduces it to 2%. Okay, alrighty.
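The fork()/clone() inheritance point is easy to demonstrate. Below is a sketch of my own construction, not from the talk: a throwaway child installs a filter that denies getpriority() with EPERM and then forks; the grandchild inherits the filter and sees the denial. The architecture check is omitted here for brevity; a real filter should keep it.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Minimal deny-one-syscall filter (no architecture check, so this
 * sketch assumes it runs on the architecture it was built for). */
static int deny_syscall(int nr)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len    = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 1 if a filter installed before fork() still applies in the
 * child; the whole experiment runs in a throwaway child of its own. */
int filter_is_inherited(void)
{
    int status;
    pid_t pid = fork();

    if (pid == 0) {                        /* experiment process */
        pid_t gc;
        if (deny_syscall(__NR_getpriority) != 0)
            _exit(2);
        gc = fork();                       /* filter is inherited here */
        if (gc == 0) {
            errno = 0;
            getpriority(PRIO_PROCESS, 0);  /* should fail with EPERM */
            _exit(errno == EPERM ? 0 : 1);
        }
        waitpid(gc, &status, 0);
        _exit(WIFEXITED(status) ? WEXITSTATUS(status) : 1);
    }
    waitpid(pid, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```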
Just a little word of warning. Seccomp seems like a great thing to some people, and it is a useful thing, but one of the things you're gonna have to do when you create seccomp filters for a process or a program is ask yourself: which system calls does my program actually make? And that's not an easy question to answer. There are various ways you might try; strace-ing your process is one. You can go and talk to Dmitri here about that; Dmitri's gonna talk about strace later on. The thing is, there's no perfect way of answering this question. You hope that you've got it right, because if you've got it wrong and your filter denies a legitimate system call, then your filter is gonna cause your application to crash. Okay, so you can use seccomp to inject bugs into an application. And it's made even more complicated because in a program we normally call wrapper functions; we don't use system call numbers directly. And the behavior of those wrapper functions in the C library changes over time. What I'm trying to say here is that you can't just create a seccomp filter and forget about it. It's gotta be part of your continuous integration testing; it's gotta be unit tested like every other piece of code. Okay, so it's gotta be part of your general testing of your application. There's an article on lwn.net that talks about that: "The inherent fragility of seccomp()". Okay, I just want to very briefly mention that there are some other tools for improving your productivity with seccomp, because the business of hand-coding those instructions gets old really fast. There is a library around called libseccomp. It's a set of APIs where you can say: I want a filter that does this. Give me a rule that filters the open() system call, or perhaps the fork() system call. Now a rule that filters the clone() system call. And so you build up a set of rules.
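That point about C library wrappers changing over time can be seen directly. In this sketch (mine, not the speaker's), the filter denies only the openat() syscall with ENOSYS and allows the legacy open() syscall, yet on a modern glibc the open() wrapper still fails, because glibc nowadays implements open() via the openat syscall. The experiment runs in a throwaway child so the calling process keeps working.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Deny only the openat() *syscall* (with ENOSYS); allow everything
 * else, including the legacy open() syscall. No architecture check,
 * for brevity; a real filter should have one. */
static int deny_openat(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len    = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 1 if blocking the openat syscall broke the open() wrapper,
 * i.e. if the C library implements open() in terms of openat. */
int open_wrapper_uses_openat(void)
{
    int status;
    pid_t pid = fork();

    if (pid == 0) {
        if (deny_openat() != 0)
            _exit(2);
        errno = 0;
        _exit(open("/dev/null", O_RDONLY) == -1 &&
              errno == ENOSYS ? 0 : 1);
    }
    waitpid(pid, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A filter written years ago against __NR_open alone would silently stop matching once the wrapper switched syscalls; that's exactly the fragility being described.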
And then you can say: okay, now construct that filter for me and install it for the process. So what this example here is doing: this context object is the sort of general handle on which everything hangs. We initialize the object, add a rule saying that clone() should fail with EPERM and a rule saying that fork() should fail with ENOTSUP, load the filter into the kernel, and then execute some code which is gonna be filtered. So when we try and call fork() here, fork() is going to fail. Okay, so this is much more comfortable than coding up those instructions yourself by hand. Yeah, and I think I really must be over time now, so I won't say much more than that. Maybe if there's one or two questions; otherwise I'll let the next speaker get on here. Yes? [Question: once a filter is installed, can it be removed?] A filter that is installed cannot be removed. Can it be changed? It cannot be changed. Okay, you can add more filters, but that only makes things more restrictive; a new filter can't change the behavior of an existing filter. Yeah, so if there are multiple filters, all of them must permit a system call. Okay, question? You have to yell at me, I'm sorry; I can't hear from up here, it's actually a very big room. Maybe we can talk offline, because I think the next speaker needs to get set up. Is it you, Dmitri? [Inaudible question about writing seccomp filters portably across architectures.] Yeah, is there a portable way of writing seccomp filters? libseccomp is the best we've got at the moment. It does a lot of the architecture-specific stuff for you. It's not quite perfect, but it's the best we've got. There is an idea that one day seccomp will be able to use eBPF, and for eBPF there is a clang front end that enables you to generate eBPF code, but so far seccomp can't use eBPF. It only uses classic BPF.
[Question: so it's not going to be future-proof against changes in the user-space code?] No, it's not. This is the inherent fragility of seccomp, you know? Those filters have got to be tested like every other piece of code, in a future-proof way. Okay, where is the next speaker?

[Speaker change.] Are there any questions later? There are two microphones, okay. Okay, can you hear me? Yes? Okay, welcome, and thanks everybody for being here. This is a talk on how to write a modern video camera driver for Linux. I would like to use as an example a well-known framework that is going through a removal and deprecation process, to show how the systems we work on have changed since the time that framework was implemented. And hopefully I can give some suggestions and examples, if you have drivers depending on that framework, on how to remove those dependencies and have your driver working in the next kernel releases. So, here are a few of my contacts. My name is Jacopo; this is my email address; this is my IRC contact. I'm an embedded Linux and free software developer, I work as a consultant, and I've been lucky enough to work with the excellent Renesas mainline kernel team over the last two years, which gave me the opportunity to contribute to Linux in a kind of regular way. I would like to thank Renesas, of course, for sponsoring me and supporting this talk and these activities. And that's the talk outline. So we're gonna look at what's happening to soc_camera, which is the framework I've been talking about, and what has changed since the days when soc_camera was implemented. The main difference is how the system boots, because we have moved from a world where systems booted using board files to a firmware-supported boot process, and that changes the way we discover, probe, and create devices.
Power management has changed as well, and that affects how the image capture devices show up in user space. And finally, I would like to give a practical example of a driver that was developed for soc_camera and has been made into a proper V4L2 driver in recent kernel releases. I would also like to introduce a bit of glossary, because the words are sometimes confusing. To capture images we need, of course, an image sensor that produces those images, and a receiving port, which is usually on the SoC. A sensor driver controls an image sensor, and the bridge (or receiver) driver controls the receiving port on the SoC. On a modern system, we still have an image sensor that produces images and an image receiver, but we also have several components on the SoC that take care of image transformation and manipulation. Drivers for image sensors and for those kinds of components are generally called video sub-device drivers. Let's start by discussing what's happening to soc_camera, and let me start by saying that soc_camera was great, in my opinion. Can I ask how many people here have ever worked with soc_camera? Okay, just a few. So if you worked with that framework, you know it was great, because it provided a nice abstraction over the crude V4L2 API, which might be kind of scary if this is the first video driver you have to write. It's kind of scary to deal with all the complexity of the Video4Linux API, all the ioctls you have to take care of, buffer allocation; soc_camera abstracted all those things away in a nice way, and that's why it was adopted in a lot of drivers in mainline. And I don't have statistics for this, but my feeling is that in BSPs and in downstream kernels it was kind of everywhere. In all the BSP kernels I've been working with, if they had a camera driver, it was based on soc_camera. And it was so widely adopted because it had good points. Like, as I've said, it provided a nice abstraction over V4L2.
It's the same framework for writing bridge and sensor drivers, so you learn one framework and you can write both kinds of driver; that was nice. And it also provided an easy way to link a bridge driver to sensor drivers, because the two of them have to be linked in order for the bridge driver to call operations on the sensor one. Of course there is a bad side; since we are removing it, there has to be. soc_camera was developed in a time when systems booted through board files, and its support for OF, the device tree, and nowadays ACPI, which is gaining traction in embedded systems as well, is limited. It isn't absent, I know, but it's limited, and we're gonna see why. It uses a set of deprecated operations, which is a fixable thing, but while the V4L2 API evolved, the soc_camera framework using those APIs has not evolved the same way. So this is fixable, but it hasn't been done so far. And more than everything else, the media controller and subdev APIs that were introduced five or six years ago are actual game changers, because they change how image capture devices show up in user space, and soc_camera hasn't really kept up with that. So what's happening to soc_camera? It's been deprecated for a long time, so you've been advised not to use it for writing drivers, but now it's finally gonna be removed, possibly in the next kernel release; people have been talking about that for a long time. The last soc_camera bridge driver was removed, having been ported to be a proper V4L2 driver last year, so there are no more platforms that depend on the framework. Although there are some sensor drivers, on the order of ten of them probably, that have not been ported yet, and they're possibly gonna be removed. There are discussions these days about moving them to staging or removing them completely; it's possible they're gonna be removed completely. And that's the file organization.
We know that bridge drivers usually live in drivers/media/platform, and the soc_camera ones in drivers/media/platform/soc_camera, and there we have no more dependencies. While in drivers/media/i2c, where the sensor drivers live, we have some drivers that will be removed. Currently in mainline we have a kind of confusing situation, because we have two drivers for the same device, which is kind of confusing, but in the next release this is going away. So there are drivers here that need to be ported over, and that's work to do; if somebody would like to contribute to that, it's a nice thing to do. What has changed, then, since the time when soc_camera was implemented? As we said, the device discovery and linking mechanism has changed: nowadays we do that using notifiers and async matching. Power management has changed as well, due to the way video devices are exposed in user space. And we now have standard frameworks for clocks and regulators, which soc_camera doesn't use; so every time it's possible, we should use the standard frameworks for dealing with those two things. Let's start talking about device probing, and have a look at how device probe was performed in the legacy way. So we have five components here: the board file, the bridge driver, soc_camera, the sensor driver, and of course the Video4Linux 2 framework. The board file is nothing but a plain C file that registers devices and drivers one after the other, all the devices in the system, and at a certain point it will add the platform driver for the bridge, which causes the bridge driver to probe. At the end of its probe function, the bridge driver registers itself with the soc_camera framework. That causes soc_camera to do all its initialization operations, and at a certain point it starts registering I2C devices. How does it do that?
It does that using a V4L2 function, this one, v4l2_i2c_new_subdev_board(), which creates a new I2C device, and that causes the sensor driver to probe. And how are those two identified? Well, the board file knows the I2C bus number and the I2C address of the device, and passes them down the call chain to here, where those two pieces of information are used to identify the sensor. So in the old world, devices are identified by their I2C addresses, and, more important than everything, device probing is sequential. The bridge driver probes before the sensor driver, and that guarantees that every time a sensor driver probes, it has a bridge driver to connect to. In the new world, we have moved to a firmware-based boot process. So nowadays devices are created by parsing a firmware description of the system, and devices are not identified by I2C addresses anymore, but by firmware node references. And again, more important than everything, there is no longer any guarantee on the probing order of the drivers. So this is a DTS, and in the DTS we have a description of the video input port here, and of an I2C bus. On the I2C bus there is a sensor, identified by an address. Linux boots and starts parsing the DTS until it finds the video input port node. That causes the bridge driver to probe, and at a certain point the I2C bus gets parsed, which creates the sensor device; its driver probes and can safely connect to the bridge driver. But we can also have it the other way around: the I2C bus is registered before the video input port, the sensor is probed first, and the sensor driver completes its probe operation but finds no one there to register with. And that might be a problem. It actually is a problem, because device probing is now totally asynchronous: we have no guarantees about the probing order.
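The linking being described is expressed in the device tree with the standard port/endpoint graph. A minimal, hypothetical fragment (the node names, labels, and compatible string are invented for illustration, not taken from the talk):

```dts
/* Sensor on an I2C bus; its endpoint points at the SoC video input. */
&i2c2 {
        camera@3c {
                compatible = "vendor,sensor";
                reg = <0x3c>;

                port {
                        sensor_out: endpoint {
                                remote-endpoint = <&vin_in>;
                        };
                };
        };
};

/* The video input port on the SoC; its endpoint points back. */
&vin0 {
        port {
                vin_in: endpoint {
                        remote-endpoint = <&sensor_out>;
                };
        };
};
```

Each side follows its endpoint's remote-endpoint phandle to learn which node it is connected to, regardless of which driver happens to probe first.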
And again, we need to identify devices by their firmware node references. How do we do that? Well, the V4L2 framework comes to the rescue here, because it has two components that are designed to help drivers do exactly that: v4l2-async and v4l2-fwnode. How do they work, and how do drivers use them? Well, we have a bridge driver again, a DTS, and the two framework components. In the DTS we have a description of the input and output ports of the bridge driver, which has two ports connected to two remote endpoints, which are possibly sensor or sub-device drivers. The bridge driver probes and uses v4l2-fwnode to parse the DTS and collect references to the remote endpoints. Those references are collected in the form of async sub-devices, which is an abstraction provided by this part of the framework, and those two devices are collected by the bridge driver. What does the bridge driver do with them? The bridge driver stores them in what is called a notifier, a v4l2_async_notifier, which is provided by this part of the framework. A notifier is nothing but a collection of the firmware node references that the bridge driver, or a generic driver, is waiting for. v4l2-async maintains a list of all the notifiers registered in the system. Now we have three or four in total, which is kind of unlikely in a real system, but totally possible. And the bridge driver does nothing but register its notifier, with the devices it is waiting for, with v4l2-async. v4l2-async maintains as well a list of waiting devices. These are devices or sub-devices that probed while no one was waiting for them, so they get put in the waiting list. At a certain point in time, the sensor driver eventually probes, and it uses v4l2-fwnode to parse its local endpoint and create an async sub-device representation of itself. It then registers that with v4l2-async, which would add it to the list of waiting devices, except that the two of them get matched.
So there is someone waiting for this sensor. When the two of them get matched, v4l2-async calls a callback on the bridge driver that binds the sub-device to the driver. In this way, the bridge gets a handle, a reference, to the sensor driver. We are waiting for two sensor drivers, and the second sensor, which will eventually probe at some point in the future, does the same thing: it uses v4l2-fwnode to create an async sub-device representation of itself and registers that with v4l2-async; the device gets matched, the sub-device is bound, and the bound callback is called on the bridge driver. So in this way the bridge has references to both sensor drivers it was waiting for. There is another thing I have not shown here: there's not only the bound callback, there is also this thing called the complete callback, which is usually called when all the async sub-devices the notifier is waiting for have been registered. The complete callback is usually what creates all the user space representations, the video device node and the video sub-device nodes, connected to the whole capture infrastructure. There is a discussion going on nowadays about whether this is a good thing: let's say you're waiting for eight cameras and one of them does not probe; do you want your system to be working or not? Should complete be called only when all the sub-devices have probed, or is it sometimes better to have a working system even if one of your sub-devices or cameras fails? There will be discussion about that at the Video4Linux meeting two days from now; let's see what happens there. Of course, what we've shown so far is the situation where the bridge driver probes first, but we wanted to solve a problem, which is the asynchronous probing problem. So the sensor may probe first.
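The matching machinery just described can be modeled in a few lines of plain user-space C. This is only a toy illustration of the algorithm, with all names invented; the real implementation lives in the kernel's v4l2-async. Sub-devices that probe early are parked in a waiting list, and a notifier that arrives later sweeps that list.

```c
#include <string.h>

#define MAX 8

struct toy_notifier {
    const char *wanted[MAX];   /* fwnode names this bridge waits for */
    int nwanted;
    int bound;                 /* how many sub-devices were bound */
};

static struct toy_notifier *notifiers[MAX];
static int nnotifiers;

static const char *waiting[MAX];  /* sub-devices nobody has claimed */
static int nwaiting;

/* Try to bind one sub-device to any registered notifier. */
static int try_match(const char *subdev)
{
    for (int i = 0; i < nnotifiers; i++)
        for (int j = 0; j < notifiers[i]->nwanted; j++)
            if (strcmp(notifiers[i]->wanted[j], subdev) == 0) {
                notifiers[i]->bound++;    /* the "bound" callback */
                return 1;
            }
    return 0;
}

/* A sub-device probed: match it now, or park it in the waiting list. */
void toy_register_subdev(const char *subdev)
{
    if (!try_match(subdev))
        waiting[nwaiting++] = subdev;
}

/* A bridge probed: register its notifier, then sweep the waiting list. */
void toy_register_notifier(struct toy_notifier *n)
{
    notifiers[nnotifiers++] = n;
    for (int i = 0; i < nwaiting; i++)
        if (try_match(waiting[i]))
            waiting[i] = "";    /* claimed; a real list would unlink it */
}

void toy_reset(void) { nnotifiers = 0; nwaiting = 0; }
```

Either order, bridge first or sensors first, ends with both sub-devices bound, which is exactly the property the kernel frameworks provide.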
So the sensor probes, uses v4l2-fwnode to register its async sub-device, and that gets added to the waiting list; nobody claims it, because there are no notifiers waiting for this sub-device. But at a certain point in the future, the bridge driver probes and registers a notifier waiting for this device. The two of them get matched, and the two of them get connected. So we have effectively solved the problem of async probing sequences using those two frameworks, and those of you who know soc_camera know that soc_camera can do that; it actually does use those two frameworks. So why, and what, has changed since then? What has changed since the time soc_camera was implemented is that now sub-devices can have notifiers as well. This was introduced one year ago by Niklas, Sakari, and Laurent, who are the main authors of the v4l2-async and v4l2-fwnode frameworks, to support the Renesas R-Car CSI-2 infrastructure, which has sub-devices that are themselves connected to sensors. So we moved from a situation where we have a receiver which has a notifier and connects to a sub-device, to a situation where a sub-device can have a sub-notifier, and eventually that sub-notifier will be connected to other sub-devices. This can create a chain of arbitrary complexity. It's usually just one or two levels, but there's nothing preventing you from making more complicated things here, and right now I think a couple of mainline drivers use sub-device notifiers: R-Car for sure, but also i.MX is now using them, and it's expected that more devices will use this abstraction as well. Power management: as we said, power management has changed as well, due to the way video devices are now represented in user space, and that depends on the introduction of the media controller and subdev APIs. In the old, non-media-controller world, capture devices work with a single device node abstraction.
That's the one we are all used to, the /dev/video0 abstraction. For a whole capture infrastructure, you just have a single device node in user space, and that causes all operations to be sequential: they go through a single device node and get redirected to the sub-devices. We now live in a world where the media controller is everywhere, and it's going to be everywhere, hopefully, in the next years, and video device nodes are not the only abstraction we have in user space, because we also have video sub-device nodes. And that means operations are not sequential anymore; instead they can be performed on sub-devices and video devices at the same time. So let's see an example of that. This is a legacy system with the simplest possible capture infrastructure: a sensor that sits on an I2C bus and is connected to a receiver port, to which it transmits pixels. In kernel space they are managed by a receiver driver and a sensor driver; that's the framework part, which connects to the kernel frameworks. And in user space we have just a single device node abstraction, so all this infrastructure is represented by a single device node. We have, of course, a V4L2-compliant application, which interfaces with all of that through the V4L2 APIs. Usually the first operation you do if you want to use the video device is to call open() on this video device node, and usually at open time the bridge driver just powers up the sensor. That way, for every other operation, the sensor is ready to send pixels to the receiver. So the V4L2-compliant application starts calling different ioctls, set format, get format, allocate buffers, whatever, and all those operations are translated by the receiver driver into V4L2 subdev operations on the sensor driver.
At a certain point we will receive a stream-on, so the application wants the sensor to stream pixels, and this causes a lot of settings to be sent on the I2C bus, and pixels start flowing in this direction. This is what a modern device might look like; a very simplified modern device, actually. So we still have a sensor, we still have a receiver port, but we also have a lot of components on the SoC that take care of image transformation and manipulation. That can be resizing, conversion between one format and another, writing to system memory; it totally depends on your platform. And of course the drivers that handle that are much different from the legacy ones, and they might be split into different components: you may have one ISP driver handling all three of them and one receiver driver, that depends on your platform. But the important thing is how it looks in user space. We still have the video device node, which the application uses to start streaming and to call a certain set of operations, but we also have all these device nodes here, which are sub-device nodes, where the application can call subdev operations as well. So this is what may happen: we may have video device ioctls and sub-device ioctls issued at any time, and there is no ordering relationship between one and the other, so we might have those two calls at different times. And that makes all of your operations asynchronous, because there is no shared notion of power state anymore along this pipeline. So the one suggestion I have, if you are implementing a driver in this kind of situation, is: always cache your settings, every time, because you never know what power state your driver is working in. You may receive a set-format and your sensor might not be powered at all, because we don't have the single entry point we had when working with non-media-controller systems.
That calls for maintaining a driver-wide notion of power state. At any time, you should know whether your device is powered or not, and even better, make it reference counted, because if you receive two power-ons, which should not happen, but who knows what user space is doing, you should then require two power-offs to actually power off the device. Also, you should cache all your settings and apply them at a time when the sensor or the sub-device is known to be powered, and that's usually stream-on time, because when you receive a stream-on you should start sending pixels, and at that time the sensor has to be powered on. Also, and this is a general suggestion, not just for video devices: try to use runtime PM. Runtime PM provides an abstraction that is more similar to the sequential flow of operations we've seen before, and it eases development and gives you reference counting of power states. Of course it's not always possible, but it's welcome. Clocks, GPIOs, and regulators: as we said, we have frameworks for those now, and they should be used whenever possible. And relating to power management routines in the legacy world: the board file provided the power management routines to the sensor. This is how soc_camera used to do it. There is an soc_camera_link with a power callback, and the board file just filled that pointer with a routine defined in the board file. So when the driver needed to power the sensor up or down, it called this thing here, and the board file had all the references to regulators, resets, whatever it needed to power the sensor on and off. Of course we don't have board files anymore; we have DTS or ACPI. And how would you do that now? Well, you should use the GPIO, clock, and regulator frameworks. You interface with the DTS and collect references from firmware, with the usual calls, devm_ or non-devm_, depending on whether you want devm_clk_get() or regulator_get(), using the names given in the DTS.
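Those two suggestions, a reference-counted power state plus cached settings applied at stream-on, can be sketched as a toy model in plain C. All the names here are invented for illustration; this is not real V4L2 driver code.

```c
/* Toy model of the advice above: refcount power, cache settings, and
 * only write "registers" when the device is known to be powered. */
struct toy_sensor {
    int power_cnt;    /* reference-counted power state */
    int cached_fmt;   /* last format requested by user space */
    int hw_fmt;       /* what was actually written to the "hardware" */
};

/* s_power(1)/s_power(0) analogue: only the 0->1 and 1->0 transitions
 * would really touch regulators and clocks in a real driver. */
void toy_s_power(struct toy_sensor *s, int on)
{
    if (on)
        s->power_cnt++;
    else if (s->power_cnt > 0)
        s->power_cnt--;
}

/* set_fmt analogue: never touch the bus here, just cache the value,
 * because the sensor may be powered off when this call arrives. */
void toy_set_fmt(struct toy_sensor *s, int fmt)
{
    s->cached_fmt = fmt;
}

/* s_stream analogue: the sensor must be powered here, so this is the
 * safe place to apply the cached settings ("I2C writes" happen now). */
int toy_s_stream(struct toy_sensor *s)
{
    if (s->power_cnt == 0)
        return -1;              /* refuse: not powered */
    s->hw_fmt = s->cached_fmt;
    return 0;
}
```

A set-format arriving while the "sensor" is off only updates the cache; the register write happens at stream-on, when power is guaranteed, and two power-ons need two power-offs before the count drops to zero.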
And the driver itself should no longer rely on the board file turning the individual components on and off; it's the driver itself that should enable or disable the regulators or the reset line at power time. Practical example: this is a video driver that was developed using soc_camera and that, I think two releases ago, was ported to be a plain V4L2 driver. I would like to go through the patches, not all of them but some of them, and show you the steps that were performed, as a kind of guideline in case you have drivers depending on soc_camera that you want to port and possibly submit for inclusion. So the first thing we did was simply copy the driver as it was from the soc_camera directory to the V4L2 driver directory. That was a choice that allowed us to see the differences without having any modification in the first commit. Then the next commit actually removed all the driver's dependencies on soc_camera, and that's exactly what we have been talking about: handling clocks and GPIOs in the driver, not relying on the board file for doing that. Register the async sub-device, because soc_camera was doing that for you, and now you should do it explicitly in your driver. Remove the soc_camera-specific deprecated operations: in V4L2 these operations are deprecated, but soc_camera still depends on them and still wants them, so they had to be removed and re-implemented as the proper set-format and get-format operations. And then there are a few changes which are specific to this driver, so they're just here for reference. Of course the build system had to be adjusted, but that's trivial. And then, after a plain V4L2 driver had been made out of the soc_camera-dependent one, the fun began, because people actually started using it, and that's exciting, because patches started coming in.
So people actually started using it, and specifically, patches started adding components to that driver that made a modern V4L2 driver out of what was an old one. The first thing we saw was the addition of media controller support to this driver. That means the driver now has a sub-device node in user space, so you now need to handle nested s_power calls. That's exactly the thing about caching your settings and keeping a notion of power state in your driver. Also, you should not access registers when the sensor is powered down: you can now receive a get-control from user space while your sensor is powered down, and you should pay attention not to access the I2C bus while the sensor is powered off. Then support for frame interval handling was added. This is kind of a requirement these days: if you want to submit a V4L2 sensor driver, frame rate handling is more or less mandatory, at least for a few frame rates. That's another thing that calls for a shared notion of state: if your driver is streaming, you should refuse, returning EBUSY or another error, if someone tries to set the format or change the frame interval. And the last change is the creation of the sub-device node, which goes along with the support for the media controller operations. So I've probably been too fast; we have ten minutes for questions. I hope you have some, because otherwise I've really been too fast. I had a hundred slides, so I was worried that the time would not be enough, but actually I've probably been talking too fast. If you have any questions, or anything you want to discuss about how things could evolve, not just for soc_camera-based drivers but for sensor drivers in general, there are two microphones here, so please go ahead. [Question:] I'm just wondering, when you're writing this new code, adapting it, what sort of techniques do you use for debugging and working out what's going wrong?
Because it strikes me that this is a more complex setup, with everything being asynchronous, and when it's not working you might not have much idea as to which bit isn't actually hooking up. [Answer:] So, you get angry, that's the first thing; I mean, you get disappointed when things don't work. That's the first debugging tool you usually use. Well, it depends totally on the system that you use, but things are now asynchronous, so having a notion of what the power state is, again, that's always useful. And talking about streaming start and stop, that's now handled through the media controller framework, so you have a notion of a pipeline. All the components here are put in a pipeline, and when you start streaming, the media controller framework goes from one to the other and calls stream-on on all of them. So enabling the debugging in the media controller framework helps you understand what's going on at each step of the capture process. But in the end you should just know what's happening in your sensor driver. If you have problems streaming, it depends on what problem you have: you don't receive images, you receive bad images, you are missing a set-format call; that depends on the driver, usually, and on what debugging tools your system provides. Compared to other parts of the system, it's hard to debug these kinds of things using JTAG, because there is an I2C bus in the middle. So the last resort is printing out all the messages you're sending on the I2C bus, printing all of them, seeing what happens, going through the datasheet, doing the comparison. So there are different degrees of complexity you may want to handle. I don't know if you have a specific use case for that? [Inaudible.] Well, we have time, five minutes afterwards, so that's okay. Okay, thank you. Thank you. Anyone else? Okay, so thank you; more time for coffee.