Hi, today I want to talk about some cgroup-BPF production surprises we have seen at Meta.

The first one is about the cgroup getsockopt/setsockopt BPF programs. There is a hook in the getsockopt and setsockopt syscalls that lets a BPF program intercept them. We find it super useful, and we are a big user of it. One of our common use cases is to create Meta-only socket options, where we do some special things. For example, our getsockopt BPF program understands a specific internal-only optname and then does something special for it — the example here is reading a value from the socket local storage and returning it to user space.

It all worked well. We tested it on a lot of machines — basically the whole cluster — and it was working fine until one day. A random service, on a sub-percent (less than 1%) of the fleet, randomly started doing a getsockopt like this with a very large optlen, and it created an incident: the calls failed. We traced it to the BPF code path in the kernel, which returns EFAULT even though our BPF program doesn't change anything for that particular optname. The BPF program is not interested in it — it doesn't touch anything — but it still triggers an EFAULT from the BPF code path.

So what we told the team to do is: for an optlen larger than the page size, the BPF program has to set optlen to zero, so that the BPF code path will ignore this sockopt. This was a surprise for the BPF program authors — they asked, why only for optlen larger than page size and not for everything else?
So if we look at the kernel implementation details: before the getsockopt BPF program gets called, the kernel does some preparation work. One of the things it does for getsockopt is allocate kernel memory and copy the option value from user memory into this newly allocated kernel buffer, so that the BPF program can read from it. This allocation is limited to 4K. If the BPF program doesn't shrink the optlen back to something no larger than 4K, the kernel assumes the BPF program did something incorrect — the optlen is inconsistent with the buffer the program could actually read or write — so it returns EFAULT.

There is a recent fix from Stan that suppresses the EFAULT for this particular case, when the original optlen is larger than the page size, because most of the time — I would say 99.9% of the time — the BPF program is just not interested in this large option value and simply forgot to set optlen to zero.

Going forward, can we provide a better experience to the users? If we go back and think about why we need the kmalloc at all: it's because the BPF program cannot sleep, so it cannot directly read the option value in user-space memory. Because of that, we need to allocate kernel memory for the BPF program to read from. To bound how much we allocate, we cap it at page size, which also limits how much memcpy we do — most of the time the BPF program is not interested in most of the socket options; it may only care about one, two, or three out of the tens of possible ones. So the thinking is: can we avoid this memory allocation entirely?
Can the BPF program directly read the option value from user-space memory? If we made this BPF program properly sleepable, I think we should be able to do that. The newer BPF LSM cgroup hooks are already sleepable, but the older cgroup sockopt programs are not — so that's probably something we can improve. For reading, I think it should be doable to read the user-space memory. But how about a BPF program that needs to change the option value? For getsockopt, can we directly write to the user's option buffer? And for setsockopt, what happens if the BPF program wants to create a new option value that is larger than the user-space buffer — how do we deal with that case? Do we need to go back to kmalloc when the BPF program does want to write a larger option value? And the question I have is: does it make sense to fit all of this into the dynptr API?

For setsockopt, what is the use case for having an optval larger than the original user buffer? You need to crop it anyway, right? Otherwise you risk overwriting something in user memory.

For setsockopt, eventually you may want to fall back to the kernel's own handling. I don't know — we don't have this use case, but the current code seems to accommodate it.

Yeah, I guess the existing path has this hack where, if the buffer is smaller than — I don't know, four bytes — we allocate 16; maybe we should not accommodate this case. But I think we need to think it over again, because remember, when we started those sockopt hooks we still had this ugly thing where most setsockopt handlers did their own copy from user and copy to user, and there's been some rework to unify all of this, right? There is some abstraction we use now — sockptr or something — which basically knows whether the buffer is in kernel or in user space.
So maybe we should plumb this abstraction into BPF as well somehow — I don't know. Or maybe there's already a nice abstraction I didn't point out. Right now it always writes into kernel space, right? So if we take away the use case where setsockopt may create a longer option value, can we always write to user space instead?

But it can fault as well, right, if you write to user space?

Same for getsockopt — with getsockopt we eventually need to write back to user space anyway, so that may fault too. If it faults, we return EFAULT. I mean, as long as your program is sleepable, you can do it. I think all of this better UX needs sleepable BPF here; otherwise there's a bunch of hacks we have to do, and then we make mistakes.

So the end point would basically be that you have user memory versus kernel memory, and you could decide which one to write to — would that be the idea?

Yeah — if we can take away the use case where setsockopt needs to write a larger value, I think the dynptr can always stay with the user memory.

Sorry, you want to write to the user memory? What if another thread unmaps it?

But that would be the same case as whatever the kernel is already doing, right? How is it different for a BPF program writing to user memory? Because you would page-fault? But the kernel also — for example, if the kernel is handling the getsockopt itself, it also copies to user memory, right? How is it different? I don't think it is.

And I think there's also another case, if I remember correctly — it's been a while — I think for getsockopt, when the kernel itself errors out, we cannot overwrite this error or give some buffer back to user space. I think there was a corner case like this, right?
Where it would also have been really useful to support it.

Yeah, that one got an additional fix, right? There was a case where you don't pass the buffer, but you only want to read back the size of the buffer. It wasn't handled correctly — but yeah, I think that's a corner case. This is more about the buffer-management logic: how do we manage it in a non-sleepable hook when it's user memory.

The second surprise, which is also the last one, is about the cgroup sock_ops BPF programs — nothing related to socket options this time. These are BPF hooks in the TCP stack. There is an 8-bit callback-flags field in the TCP socket: if you turn a particular bit on, the TCP stack will call your BPF program at certain TCP states. There is a set of possible callback flags available in the kernel — for example, when the state changes to ESTABLISHED or to one of the closed states. The one I'm going to use as an example is about writing header options: you can turn on a flag to say, "I want to write some TCP header options for some of the packets," and after you're done you can turn it off, so that the TCP stack stops calling your BPF program whenever it needs to write TCP headers and send a packet out.

The current API is a helper to set these flags — the usual thing: bitwise OR to turn a flag on, and bitwise AND to mask it out to turn it off.

Here is a use case that shows the problem. There are two different sockops BPF programs, and both of them want to turn on the write-header-options flag during the listen callback.
That way, both BPF programs get a chance to write their own TCP header options during the three-way handshake. But one of the programs — program A — gets the established callback after the three-way handshake and says, "I still want to keep writing headers," so it keeps the flag on. Program B says, "I'm not interested in writing any more header options after the three-way handshake," so it turns the flag off.

So the problem comes in: A runs first and keeps the flag on, but B runs later and turns it off. Program A then never gets the callback again.

I talked about this with both teams that added these sockops BPF programs. In the end we concluded that once a flag is on, we can never turn it off, because a program doesn't know whether other BPF programs ahead of it are still interested in it. Instead, each of these BPF programs stores something in the socket local storage to remember whether it is still interested in this callback for this socket. Program A stores a boolean set to true, so it keeps writing headers; program B stores false, and stops writing headers after the three-way handshake.

I don't have a better idea how to tackle this right now, so I wonder if you have experience with this — any ideas how to implement it better in the kernel?

Yeah, I told you offline that we've hit the same problem — not with the header option, but with some other callback, I don't remember which. And I guess for us it's easier, because it's also a controlled environment: we can say, OK, we know we need this set of flags, so we just go and fix the programs.
But, I don't know, maybe when we load the program we could add some new mask that you pass, saying, "this is the minimal set of bits that I need," and then when anyone tries to disable those bits they get an error or something. Though I guess that's essentially the same as your first point, where you just keep the flag on forever and never reset it — at least it's at the API level, a contract you present to the kernel: these are the bits I will be using, please don't turn them off.

Maybe you can also play with this whole cgroup inheritance story — whether you attach with override or with inheritance, and you OR those flags together. But again, I don't know how many people in the wild are using this. It's me and you. So how big is the problem, really?

I mean, do you use those sockops at Cilium? Yeah, we've been using them a lot — for congestion-control experiments, some parameter tuning. We do some experimenting with them sometimes.

I had a question about the BPF sk_storage: is it associated with just the socket FD, or does the key also include the attach type? In this case each program has its own sk_storage map, so each program knows to look in its own map for its own boolean. So if there are two programs with the same attach type, would they get their own copy? They each have their own copy in their own storage, so they don't interfere with each other.

But then, what about extending the helper to lock the state once you set the flag? It doesn't really help, because in some cases you want to disable it again, right? Well, if you lock that bit, that bit is owned by that BPF program. Yeah, but it still comes back to the same thing: once it's enabled, it's always enabled.
So let's say one program locks the bit as disabled — then another program may well want to enable it. If we had infinite storage and compute, what we really want is to store this mask per program. Maybe that's not a lot of storage, right, if we stick it somewhere in the socket — is it too much? We can add something in the socket and increase the granularity of those bits. I mean, we'd still use the single-bit mask for the efficient data-plane check, but when a program actually changes a flag, I think we can probably afford to store all the per-program masks and then do the validation: "this program really is still using this bit, so let's not clear it."

Potentially the BPF program itself could even install its own function-pointer callback, so there would be an array of function-pointer callbacks; if this array is non-empty, we turn the bit on.