Hi, everyone, I'm Aditi Ghag, a software engineer at Isovalent, and I wanted to discuss two topics that will help us address and simplify some of the use cases in Cilium. So I wanted to propose some extensions to the BPF infrastructure and get early feedback.

The first topic is having a network-namespace-unaware BPF socket iterator. Yeah, the title is pretty self-explanatory, but here is a quick refresher on why Cilium needs it. I presented this topic at the Plumbers Conference last year: Cilium needs to iterate over host-wide sockets across all network namespaces, including the host network namespace, and filter client sockets that could be connected to deleted service backends. In the interest of time, I'm going to skip over the implementation details, but with the iterator, we plan to invoke this new kfunc that we are adding, bpf_sock_destroy, and kill the filtered sockets.

I've added a snippet of how the TCP and UDP socket iterators match on network namespace. As you can see in the highlighted section of the code snippet, if a user needs to iterate over sockets in all network namespaces, the user has to enter every network namespace and retrieve the data. As you can imagine, this is very inefficient, especially when you have a single host-wide socket hash table. I believe there is a new alternative that was added recently where socket hash tables can be per network namespace, but that's not the default option right now.

So I have a proposal, as you can see on the left. We would need to expose this override option to end users, so I've listed two options. One is we can just allow users to iterate over all sockets in the host network namespace only. The second option is, if a user-space agent or user has the necessary capabilities (CAP_NET_ADMIN), then we allow that user to iterate over host-wide sockets. The first option, however, won't work for nested environments, so imagine Kubernetes-in-Docker environments, known as kind for short, where the Kubernetes hosts are deployed in containers and then on top you have other application containers, Cilium being one of them. What happens in these nested environments is that the host network namespace belongs to the underlying host, and when Cilium is running in the "host" network namespace, it's not actually the base host network namespace, it's the container's network namespace. I don't know if that makes sense. No, yes. Yeah, I'll just repeat what Joe said: it's like each Kubernetes node is deployed in a separate container.

With regard to exposing a flag to a user, having a new field in the BPF iterator attach opts is a good starting point. Currently, the bpf_iter_link_info struct has these parameters, as they are called, for various iterable resources like maps, cgroups, and tasks. We could extend this for socket iterators: either we could call the member "socket", or have separate targets for TCP and UDP. Any comments so far?

As far as I understand, we don't have a socket iterator yet, right? We do have socket iterators, for TCP and UDP; it's just that they have these network namespace checks here. How is the UAPI looking? Can we just repurpose it? Do we have a netns FD or ID or whatever specified as a parameter? Not yet. Can we add it to the existing one? Why do we need a new one, I guess? Do we need a new iterator type, or can we extend the existing one? This is not adding one.
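For reference, a minimal sketch (not Cilium's actual code) of the destroy pattern described above: an iter/tcp program walks TCP sockets and calls the proposed bpf_sock_destroy() kfunc on sockets connected to a filtered backend. The dst_ip/dst_port globals and the filter itself are illustrative assumptions.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* kfunc being added for this use case; callable from iter/tcp and iter/udp */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

const volatile __u32 dst_ip;	/* deleted backend IP, network byte order */
const volatile __u16 dst_port;	/* deleted backend port, network byte order */

SEC("iter/tcp")
int destroy_backend_sockets(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk = ctx->sk_common;

	if (!sk)
		return 0;

	/* Only destroy sockets connected to the filtered backend. */
	if (sk->skc_daddr == dst_ip && sk->skc_dport == dst_port)
		bpf_sock_destroy(sk);

	return 0;
}

char _license[] SEC("license") = "GPL";
```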
As far as I know, it's not adding a new iterator type; it's just adding a new flag that allows users to request this global behavior. I'm just thinking that a bool "global" is very single-purpose. If you can repurpose it to be a netns identifier, whatever the ID for a netns is, then you can specify that, let's say, minus one means my own current netns and zero means global, or vice versa. And then you could actually specify any netns; if you have enough permissions, you can iterate it. It would be a little bit more generic. That is an option. Which option should it be then, the two binary ones, basically? We are probably on the same page; it's just about how we expose this option to users.

Question: if I set global to true, what does it mean? It means I want to iterate all the network namespaces, and then for each network namespace, iterate all the sockets? Correct, yes. So maybe a question more for Andrii. Andrii, I think there is an existing iterator, task_file: iterate all the tasks, and for each task iterate all the files under it. Do you think it's something similar here? We want to iterate all the network namespaces, and then for each network namespace, iterate all the sockets under it. I think what you're referring to is the option where the kernel maintains socket hash tables for every network namespace; by default, there is just one global one.

I guess it depends how you want to think about this. For the task iterator, or task_vma, for example, or task_file, let's say task_file since that's more natural: the idea is that you're iterating over all tasks and all the files within tasks, so your input is two pointers, the task and the file, and then you have the ability to parameterize it and say, I only want to iterate files from a single task. So we could probably do that here: you iterate over all network namespaces and the sockets within them, and then you can parameterize it down to one specific network namespace. Would that make sense? Because if you do a socket iterator, the natural thing would be to iterate over all sockets, potentially within just one namespace, and then you can parameterize it to iterate only one socket, which would basically be a socket lookup. So I don't know, it's a question for the networking folks: what feels more natural?

That isn't the default right now, though. Let's take the example of the task resource. The way it works, the default is that you iterate over all tasks on a host, and then you can parameterize it, saying, hey, I want to iterate over only a specific task. With regard to the TCP and UDP targets, though, the order is a bit reversed: by default it's network namespace aware, and then we want to make it global. So you already have the option of parameterizing it. I'm trying to look up the UAPI because I don't know this, but I mean, it's still a parameter; I'm just saying that a bool seems a little bit too limiting. Sure. But let me look up the UAPI and we can talk. I think there's a question in the back. Chris? Joe?

Yeah, thinking about the kind nested case, rather than a global option, which kind of assumes you're iterating everything on the node, would it make more sense to do basically the current namespace and below? If you think of it as a tree, you're iterating the current namespace and then any sockets in, I guess, nested network namespaces. I don't know if there is such a relationship. Yeah, whether it's the host network namespace or not.
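For context, this is roughly what the existing bpf_iter_link_info union in the UAPI looks like today (map, cgroup, and task parameters), plus a purely hypothetical sketch of the kind of socket-iterator member being discussed; the tcp member and its field do not exist upstream.

```c
/* Existing bpf_iter_link_info (abridged), plus a hypothetical extension. */
union bpf_iter_link_info {
	struct {
		__u32	map_fd;
	} map;
	struct {
		enum bpf_cgroup_iter_order order;
		__u32	cgroup_fd;
		__u64	cgroup_id;
	} cgroup;
	struct {
		__u32	tid;
		__u32	pid;
		__u32	pid_fd;
	} task;
	struct {
		/* hypothetical: 0 = current netns (today's behavior),
		 * a netns fd or an "all namespaces" sentinel otherwise */
		__u32	netns_fd;
	} tcp;	/* and likewise a "udp" member, or a shared "socket" one */
};
```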
So one of the reasons I bring this up is that in the kind case, we may have Cilium running in multiple containers. Each of those is a network namespace, and under that we run Docker containers inside. This is mostly a testing scenario, but the point is that the Cilium agent is running multiple times on the system, so each one of those would be iterating the sockets, and it needs to be aware of what belongs to its own container. I know I'm blurring things here, because it's not a one-to-one mapping of container and network namespace, but there's this nesting property from the container side. It feels like it would be nice if the Cilium agent isn't iterating everything from every other kind node in the simulated cluster, should we say. Oh, and you can have multiple clusters like this as well, so multi-cluster, I guess, is one more layer of nesting.

The cgroup iterator has a mode for this, right? Based on pre-order or post-order tree traversal and these things. If you would like to traverse nested namespaces, you could use the same approach. But if you want a mode like traverse everything in the system, well, that's also possible, like the task iterator traverses all the tasks from the start of the IDR. I think that's also possible, depending on your use case. Yeah, I just wanted to quickly address your last point about being able to iterate over tasks and then over the hash tables. Just going back to this: our main use case is being able to invoke this kfunc, bpf_sock_destroy, and it can only be invoked from the TCP and UDP iterator program types at this point.

I have a question about the global flag. Do you mean the default behavior is to iterate all the sockets in the current namespace? Is that the current default? Yeah, exactly. But the other way around, maybe when you're outside the container, you want to specify a specific namespace to iterate. Yeah, the second option that you mentioned is the default right now. The default, but isn't it only for your current... Current network namespace, yeah. But if you also have a container, for example, a different namespace, and you want to iterate over the sockets of that specific namespace? Yeah, then you'd enter that network namespace. Or you'd need to enter the other namespace, yeah, and then run the iterator program.

There's actually no such thing as an ID-to-netns lookup today, right? So you would still have to go through all the namespaces, check the IDs, and only then, if it matched... I mean, it's still better than... You have the network namespace cookie, which is global and unique. I think there's also an ID internally for network namespaces, but that's just an ID; there's no lookup function for it, you cannot look it up, there's no mapping from kernel space to user space. The network namespace not having any way to look it up is just generally problematic across all observation tools. Like right now, what we do on the Tetragon side is we dig up the file; we're like, if we have a file, and we know this is the network namespace, we'll just give it this ID, right? You'd have to convince Eric Biederman, I guess, I don't know. I thought there's a netlink interface or something to set an ID already, right? No? An ID or a cookie? But that's via netlink. I think there's some way to read the ID. There's definitely a helper in BPF, right, to get it.
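A small user-space sketch of the workaround described above ("enter the network namespace and then run the iterator program"), using the standard setns() call; the helper name and error handling are illustrative.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Switch the calling thread into the netns of `pid` before attaching the
 * iterator, since today the iterator only sees the current netns.
 */
static int enter_netns(int pid)
{
	char path[64];
	int fd, err;

	snprintf(path, sizeof(path), "/proc/%d/ns/net", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	err = setns(fd, CLONE_NEWNET);	/* needs CAP_SYS_ADMIN */
	close(fd);
	return err;
}
```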
But then there's also a getsockopt to read it. That's a recent thing, right? And that's it. If I remember correctly, the ID has to be assigned, someone has to assign it, I don't know what the command is or something, no? So it's automatically generated? Either way, if it's not automatically generated, we could enforce that an ID gets generated or something. I remember it's added to an IDR tree or something like that. And what we would probably want on our side would be a map from cookie to namespace, right, so that we can always get to a namespace. I guess we could build it ourselves with a network namespace iterator, but that would be one option. So yeah, this helps us too, because we can record the network namespace cookie in the cgroup socket program, and then iterate over only the filtered network namespaces that we want to kill the sockets in.

So how about this, to not design ourselves into a corner: we make this an enum, a u32. By default, zero means current network namespace. Then you add another enum value, which means all namespaces. And then eventually, once we have some identifier for the network namespace, we add it as another argument, and the enum gains a value meaning a specific namespace, right? Sure, yeah. That's reasonable.

Sorry, maybe I missed this, but can we use a file descriptor to the namespace as the argument? You can. Sorry? Stan says that's the default. I mean, the default is the current network namespace, yeah. But I don't think you have a file descriptor to one... Can you get a file descriptor, an FD, to any network namespace if you are sufficiently privileged, right? So then this file descriptor is an input argument, and you just default zero to mean the current namespace as a special case, similar to what we do with pid, and then you can also define minus one to mean any, or whatever, right? Then maybe we don't need the enum; we just need this FD with two special values. So how is the FD value different from an enum? Because it can express any network namespace if you get a file descriptor to it. Like if you are some tool in the root namespace or whatever and just want to go over sockets within some specific container without entering that network namespace and so on, then you'd be able to do it. I guess it's just a more universal API.

I think the name FD is tripping me up. Should we just call it netns? netns_fd, yeah. So the idea is that if you look into /proc, every process will have a directory in it, and in that directory you can find all the namespaces that this process is part of. There's also a file for the net namespace. You can open that file, and that gives you a file descriptor, and passing this file descriptor to the kernel allows you to identify, this is the network namespace that I'm interested in, by passing something that looks like a file. It's just a way that we use to refer to kernel resources from user space, basically. And the idea being, if you want to look at a specific one, say the Cilium kind network namespace, you would have to find the right PID, open its network namespace file, and then pass that to the iterator to say, this is what I'm interested in, and you have the semantics that Andrii mentioned, which is that a value of zero means the current one, right?
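As a rough sketch of the "record the netns cookie in the cgroup socket program" idea mentioned above; the map name, sizing, and hook choice are illustrative assumptions, not Cilium's implementation.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Set of netns cookies observed on connect; user space can later decide
 * which of these namespaces to run the destructive iterator against.
 */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);	/* netns cookie */
	__type(value, __u8);	/* presence marker */
} netns_cookies SEC(".maps");

SEC("cgroup/connect4")
int record_netns_cookie(struct bpf_sock_addr *ctx)
{
	__u64 cookie = bpf_get_netns_cookie(ctx);
	__u8 one = 1;

	bpf_map_update_elem(&netns_cookies, &cookie, &one, BPF_ANY);
	return 1;	/* allow the connect() to proceed */
}

char _license[] SEC("license") = "GPL";
```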
For your own process, or minus one being any namespace. So an enum and an FD together is probably cleaner. With zero you default to the current behavior, with all zeros, right? So enum zero would mean current namespace, FD zero means the current namespace, and then you can specify a specific namespace, or a separate enum value saying everything, in which case the file descriptor should probably be zero as well.

So just to summarize, I think we agree that it's beneficial to have a global, cross-network-namespace socket iterator. With respect to the API, Andrii proposed that we have an enum: when the value is set to zero, it means the current network namespace; when set to one, it means all network namespaces; and in between we currently don't have any option to pass a specific value, right, because there is no user-space-to-kernel mapping. Okay. So I think it has to be an enum, right? Because zero is a valid FD, and we use negative one to mean "current" elsewhere. So technically you're correct that zero is a valid file descriptor, but that ship has sailed long ago and the BPF UAPI treats zero as invalid. Libbpf and the cilium/ebpf library have code that basically says, if the kernel gives you a zero file descriptor, we dup it so that it becomes non-zero. So for the purpose of this discussion, zero is an invalid file descriptor. Is the file descriptor the better option, or the inode? The inodes get reused, yes, and I think it happens fairly quickly as well: if you destroy a network namespace and recreate it, it ends up reusing the same inode, I think. And in the BPF UAPI a file descriptor is the usual identifier for something; sometimes an ID, like prog IDs and such, but usually it's a file descriptor, so it's consistent.

Do we have a second topic as well? Sure, yeah, just to close the loop on this: I guess once we have plugged the new file descriptor into this bpf_iter_link_info, we can plumb it down to the init callbacks through the bpf_iter_aux_info struct that stores this additional extensible information.

I have one question, can you go back to the previous slide? Yeah, here. Assume I want to iterate all the network namespaces, right? The BPF program will want to filter by network namespace, for example. Right now the kernel is doing this net-equality check to see whether the pointer is the same. So how can the BPF program filter by network namespace? Is it by some ID in the network namespace, or how? For example, I want network namespace one but not network namespace two; how can the BPF program filter that? So that's the part that's not clear to me: how we can map an FD to the network namespace that the kernel understands. But couldn't we pass a pointer in there, and then if the pointer is null, we go over all the sockets; and if it has a specific value, retrieved from the file descriptor, it's probably somewhere in the private data, I would assume, then we can pass it in here and compare against the netns pointer, or... How does the BPF program get to the netns pointer, or a cookie, or some API to do that? I mean, if we were passing down a file descriptor, like for the API that we just discussed, we would have the pointer, right? And otherwise, hmm, yeah, yeah. And then the other thing is, if you really want to go over all of them, then what would be null? Oh. Can iterators call other iterators?
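A user-space sketch of the fd-based parameterization being discussed: open /proc/<pid>/ns/net and hand the fd to the iterator via libbpf's attach opts. The tcp.netns_fd field is hypothetical and does not exist in today's UAPI; everything else uses existing libbpf APIs.

```c
#include <fcntl.h>
#include <stdio.h>
#include <bpf/libbpf.h>

/* Hypothetical: attach an iter/tcp program scoped to the netns of `pid`,
 * assuming a future linfo.tcp.netns_fd field with 0 meaning "current netns".
 */
static struct bpf_link *attach_tcp_iter_for_netns(struct bpf_program *prog, int pid)
{
	LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	char path[64];
	int netns_fd;

	snprintf(path, sizeof(path), "/proc/%d/ns/net", pid);
	netns_fd = open(path, O_RDONLY);
	if (netns_fd < 0)
		return NULL;

	linfo.tcp.netns_fd = netns_fd;	/* hypothetical field */
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	return bpf_program__attach_iter(prog, &opts);
}
```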
Like, I mean, the idea would be to have a netns iterator that calls into the socket iterator for that netns, right? How about a new for-each? Well, now we have for-each-something, okay, we could have a for-each network namespace and then a for-each socket. Okay, I think we can move on to the next topic.

Okay, yeah, for the second topic, I wanted to discuss some extensions to the BPF cgroup infrastructure to support environments with containerized workloads. Just some background to set the stage for further discussion. Starting with socket LB: Cilium mounts the cgroup v2 filesystem and then attaches a set of programs of the cgroup socket type at the cgroup root, and I've highlighted the cgroup root because Cilium needs to support socket-based load balancing for connections that originate in the host network namespace. The tricky part is that in the cgroup hooks we don't have access to the source IP and port, or rather, the source IP and port fields are not populated when our programs are executed. Some new use cases have come up recently where we need fine-grained traffic control at the cgroup layer, and since we cannot identify pods, it's difficult to support them. For example, we want to selectively skip socket LB for certain pods but not others, or policy enforcement where we want to say, hey, pod A can talk to pod B, or something like that. The last use case is tracing. When the BPF sock programs are executed, they do the socket-based load balancing where they translate a service VIP to a service backend IP address. We generate tracing events and send them to user space, and then user space needs to attribute these events to the corresponding pods that generated them.

So I've sketched a high-level overview of the interaction between data plane and control plane. Since we use the cgroup v2 filesystem, we can use the cgroup ID as the shared context between control and data plane. In user space, we have the Cilium agent running, which receives events from the Kubernetes control plane. When pods are started, it maintains the cgroup IDs and cgroup paths for these pods and the containers that run within a pod; a pod is just a set of containers, and these containers share a network namespace. In the data plane, we use the existing helpers to retrieve cgroup IDs. The first one gets you the current cgroup ID, which is the ID that belongs to the current task; in our case, that's the container cgroup ID. The second helper that's available is the ancestor cgroup ID, where, as you can imagine, you can specify the ancestor level, the ancestor being a cgroup that's higher up relative to the current task. So, for example, for our tracing use case, what we currently do in the BPF cgroup programs is stamp events with the cgroup ID and then send them to user space via a ring buffer, and user space can then attribute these events to the corresponding pods or containers.

The problem we have is that in Kubernetes, cgroup hierarchies are not consistent. I've added snippets of example cgroup paths here and highlighted some of the fields that vary between them. The way Kubernetes constructs cgroup hierarchies is with variable fields like the pod, its QoS class, their IDs, the container ID, and so on, and all these fields are encoded in the cgroup path for a pod. So we currently don't have a reliable way of getting pod cgroup paths.
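A minimal sketch of the tracing flow just described, assuming a connect hook and an illustrative event layout (not Cilium's actual event format): stamp the event with the current cgroup ID and push it to user space over a ring buffer.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct trace_event {
	__u64 cgroup_id;	/* container cgroup id of the current task */
	__u32 dst_ip;		/* destination (e.g. service VIP), network order */
	__u32 dst_port;		/* as stored in bpf_sock_addr, network order */
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} events SEC(".maps");

SEC("cgroup/connect4")
int trace_connect4(struct bpf_sock_addr *ctx)
{
	struct trace_event *ev;

	ev = bpf_ringbuf_reserve(&events, sizeof(*ev), 0);
	if (!ev)
		return 1;

	ev->cgroup_id = bpf_get_current_cgroup_id();
	ev->dst_ip = ctx->user_ip4;
	ev->dst_port = ctx->user_port;
	bpf_ringbuf_submit(ev, 0);

	return 1;	/* allow the connection */
}

char _license[] SEC("license") = "GPL";
```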
So we discussed this issue with upstream Kubernetes, and what they recommended is that container runtimes can pass the pod cgroup paths to CNIs like Cilium. With this, we have a way to get pod cgroup paths in the control plane. However, we still need to address the issue in the data plane, and let's see why. The bpf_get_current_ancestor_cgroup_id helper expects an ancestor level, and this is what its description reads. Now, the tricky thing is that this ancestor level is computed with respect to the root. Let's go back to these example cgroup paths. As you can see, in the first one, counting from the root you go one, two, three, four levels down to the pod cgroup, but in the second path it's only one, two, and that's because Kubernetes encodes certain QoS fields for certain pods and not others. This is what makes it difficult for us to use this API. When I first looked at this API, my intuition was that it expects the ancestor level computed with respect to the current task, but that's not the case.

Before talking about the proposal, the alternative we considered is using cgroup local storage. The caveat there is that local storage is associated with the cgroup where the programs are attached. In our case, since we attach the BPF cgroup programs at the root level, the local storage is associated with the root where the programs are attached, and not with the pod cgroup.

So the proposal I had was to allow ancestor levels computed from the current task. When operating at the container level, we would just say, hey, I need the cgroup ID for the pod, which is just one level up from the current task. Here are some options. We could extend the existing helper to take negative levels, indicating that the user is asking for cgroup IDs relative to the current task rather than the root. Or we could introduce a new kfunc that duplicates some of this functionality but also allows for different kinds of ancestor levels. The final option is, maybe we can just get the current hierarchy level of the task, and then the BPF program can compute whichever ancestor level it needs the cgroup ID for.

So the current behavior is, when you pass zero or one, you actually get the root cgroup? When you pass zero, it's the root cgroup, yes. Okay, I mean, negative makes sense to me; seems simple and natural. Okay, yeah, I agree. It's just that if we extend the existing helper, how do we determine whether the current kernel supports the new API or the old one? Sure. Yeah, I think that could work. You don't even have to feature-flag it, you can detect it at runtime, right? You just run the same helper; it returns you probably EINVAL on old kernels and something meaningful on new kernels, right? Yeah, it's a one-time thing. We could also use that. I guess the only complication with the runtime check is that if you pass minus one, you need to make sure that you have at least one parent, right? Because otherwise, if you're in the root and you do minus one, there is no parent of the root, and it's natural to return an error, I guess. There could be a different error value depending on whether this is supported or not. Would what you describe work with cgroup namespaces? I mean, it makes no sense otherwise, right?
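A sketch of the runtime feature detection suggested here, assuming the proposed (hypothetical) negative-level semantics for bpf_get_current_ancestor_cgroup_id(): today's kernels return 0 when the ancestor cannot be resolved, so a non-zero result from a task whose cgroup has a parent would indicate the new behavior. Program and variable names are illustrative.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Set once by this probe, read by the fast-path programs at runtime. */
bool has_relative_ancestor_level = false;

SEC("cgroup/connect4")
int probe_relative_ancestor_level(struct bpf_sock_addr *ctx)
{
	/* Hypothetical semantics: -1 = one level above the current task's
	 * cgroup. Must run from a task whose cgroup has a parent; existing
	 * kernels simply return 0 here.
	 */
	__u64 id = bpf_get_current_ancestor_cgroup_id(-1);

	has_relative_ancestor_level = (id != 0);
	return 1;
}

char _license[] SEC("license") = "GPL";
```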
So on the new kernel, if there is no cgroup, that should be ENOENT; so for minus one, if you don't have a parent cgroup, it's ENOENT and not EINVAL. Okay, I'll need to check the code. Well, yeah, basically right now, if it cannot find the ancestor, it just returns zero. I guess we can return some new error code, or maybe not, right? If it's zero, it's too late. So what does it return when there is an error? Oh yeah, okay, that explains it. Okay, so what about option three? I mean, it returns a cgroup ID, right? So 0xFFFFFFFF could be a valid cgroup ID; it's 64 bits, so it's possible it's a valid cgroup ID. I don't know how it's generated, but it's 64 bits, right? Yeah, to repeat, the higher 32 bits are the generation and the lower ones are the IDR index, so yeah, you can overflow into 0xFF...FF values. I guess, given all these complications, then let's add a kfunc with an error code, returning a long. It would be ENOENT if the cgroup is not found, otherwise the ID, but then what if 0xFF...FF is a valid ID? Okay, we can probably take it offline. That's a kfunc too. I just looked it up, there's a kfunc called bpf_cgroup_ancestor already. You added it? Yeah, sorry, I was responding to an email. I honestly apologize, I haven't been paying too close attention because I've been responding to an email, but you can get the ancestor of a cgroup already. So you should be able to do that, yeah? Exactly what Martin said. Yeah, that's what the sk_ancestor_cgroup_id helper does. Oh, oh, yeah, I mean, this should be, yes. This should be super easy to do. I mean, you could fail if it's an invalid index, but I think that should be trivial to implement. It's still constant time, right? If it's negative you fail... no, yeah, you return ENOENT. Yes. I mean, I think you can already do this, right? You can just get your ancestor; even if it's a distant ancestor, you can go to zero, that's the root, and you can just get whatever you need. I might be missing something. Right, okay. Maybe getting the current hierarchy level is easier, simpler, and we might need this anyway, so I agree; because it's simpler, we might need it anyway, and I don't think the overhead matters that much, but I could be wrong. Maybe let's go with the original plan, just do the negative levels, but for the feature detection you will have to make sure that you do have a parent cgroup, then you run a special feature-detect program to check whether this helper returns you zero or an actual ID, and then at runtime you just use a global variable. Sure. By the way, we should probably extend the new kfunc that you added, the one that takes a pointer to a cgroup, to have the same semantics, right? To keep it consistent. Yeah, and it's subjective. I think negative is a little bit confusing if you want to do a relative lookup, but yeah, if we wanted to add it somewhere, we could just add it to the current kfunc, and it should be fine, if people think it's useful. Cool, thank you. Yeah, thank you very much.
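For completeness, a sketch of what Martin is pointing at: with the existing bpf_cgroup_ancestor() kfunc (callable from tracing programs, shown here on an arbitrary fentry hook), a program can already read its own cgroup's level and compute a relative ancestor. The "pod is one level above the container" assumption and the hook choice are illustrative only, not Cilium's actual code.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

SEC("fentry/tcp_connect")
int BPF_PROG(report_pod_cgroup_id, struct sock *sk)
{
	struct cgroup *cur, *pod;
	__u64 pod_id = 0;

	cur = bpf_cgroup_from_id(bpf_get_current_cgroup_id());
	if (!cur)
		return 0;

	/* Relative lookup: one level above the current (container) cgroup.
	 * bpf_cgroup_ancestor() takes an absolute level counted from the
	 * root, so subtract from the current level; it returns NULL if the
	 * level is out of range.
	 */
	pod = bpf_cgroup_ancestor(cur, cur->level - 1);
	if (pod) {
		pod_id = pod->kn->id;
		bpf_cgroup_release(pod);
	}
	bpf_cgroup_release(cur);

	bpf_printk("pod cgroup id %llu", pod_id);
	return 0;
}

char _license[] SEC("license") = "GPL";
```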