Hi, my name is Vinay, and I'm here today to talk about extending libbpf for a Kubernetes use case, which I hope you find interesting. The agenda for today: we'll look at a cloud-based development environment use case we came across about a year ago with a company called Okteto, which is in the business of cloud-native development, offering a service to build your workload in the cloud. We'll look at the problem they had, how eBPF helped solve it, and some of the pain points I encountered while implementing the proof of concept for the solution. Based on that, I have a couple of suggestions, a proposal for extending libbpf with helpers; we'll take a look at that and see if it makes sense. And there are a couple more use cases I came across that could justify the need for this.

So let's get started. Cloud-based development environments: why does this use case matter? Do we need it? Say you have a dev team of 20 people. You want them to be productive writing and building code, so you buy them M2 MacBook Pros at maybe $4,000 apiece. That's quite a bit of money. An alternative is to buy more economical laptops that are good enough for writing code in your favorite IDE, and to do the heavy lifting, building the code and running a battery of tests against it, in the cloud, in a Kubernetes pod running on a high-powered system. The advantage is that the team shares the cloud capacity, since not everyone is expected to be building and running tests at the same time, and they also share the configuration, so you don't run into the situation where it works on your machine but breaks once you check it in. So there's a value proposition there, companies are offering this as a service, and it's picking up in popularity as an alternative to high-priced development systems.

The flow is essentially: you write your code, sync the code to a Kubernetes pod, and the Kubernetes pod takes care of the heavy lifting. We presented this use case at the last KubeCon North America, and there's a link to that if you're interested.

Now, this is great, but how do you tell Kubernetes that you want this dev environment in the cloud? You do it by giving Kubernetes a pod spec. I'm not sure everyone knows this, so I'll just go over it a little bit. The pod spec you see here tells the Kubernetes API: this is the name of the pod, and I want to schedule it in the cloud on a system that has at least four CPUs and five gig of memory. With that, Kubernetes takes the spec, pulls your container from the image repository, and runs the workload. In this case, what I'm showing is a build environment that can build Kubernetes.
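The slide itself isn't reproduced in this transcript, but a minimal pod spec along those lines would look roughly like this; the pod name and image are illustrative, not the ones from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dev-build-env                 # hypothetical name
spec:
  containers:
  - name: builder
    image: registry.example.com/k8s-build-env:latest   # illustrative image
    resources:
      requests:
        cpu: "4"      # schedule on a node with at least 4 CPUs available
        memory: 5Gi   # ...and at least 5 GiB of memory
```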
Once it's up and running, this is great, but until very recently Kubernetes did not have the capability to modify the resources you requested when the pod was created. If the pod is running but there's no build or test running in it, that capacity is reserved for you and sits idle, which is a waste of resources and money. What's worse, say you start running the tests and realize that five gig is not enough, you need six. Tough luck: you have to kill the whole thing, reschedule the pod with six gig or more, and run it again, which again wastes time and money.

We recently fixed that. In Kubernetes 1.27 we added the ability to resize a pod without having to kill the workloads running in it, so you can dynamically change the CPU and memory resources up or down based on need. There's a tool called the Vertical Pod Autoscaler that does precisely that: it monitors CPU and memory usage on an ongoing basis using some APIs, and when it sees that the workload needs more CPU or memory, it makes a recommendation and can apply the update as well, and vice versa when it sees an underutilized pod.

But this cloud-based dev environment use case is different from a normal Nginx workload or database pod, where memory grows slowly as the number of requests increases. This is very spiky in nature. Imagine your usage sits near the request level while you're not using the system, but the moment you hit a make command you need a lot more memory and CPU. By the time the Vertical Pod Autoscaler reacts to that, we may end up in a situation where the workload gets OOM-killed because it exceeded its reserved quota. What we ideally want is something like this: when you see the make command, you increase the capacity, if it's available, to what the build system or the tests need. That way you're assured that for the duration of the test run things will continue to work and won't break down. We called it the proactive approach.

So this is where eBPF comes in. Let's look at this example. I've intentionally set the memory request to 50 megabytes, which is sufficient for running a bare-bones pod but not for any real work; if you want to build or run tests, it's not enough. A very crude way of fixing that with eBPF is, as you all know, to monitor the execve system call, trace everything that goes through it, and if you see the make command, resize the pod. That little piece of code fits on one PowerPoint slide, but that's about the only good thing about it.
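A minimal sketch of what that one-slide program might look like, written against libbpf with vmlinux.h. This is an illustration rather than the actual slide code: it matches on the new task's comm at the sched_process_exec tracepoint instead of parsing execve arguments, it uses the bpf_strncmp helper (which, as comes up later, exists now), and a bpf_printk stands in for signaling the user-space agent that would call the Kubernetes resize API:

```c
// crude_make_watch.bpf.c -- every exec on the host is inspected, with no
// container filtering at all. That lack of filtering is exactly the problem.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tracepoint/sched/sched_process_exec")
int watch_exec(void *ctx)
{
    char comm[16];

    bpf_get_current_comm(comm, sizeof(comm));

    /* Crude: react to any process named "make", wherever it runs. */
    if (bpf_strncmp(comm, sizeof("make"), "make") != 0)
        return 0;

    /* A user-space agent watching the trace output would resize the pod. */
    bpf_printk("make started, resize the pod");
    return 0;
}
```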
What we really want is to trace only the commands we're interested in, make in this example, and only for the containers we're interested in. There might be 100, 200, a thousand containers running on the system, but you're only interested in that one build container. It can be issuing lots of commands, doing lots of things, but only when it runs make do you want to catch that event and use it to resize the pod.

Good question. Yeah, sorry, I'm over here. So yes, that would certainly work. Obviously, if you have a lot of other things running on the host, then you're just going to have CPU overcommit and you're going to have issues. So this is something that sched_ext could potentially help with. This is probably more related to the soft affinity we were discussing yesterday, where you can assign some CPUs to the cgroup or to the pod, but if the rest of the machine is underutilized, you would just take those CPUs, say if you had a highly parallel make job.

Yeah. I've been more of a user of eBPF, so I don't know all the details to the depth you all do, but my take was that sched_ext can help you make use of unutilized CPU cycles and give them to pods. Does that help with memory, though?

No, memory, no, you'd have to handle that separately. But you're increasing the number of CPUs as well in your example, right?

That was one of the aspects, yeah. I stole that slide from my co-speaker from Google, sorry; I really wanted it to be memory.

Yeah, I was just pointing out that I think the more general solution to giving a pod more capacity when it runs make or whatever would probably be to do that in the scheduler instead of in the config, but it hasn't landed yet. So maybe something to experiment with, if it would be interesting.

Yeah, I think it's good to have an additional solution there. Does that also respect cgroup limits?

Yes; you can implement a scheduler that implements CPU max and all of the CPU controls. That's what one of the example schedulers I mentioned, scx_flatcg, does: we do a recursive walk of all the cgroups and build a flat hierarchy, so that you don't take the overhead of walking cgroups when you're choosing tasks. You could use that as an example. It's a little bit heuristic, and it has some corner cases where it won't honor the cgroup limits, but you could do that. If that's what you want, that combined with soft affinity would probably get you pretty far.

Okay, so this introduces another alternative, at least for handling the CPU needs, which I think would be useful for at least one other use case out there that relies purely on the CPU part of it. But what remains to be identified is how we expose this to the user: the ability to specify that this is my min and my max, and that it might change after a certain event. The use case here was more deterministic: you know that certain events, like hitting a make command, mean you'll need to resize the pod. And I guess the same trigger can be used to drive sched_ext as well; I can see that as a potential use case.

You would still need to increase the cgroup limits, right? Because otherwise you're just bypassing Kubernetes policy.

Yeah, I guess it depends on the use case. Usually if you have cgroup CPU limits, it's because you don't want one cgroup to overutilize the host and starve other tasks, right? But something like soft affinity would only kick in if the rest of the CPUs were idle. I mean, you do have to worry about memory bandwidth; there are other implications to being allowed to run there. And you could always say, no, you can't use this, you can't exceed your limits. But I think the typical use case is that you have a CPU affinity in one cgroup and a separate affinity in another, or maybe something less hard-coded than that, and you just don't want one cgroup with 20 tasks to crowd out another cgroup with 10.
You want rough sharing between them, something along those lines.

But I guess my point is that Kubernetes is the one doing the scheduling; it uses cgroups to do that scheduling and to control the resources.

Yeah, I know. Conflicting with Kubernetes heuristics and Kubernetes load balancing is a different problem. I'm just saying that I think the general solution to "this pod could run with more capacity because the host has spare capacity" would be to use the scheduler. But your point is well taken as well.

I think if we want to go down that route, it'll be a significant amount of change to Kubernetes itself to switch to, or at least extend to, a model where the cgroup is at its max but could use more CPU, with a sched_ext program giving it more resources.

Maybe on that: what Kubernetes exposes today is the CPU or memory request, make sure there's at least this much capacity on the machine, and then the max, don't exceed this limit. Those are very high-level constraints. So if there's something different we want that doesn't fit those semantics, that's interesting. And if it fits those semantics, and this is a piece of the machinery to actually make it efficient, that's also interesting.

Well, okay. So if you want to increase the number of CPUs a pod is using, would that feed back into Kubernetes, which might then load-balance you to a different host or something like that? Like its scheduling of how many different applications get packed onto the different nodes.

So, there are a few different ways to do it, but the easy benefit of soft affinity is that it's essentially, what's the word I'm looking for, a lossy way of getting extra capacity without having to do any of that. If you have one Kubernetes pod that's given half of a host and another that's given the other half, the point of soft affinity is that you can use whatever cores the other one isn't using; it's basically a soft guarantee. If you really need more CPUs for the pod, then yes, you'd increase the number and you'd get migrated. But for a specific build job, if the pod can do multiple things, it feels a little more robust not to have to be migrated to a different host. Soft affinity is maybe a first attempt or something, I don't know. If you wanted to tie it in with the load balancer, that would be a different story; I haven't even thought about how that would work.

The load balancing, or in this case the Kubernetes scheduling, comes a layer before: scheduling the workload is different from scheduling the actual tasks in the workload, and it happens before the workload even gets started up. The Kubernetes scheduler has a view of the whole cluster, and its job is to find the best home for the pod. And people are guessing when they set these values; they don't know what they really need. That's the main thing. As far as CPU goes, doesn't today's fair scheduler also take care of this?
Let's say the recommendation is not to set limits. In that case, whatever capacity is available in the system you can use: say you have a 32-CPU system but requested four. Four are reserved for you, but if you go up to 16 or 28 or 32, you'll get that. CPU is a reclaimable resource, right? It's not like memory, where you give it away and it's hard to get back until the process releases it.

Yeah, but if you have an affinity of five CPUs, CFS is not going to give you any more than five CPUs, right? And a lot of the time that's what people do with cgroups, because the CPU controller is not very accurate and has very high overhead. At least at Meta, that's been our experience with it.

Are you talking about CPU sets, like reserved CPUs?

Yeah. It would be a bug for a task to be scheduled on CPU zero if that CPU wasn't in its mask.

That has not been in the scope of my work so far. Probably somebody else will do it, not me. So I think if you have a pod that traditionally uses five CPUs and it's scheduled alongside another one that uses five on a 10-CPU host, soft affinity would be a way not to underutilize the host and to get most of what you need most of the time. But if you need a guarantee of more CPUs, it won't help you.

Yeah, and there is a use case for that: in the Kubernetes world it's called the static CPU manager policy, where whole CPUs are affinitized. The in-place resize implementation that's there today doesn't take that into account; the workload could be running on any CPU, and that's fine. For workloads that need to be affinitized, NUMA and all that, Ericsson, I believe, if I remember correctly, wanted that facility, and to keep the scope manageable we didn't take it into the scope of the project at this point, or at least into what got done so far. But as a future extension I can easily see this being a great thing, because you could get there without a whole lot of circus. Honestly, I need to digest a lot of this.

Yeah, of course. We can talk about it more offline. There's also the CPU controller, aside from the affinity thing, that we could talk about: if you're hitting issues in the controller with limits and things like that, which isn't the same as affinity, that could help as well.

Okay, thank you. So, I'm skimming over a lot of the Kubernetes stuff here; if you have any questions, please interrupt anytime. Where was I? Okay, so, talking about how we leveraged eBPF tracepoints to do this resizing, call it just-in-time resizing. The better way to do it is to trace only the commands you're interested in, make in this example, and only in the container you care about. That way you're not flooding anything; it's much more efficient. The issue here is how to get the cgroup ID given the container ID. What you see up here in the yellow screenshot is for cgroup v2, which is easier, and this one is for cgroup v1. Now, this is not a big deal, but it's not entirely trivial either; it takes some looking around, and you don't want to repeat this code all over the place. That, I thought, is something that could be improved upon.
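The screenshots aren't reproduced here, but for the cgroup v2 case the lookup amounts to something like the following sketch. It assumes the container ID appears somewhere in the cgroup directory name, which, as the discussion below gets into, is runtime-specific, and that the cgroup ID is the inode number of the cgroup directory; the function names are illustrative:

```c
// find_cgroup_id.c -- rough sketch of mapping container ID -> cgroup ID (v2).
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

static const char *container_id;   /* e.g. the 64-hex-char runtime ID */
static uint64_t found_id;

static int visit(const char *path, const struct stat *st,
                 int type, struct FTW *ftw)
{
    /* Match any cgroup directory whose path embeds the container ID. */
    if (type == FTW_D && strstr(path, container_id)) {
        found_id = st->st_ino;   /* cgroup v2: cgroup ID == inode number */
        return 1;                /* nonzero stops the walk */
    }
    return 0;
}

uint64_t find_cgroup_id(const char *id)
{
    container_id = id;
    found_id = 0;
    nftw("/sys/fs/cgroup", visit, 16, FTW_PHYS);
    return found_id;             /* 0 if not found */
}
```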
Once we get the cgroup ID, this is what we'd probably do; not the best code, but something like what we want for tracing these events. We get the current cgroup ID; if it's in the BPF map, we know it's a container we want to trace, and then we look at the command being executed. If it's something we're interested in, we issue a perf event and let the user-mode program that's watching for these perf events handle it. This is probably the cleaner way to do things. You can see I'm doing an ugly open-coded string compare there; that's because at the time, the library I was using did not have bpf_strncmp. It's there now, I checked a few days ago, so that's not an issue anymore.
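A sketch of that shape, again against libbpf with vmlinux.h rather than the slide code. User space fills watched_cgroups (an illustrative name) with the cgroup IDs it resolved, and everything else is dropped before the string compare; cgroup v2 is assumed, since bpf_get_current_cgroup_id reports the v2 cgroup:

```c
// jit_resize.bpf.c -- trace "make" only inside containers we were told about.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);
    __type(key, __u64);              /* cgroup ID of a watched container */
    __type(value, __u8);
} watched_cgroups SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} events SEC(".maps");

struct event {
    __u64 cgroup_id;
    char comm[16];
};

SEC("tracepoint/sched/sched_process_exec")
int watch_exec(void *ctx)
{
    struct event e = {};

    /* Filter first: only cgroup IDs user space put into the map. */
    e.cgroup_id = bpf_get_current_cgroup_id();
    if (!bpf_map_lookup_elem(&watched_cgroups, &e.cgroup_id))
        return 0;

    /* Then match the command we care about. */
    bpf_get_current_comm(e.comm, sizeof(e.comm));
    if (bpf_strncmp(e.comm, sizeof("make"), "make") != 0)
        return 0;

    /* Hand off to the user-mode watcher, which resizes the pod. */
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &e, sizeof(e));
    return 0;
}
```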
That brings me to the proposed BPF helper extensions. Given this context, what I think would be a good idea is to add these two helpers. One is bpf_get_container_cgroup_id: it takes a container ID, which appears in the cgroup file system; under /sys/fs/cgroup/cpu you have a path like this, and this is what a Kubernetes container ID looks like. You scan for it and return the inode number, which is your cgroup ID. The other is the inverse operation. I'm not sure whether this could be useful for Aditya's use case that we discussed yesterday.

I could be wrong, but I think the container ID is specific to the runtime. Don't they use all kinds of different notations for this stuff?

Yeah. We have this code in Tetragon; we implement it just because we need it.

Is it the right way to do it, though? I wasn't really sure.

It already works across multiple distributions. It's not necessarily the right way to do it, but as of now it's the only one; it's not great, and it works. The proper way would be this: when you start a container, there is a specification that lets you insert hooks into the runtime. If you can do that, it's better, because whenever a container starts, you get its cgroup ID. In Tetragon we have support for that, but we also keep the scanning approach in case the runtime hook doesn't work for whatever reason.

What runtime are you using? Is this containerd, or...?

The hooks are OCI-specified. I've tested this on containerd, and I think it also works on CRI-O. There are some tricks to make it work on Docker, but yeah. I mean, I don't know if you want to, but you could just use Tetragon with a probe, and then you'd get the namespace filtering as well, which I think will be useful: you may only want to do this in specific namespaces. Kubernetes namespaces, not kernel namespaces, right? There's filtering for that in the latest stuff too. So you could say, a preferred namespace gets more cores, and the least important namespace maybe doesn't. So it's worth looking at; even if you can't use it directly, maybe look at how we do it.

So the short answer seems to be that we should not rely on the cgroupfs inode number; that's not necessarily something that works for every use case or every runtime.

The other thing I would say is that I don't think this belongs in libbpf, just because libbpf doesn't care about container IDs.

I'd also say this is not the only Kubernetes thing you'd need. Does libbpf want to start learning all the other things Kubernetes needs: containers, namespaces, pods, labels? As soon as you start adding this stuff, you end up pulling in a whole set of stuff. And I think libbpf is about loading BPF programs, not about trying to understand the Kubernetes runtime.

Yeah, the maintainers are here, so maybe they'll weigh in. That thought had occurred to me; it's a bit of a tail-wagging-the-dog situation.

It actually looks to me like what you want is BPF-side helpers; you want to do this from the BPF side, the kernel side, right? It's an important distinction: do you want to do this from user space or from the BPF program?

From user space, at this point.

Okay, so your user-space application is calling this to get the cgroup ID.

Yes. When you create the container, the flow is that the node agent, in this case the kubelet, sees the pod spec I showed earlier and calls into the runtime, containerd or CRI-O, to set up the pod sandbox. That's the first step: set up the pod sandbox, CNI ADD, all that stuff happens, and the very first thing created is the infra container. That has a cgroup ID, but it's not the one we want in this case. As we start creating more containers, one of them is the one we care about, the build container in this case, and that's the cgroup ID we're looking to get.

Why would this be in libbpf? I mean, what does that have to do with BPF, right?

Well, it's Linux-specific, I'll give you that. And in the /sys/fs/cgroup file system there's a hierarchy; there are some differences between cgroup v1 and cgroup v2, but the hierarchy is pretty predictable, and you can find the container ID there.

Right, but you could just do that in another library that searches the cgroup tree. I'm genuinely asking: what does it have to do with BPF? This seems like a cgroup thing, not a BPF thing.

Yes, it is a cgroup thing. What we're looking to do is get the, okay, I'm a little lost now.

So are these two separate things? The BPF portion of it is a consumer. The BPF code is a consumer, but to feed that BPF code we need the cgroup ID.

So you need to get a cgroup ID, and your BPF program needs to see it, is what you're saying. Correct, yes. So you could just set a variable in your program: open the program, look up the ID in user space, set that variable. libbpf will do the relocations for you.

How do you look up the ID, though? Given a container, in this case some path under /sys/fs/cgroup, is there another existing way of getting the cgroup ID just from that?

You can get that. We have something similar in Cilium. If I understand your use case correctly, you can get the pod's cgroup path, and to convert that path to a cgroup ID you can use the name_to_handle_at() syscall. That gets you the cgroup ID in user space. On the BPF side there are a couple of helpers available; I can give you pointers. That way you can share the cgroup ID context between user space and the BPF data path.

Does that take care of both cgroup v1 and v2, or is it v2-only?

The cgroup ID is specific to cgroup v2; for cgroup v1 you'd probably have to use the class ID or something.
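A rough sketch of that path-to-ID conversion in user space, for cgroup v2 only. The assumption, which I believe matches how systemd and others do it, is that for cgroupfs the file handle returned by name_to_handle_at() carries the 64-bit cgroup ID; the function name is illustrative:

```c
// cgroup_id_from_path.c -- sketch of the name_to_handle_at() trick (cgroup v2).
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>

int cgroup_id_from_path(const char *path, uint64_t *ret_id)
{
    union {
        struct file_handle fh;
        char storage[sizeof(struct file_handle) + sizeof(uint64_t)];
    } h;
    int mount_id;

    memset(&h, 0, sizeof(h));
    h.fh.handle_bytes = sizeof(uint64_t);   /* cgroupfs handles are 8 bytes */

    if (name_to_handle_at(AT_FDCWD, path, &h.fh, &mount_id, 0) < 0)
        return -errno;

    /* For cgroupfs, the handle payload is the 64-bit cgroup ID. */
    memcpy(ret_id, h.fh.f_handle, sizeof(uint64_t));
    return 0;
}
```

The result would then be fed into a BPF map like the one in the earlier sketch, so the program and user space agree on which cgroup to watch.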
Right. So that's the motivation: have one place to get this, and not make every user worry about it. It's not a big deal, but it would be nice to have, in my opinion.

Yeah, it certainly seems useful. It just seems like what you really need is a way to map an ID from user space into the BPF program. I just don't think this is an abstraction that belongs in libbpf.

Yeah, it's definitely useful code; we have exactly the same code in Tetragon for exactly this reason. But I wouldn't expect libbpf to care about it.

Okay, I guess libbpf is not the best place to do it.

The other thing is, as soon as you run this for a while, you're going to go: oh, it's racy, I'm missing a few makes here and there, and then you're going to end up hooking into OCI, right? That's exactly the path Tetragon took. We did this, tried it for a while, it was racy, we missed some things, and we said, okay, we really need to be in the OCI path. And libbpf probably does not want to be in the OCI path at all.

Okay, I think I see.

So if you're writing this in Go, you can just import whatever helper we have in Tetragon, if that helps.

Is there another place where you're already doing this, since it seems to be pretty common code across potential use cases? We have at least two that I'm aware of.

I think we do it in a couple of places, but if you're willing to use Go, which is the language Tetragon is written in, it's just a matter of importing the function; it shouldn't be a problem. And if it's not public, we'll make it public, no problem.

Okay, sounds good. Well, there were just a couple of other use cases I was wondering were worth bringing up as justification, but I think we've already had our Q&A. Thank you very much for all the solutions and input; I'll have to connect with a few of you folks to get more info on this. Okay, thank you. Thank you.