Hello everyone, thanks for coming. My name is Pablo. I'm one of the founders and the CTO at Okteto. Okteto is a development platform to deploy remote development environments on Kubernetes, and we will talk about that a little bit later.

And hi, I'm Vinay. I work for Futurewei Technologies; it's the research arm of the parent company Huawei. I've been working on Kubernetes; my interests are in Kubernetes networking and compute, and lately eBPF. We hope to have a good talk with you today. Thank you.

Cool. So this is the agenda of the talk. We are going to introduce the idea of cloud native development environments, the problems that they solve, and also some of their challenges. One of them is making them cost effective, so we will analyze the different challenges there, and then we will introduce a feature coming soon: in-place pod resize. The idea with this feature is that you can modify the requests and the limits of a running pod without restarting the pod. After that, we will see a demo using in-place pod resize and eBPF to optimize the infrastructure utilization of cloud native development environments, and we will finish with some takeaways.

Okay, so let's talk about cloud native development environments, and first, let's look at the current state of the art. Most companies are moving to Kubernetes and microservices, and for the right reasons: they solve many problems in production environments. But they also come with some challenges, and one of them is that they make it harder to mimic your production environment in your local dev environment. Even with tools like Docker and local Kubernetes distributions like minikube, you need to install all this software and run all these microservices. So there are several issues there. One of them is that you may run out of CPU and memory on your laptop, so things go very slow or even stop working. You need to maintain local configurations, and if there is something wrong in your dev environment, it's very difficult to replicate the same problem on another laptop, and it's very hard for anyone else to troubleshoot what the issue is.
And if you are using, for example, multiple repos, it's not trivial to orchestrate the build, push, and deploy of all these containers running locally. So what most people do is accept an environment disparity between dev environments and production, and I think this is wrong, because in the end what you are doing is shifting your testing efforts to the right. Ideally, a developer should be able to deploy a dev environment and test all the changes they are making end-to-end before even sending a pull request. But if you are not able to do that, you need to send a PR and wait for the continuous integration job to validate your changes, or even worse, you need to merge and wait for a staging or integration environment to be created, and then do the final end-to-end test there. And if there is an issue, you need to start again: work on your local dev environment, open a PR, and all those things. All this cycle reduces developer productivity a lot, and also developer happiness.

So the solution that we propose is cloud native development environments, and this is a high-level view of the methodology. The idea is to have a single Kubernetes cluster shared by all the developers, and every developer works in a different namespace. You can have as much isolation as you need between namespaces using the standard Kubernetes objects. The idea is that in each namespace the developer is able to deploy a full replica of the application, which is much more realistic because this is running in Kubernetes with your Helm charts, using your network configuration and security policies, the same things that you do in production. And it's fully replicable because it doesn't depend on your local configuration: you should be able to deploy any commit from any git repository, and everything happens in the cluster, in the cloud. So it doesn't depend on your local configuration, and anyone on your team can go there and check it out, because it's available for everyone.

In order to adopt this methodology, it's very important that for the developer the dev experience is the same, and to do that there are several open source projects like Okteto, Telepresence, Tilt, Garden, and many more. More or less, the goal of these dev tools is to provide this experience: the developer keeps working locally in their IDE, the application is hot-reloaded on the remote cluster immediately, and you can even configure your debugger, set breakpoints, and all those things. That's key for developer adoption: if you need to change the developer workflow, people are not going to adopt it, but with these tools you can have realistic, replicable, fast-iteration remote dev environments.

So that's the solution, and now I'm going to talk about one of the problems with this approach, which is that you need to run all these environments in the cloud, and that could have an impact on your cloud bill, which is scary. But the good news is that Kubernetes is very good at optimizing resource allocation.
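Concretely, that optimization is driven by per-container requests and limits. Below is a minimal, hypothetical sketch (the pod name, image, and values are illustrative, not from the talk) of a burstable dev pod created with the Python Kubernetes client: the scheduler reserves only the small requests, while the limits let builds burst into whatever the node has free.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

# A burstable pod: the scheduler reserves the (low) requests, but the
# container may use idle node capacity up to the (higher) limits.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "build-env", "namespace": "dev-pablo"},
    "spec": {
        "containers": [{
            "name": "build",
            "image": "golang:1.19",
            "command": ["sleep", "infinity"],
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "2", "memory": "2Gi"},
            },
        }]
    },
}
client.CoreV1Api().create_namespaced_pod(namespace="dev-pablo", body=pod)
```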
With that model, all the containers running on the same node can share the CPU and memory available on the node. So, for example, if you are building your application and you need a spike of CPU and memory, the memory and CPU of the node are available for your container, and when you are done, the same CPU and memory are available for other containers. That is very helpful to optimize your utilization, and also, if you need more nodes in your cluster, the cluster autoscaler will scale up and down as needed, automatically.

So that's good news, but let's analyze different use cases for development environments to see if this is really helpful in our use case. This is a workload in production, and it's using two CPUs more or less all the time, so we are able to set the CPU request to two CPUs, and in this case it works very well. If there are more incoming requests, the deployment will scale horizontally, so that's good.

But development is a little bit different. What usually happens is that when the application is booting, you need more CPU and memory to bootstrap your application, and then, after a while, you need less; in development you don't usually have that many incoming requests, right? So after a while you need less CPU and memory to keep your application running. For this case there is already the Vertical Pod Autoscaler in the Kubernetes community. The idea is that the Vertical Pod Autoscaler monitors your container's CPU and memory usage, and then it updates the requests and limits of your containers based on their real usage. Here I'm assuming that in-place pod resize is available, because otherwise, when the requests are updated, the container restarts and you would pay the booting CPU cost again. But let's assume that in-place pod resize is available.

Even with that, when you are developing there are random spikes, and they tend to be short in time. For example, you are working on your application, editing your code; you don't need much CPU to do that. But then you need to build your application with, for example, the make command, and for the make command you need more CPU. The Vertical Pod Autoscaler is reactive, so there is a delay between when you start using more CPU and when the Vertical Pod Autoscaler updates the requests of the pod. So in this case, I don't have enough requests here to run my make command; then the requests are updated, but I'm not using that CPU anymore; then the Vertical Pod Autoscaler updates the requests again, and I don't have two CPUs available for the make command. There is a delay, so it's not really working for this scenario. Ideally, we would like something proactive, something that updates the CPU requests of my container in real time, in milliseconds basically.

In our case, when we started Okteto, we basically ran a dedicated cluster for every customer, and when we started, we were using standard Kubernetes requests and limits. Using that, we were able to run eight pods per node, which is not great, because we were using VMs with four CPUs and 32 gigabytes of memory and we were not utilizing all the resources. So what we built is an ad hoc solution for this, basically a custom scheduler, and with that solution, which I don't have time to talk about here, we were able to run 80 pods per node without affecting the developer experience. This is huge: it's a 10x infra saving.
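For reference, the reactive VPA approach mentioned a moment ago is configured with a VerticalPodAutoscaler object. A minimal sketch (assuming the VPA CRD and its controllers are installed in the cluster, and targeting the hypothetical build-env deployment from above):

```python
from kubernetes import client, config

config.load_kube_config()

# VPA is a CRD, so we go through the CustomObjectsApi rather than CoreV1Api.
vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "build-env-vpa", "namespace": "dev-pablo"},
    "spec": {
        # Hypothetical target: a deployment running the dev environment.
        "targetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "build-env",
        },
        # "Auto" lets the VPA apply its recommendations itself; note that
        # without in-place pod resize this means evicting and recreating pods.
        "updatePolicy": {"updateMode": "Auto"},
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="dev-pablo",
    plural="verticalpodautoscalers",
    body=vpa,
)
```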
And there is a lot of potential to solve this problem properly. Our ad hoc solution is not ideal: it requires a lot of effort to maintain and to configure for different customers, but it shows us the potential of solving this problem. So the question is whether there is a better way to solve it, and with that in mind, I'm going to hand over to Vinay to talk about in-place pod resize and eBPF.

Thank you, Pablo, that was great. So, in-place pod resize. Can I get a show of hands: how many people here are aware that this feature is in the works? A few, okay. Then those few probably know that this PR is moving at light speed. Okay, well, not exactly light speed; we're being extra cautious, and there are good reasons for that. The PR is big, it touches critical components in Kubernetes, and mistakes can be costly, so it's imperative that we stage it in a responsible way. What really matters is that we get across that finish line, hopefully in 1.26, I don't know. We just released containerd 1.6.9, which is needed for this, so this PR could merge any day now in the next few years. If you guys are really nice to me, I will leave this PR to you in my will.

Okay, enough grief for my PR. Let's take a look at what really changed. The first thing we did was make the resources field in the container spec mutable for CPU and memory. What that lets you do is send a patch to your pod spec saying, "I started out with one CPU, but I want two." You're expressing desired resources for the pod, and then Kubernetes goes to work doing what it does best, which is, you know, getting the actual state equal to the desired state.

With the ability to specify a desired state of resources, we need a way to signal to the user who made the request what the status of the request is, and for that we introduced a new field called resize in the pod status. This status holds one of four values when you have a pending resize request. The interesting one is InProgress, which should be the default case: it means that the kubelet was able to allocate the memory or CPU that you wanted, and it's working with the runtime to make it happen. Proposed is the initial state, where the API server looks at your request and makes sure it's valid, that your requests are not exceeding limits and things like that. Infeasible covers what the user may not know: there is a node that has four CPUs, and they ask for five for their pod. It's never going to happen, so that's a signal that you may need to evict your pod and ask the scheduler to schedule a new instance on another node where you can get five CPUs. And Deferred is the case where the node has six CPUs, but another pod is using two of them: it's possible, just not now. It's your choice whether you want to wait, or evict and get five CPUs elsewhere.

We added a couple more fields. The resources allocated field in the container status is a persistent record of what the in-progress resize is driving towards. And the resources field in the container status (I know, there are a lot of these fields in here) tells you the actual state as reported to us by the runtime, so that's what is actually configured on your containers.
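To see the mutable resources field and the new status fields in action, here is a minimal sketch using the Python Kubernetes client. The pod and container names are hypothetical, the cluster is assumed to have the in-place pod resize feature gate enabled, and, depending on your client version, the new status fields may only be visible in the raw API response rather than in the generated models.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Express the desired resources: bump the "build" container from one CPU
# to two. With the feature gate enabled, the API server accepts this patch
# on a running pod instead of rejecting the mutation.
patch = {
    "spec": {
        "containers": [{
            "name": "build",  # strategic merge patch keys containers by name
            "resources": {
                "requests": {"cpu": "2"},
                "limits": {"cpu": "2"},
            },
        }]
    }
}
core.patch_namespaced_pod(name="build-env", namespace="dev-pablo", body=patch)

# Poll the pod: per the KEP, status carries a resize field
# (Proposed / InProgress / Deferred / Infeasible) while the kubelet works,
# plus the allocated and actual resources per container.
pod = core.read_namespaced_pod(name="build-env", namespace="dev-pablo")
print(pod.status.to_dict().get("resize"))
```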
Lastly, we have a new field called resizePolicy. This was introduced for one reason: there are some legacy applications, like Java applications using the -Xmx flag, that are not able to take advantage of increased pod capacity without restarting. It's just what those applications need, so we want to give users a way to specify, "Hey, my application needs to be restarted." The default is that a restart is not required, where we will try to resize your pod without restarting. It's not a guarantee, but Kubernetes will try its best.

So what does it really involve? I touched upon this earlier: we have changes to the API server, kubelet, scheduler, and runtime. There is a lot to go into here, but I will focus on one thing, the kubelet. Kubelet admission of a resize request is interesting here, because this PR introduces a new race condition where the resize request is racing with a pod that may just have been scheduled to that node. The kubelet is the ultimate authority, and it's a gatekeeper: it checks to make sure that at any point in time the requested resources do not exceed what's available. So one of the two requests might fail if there is contention.

One of the best ways to really see how this works is to go through it step by step, so let's do that now. Consider this example: we have a node with a pod that has 40m CPU allocated to it, and now you desire to give it 80m CPU. It starts with a patch to the pod spec. The API server validates the patch and then updates the object store, in this case etcd. Next, a watch is triggered for the scheduler and the kubelet. The scheduler looks at that watch, updates its pod cache, and uses the max of desired and actual resources, so that it doesn't compete with a resize that's in progress. In this case, let's assume that the kubelet was able to successfully allocate the requested resources, the 80m CPU: it runs pod resize admission, it succeeds, and it immediately patches the pod status with the resize field saying, "I was able to give you 80m CPU, the status of your resize is InProgress, and I'm going to be working with the runtime to make it happen." Now, in this case we are increasing the CPU, so the pod-level cgroup settings are set first, and then the kubelet talks to containerd, the runtime, via the UpdateContainerResources CRI API, asking containerd to allocate 80m CPU for the pod's containers. Containerd goes to work and updates the configuration for the pod, and the next time a ContainerStatus CRI API call comes in, it reports back to the kubelet saying, "Yes, your pod's containers have 80m CPU." That triggers the generation of a status update for the pod, where the kubelet patches the pod status saying the resize is now complete, the pod has 80m CPU, and we have come full circle. So that is the happy, golden path.

Now, there is a nuance to this that I want to touch upon. Your pod may have more than one container, and you may be resizing more than one container at the same time; you may be giving more resources to some and taking resources away from other containers. In all these cases, we order the resize such that the container decreases are done before the increases are invoked, and if there is a net increase to the pod cgroup values due to the resize, then the pod cgroups are updated first, and then the containers are resized.
Why is this important? Let's take a scenario: you have a system with two gigs of memory available to the pods, and you have one single pod running on that system. It has two containers, and they take one gig each. We desire to give C1 1.5 gigs and cut memory for C2 to make it 0.5 gigs. Now, the pod hasn't changed; the pod's total resources are the same. But if we did C1 and then C2, we would be oversubscribed, the request would fail, and the pod would end up in a bad state. It's for this reason that we want the runtimes to not only support the UpdateContainerResources CRI API, but to do so in a synchronous and transactional manner; if we don't do this, then the downstream request to update the cgroups might fail, and the pod will be in a bad state. And when I say synchronous and transactional, what I mean is: don't queue a task and say, "okay, I will apply this later." We need that yea or nay, whether it succeeded or it failed, and if it failed, the reason for the failure, in the context of that UpdateContainerResources CRI API call. One other thing that got introduced in the ContainerStatus CRI API is a resources field. This is what allows the container runtime to tell the kubelet what resources are actually configured on the pod. I guess you're starting to see why this feature is a little complex, right?

And lastly, there is cgroup v2. It's here now; more and more OSes are shifting to cgroup v2, and there are a bunch of desirable features in it. For us in particular, we're interested in the ability to specify memory requests at the container level; we did not have that in cgroup v1. There is another talk, I believe it's on Friday, by David Porter and Rinal Patel about cgroup v2. I highly recommend attending that if you can.

Okay, so this is the fun stuff: what can we do with eBPF for in-place pod resize? As Pablo has laid out, we have a problem here. We have the use case with the remote dev environments, where there is a workload with spikes, and the reactive approach that we have with VPA is not good enough. So how can eBPF help? Let's take a look. Consider this pod, called the kube-build pod, with one container, the kube-build container. As you might guess, we use this pod to exec in, edit code, and build code. It's a pretty good example that mimics the remote dev environment, except that in remote dev you might do an rsync of local code to remote. What's interesting here is the resources we have requested: we're requesting CPU that is good enough to, you know, edit code and build, but we have curated and cherry-picked a value of 50 megs for memory. That is sufficient to edit code, at least with a lightweight editor like vi, but not enough to build. If you try to run the make command with this much memory, the OOM killer will come along and take care of things: it's going to kill your process, or it's going to run really slowly.

So what can eBPF do? Well, let's take a look. What you're looking at here is 20-odd lines of Python code. That's all we need; that's it. You go to the node on which the pod is running and tell Python, "hey, run this for me," and you're set. Let's look at it more closely; there are two parts to it.
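A rough reconstruction of what those 20-odd lines could look like, using the BCC Python bindings and a hypothetical resize_pod() helper (the actual slide code may differ in detail):

```python
from bcc import BPF

# eBPF side: trace every execve() on the node and print the program path.
# BCC rewrites functions named syscall__* to cope with kernel syscall wrappers.
prog = r"""
#include <uapi/linux/ptrace.h>

int syscall__execve(struct pt_regs *ctx, const char __user *filename) {
    char path[64] = {};
    // bpf_probe_read_str() on older kernels/BCC versions.
    bpf_probe_read_user_str(&path, sizeof(path), filename);
    bpf_trace_printk("exec: %s\n", path);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="syscall__execve")

def resize_pod(memory):
    """Hypothetical helper: patch the pod's container resources via the
    Kubernetes API, as in the earlier patch_namespaced_pod sketch."""
    ...

# User-space side: watch the kernel trace pipe and react when we see make.
while True:
    try:
        _, _, _, _, _, msg = b.trace_fields()
    except ValueError:
        continue
    if b"make" in msg:
        resize_pod("5Gi")
```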
The first part is the eBPF code itself. This code attaches to the execve system call; that's the system call through which all the commands executed on the system go. You do ls, it goes through execve; you do make, it goes through execve. And the second part is: when this eBPF program sees those commands, it traces them to a trace file, and we watch the trace file, and if we see make in the trace file, then we resize our pod to 5 gigs. That's it. It's very simple, isn't it? Well, not so much. The only good thing about this code is that it fits on one PowerPoint slide. It's very inefficient: everything that you're doing on the system is getting traced through the trace file. I actually tried it; it's really slow. And it's of limited use: you're in your container, happily writing code, somebody else does a make in another container, and your pod gets resized. That's not great, right?

So let's do something a little less dumb for the demo. eBPF also offers this facility of maps, which is a way of talking to the eBPF program and making it more configurable than this. What we're going to do is use the BPF maps to tell the eBPF program to focus only on specific containers, and within those containers, on specific commands. How do we do that? Containers have a cgroup ID, so we specify the cgroup ID as the key in the eBPF maps, and in the value we specify the list of commands that we're interested in tracing. This lets us trace only the commands from the containers that we are interested in.

That's good, but how does the user tell this to us? Well, we resort to the good old annotations. We've defined an annotation, called ebpf-resize, which contains the container name that we're interested in, the commands that we're interested in (in this case, make), and, of course, what you want to resize to. With these three things, the pod watcher thread or process can configure the BPF maps, and the program can trace exactly what we need. With that, all we need to do is initiate the resize when we see the trace. And of course, we are lazy Kubernetes people; we're not going to go to each node and say, "hey Python, run this code; hey Python, run this code." We're going to ask Kubernetes to do it: we hand it to a DaemonSet, and Kubernetes does the hard work for us. I have the code up on GitHub. It's demo code only, so don't run it in production, but you can take a look at it and see how this works. It's a prototype.

That brings us to showtime. Okay, I'm going to switch to the screen here. Can everyone see the terminal screens? Okay, all the way in the back; this is the hardest part to get right. Okay, great. What you're looking at here: all these terminals are SSHed into a local VM called ebpf-resize, and we're running a local Kubernetes cluster here with the in-place pod resize feature gate enabled. That means we can resize our pods without restarting the pods or containers. The first step is, of course, to deploy the build pod, and I'm doing that here with this YAML. It's going to take a moment; there it is. So we have the kube-build pod running here. We can exec into this pod, edit code, and build code, and we have this window up here that shows the stats.
It's using, what, 600 K; it's using under a meg of memory. And then we can do one more thing here: we get the kube-build pod in JSON format and query the status's resources allocated, the new field that we introduced for this feature. It tells us what the kubelet has allocated for your pod and its containers; in this case we are focusing on the build container, container zero.

Now, let's edit some code. I'm going to exec into the build pod and ls: this is the Kubernetes 1.25 release branch that I've got in the pod for this demo. git status works, and if you're looking at the memory values, we are at 42 megs; we have 50, so we're under that. We use a lightweight editor like vim and edit some code. So, this does not look right: "Kubernetes, also known as k8s." We need to fix this, so let's fix it. All right, that looks much better, doesn't it? Okay, so the memory usage is 30 megs; we're good. So what happens when we try to build this? Let's try; there's one way to find out. So it's going to run really... oh, it got killed. Well, that was fast. This is expected, right? We have 50 megs, make is trying to allocate more than 50, and we just don't have it. The OOM killer came along and pretty much told us to go kick rocks.

So what can we do? We have choices; we have options. We can schedule the pod with 5 gigs and you won't have to worry about it, though of course your bank account won't be happy. Or you could rely on VPA: when VPA sees these OOM, out-of-memory, events, it will resize the pod for you, but it's not a great developer experience. Fortunately, there is another way; let's take a look at what eBPF can do for us.

bpftool map list just shows the map entries; I'm going to use this later. But first is to deploy the DaemonSet, so with this command I'm deploying the DaemonSet. It's going to show up here; it takes a moment to initialize. Now it's running. We can tell that the program is up and running by using bpftool to list the map entries, so I'm going to run that again, and there is a map entry here now. We can get into the details by doing a dump of this map entry, 67. What we see here is a key, 14752: that is the cgroup ID of the container in which I just executed the make command that, you know, got killed by the OOM killer. And it's telling the BPF program: "hey, if you see this container execute a make command, then please trace it."

So now let's try this again; let's see if it works. Please keep an eye on this window, the top right corner window with the 15-meg value; that's what is being watched as I hit the make command. Here we go. So we're at 5 gigs, and the container seems to be happy: it's allocating whatever it needs, and the make is making much better progress than it did before. So that, my friends, is the magic of eBPF for you. Okay, let me take a moment to thank the demo gods here for saving the surprises for another day.
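Behind that demo, the watcher that the DaemonSet runs on each node has to translate the pod annotation into BPF map entries keyed by cgroup ID. A hedged sketch of that glue, with hypothetical annotation format, helper names, and map layout (the real demo code on GitHub differs in detail):

```python
import ctypes
import os

from kubernetes import client, config, watch

ANNOTATION = "ebpf-resize"  # hypothetical key, as described in the talk

def cgroup_id(cgroup_path):
    # On cgroup v2, the ID seen by eBPF helpers like
    # bpf_get_current_cgroup_id() is the inode number of the cgroup directory.
    return os.stat(cgroup_path).st_ino

def watch_pods(bpf_map, find_cgroup_path):
    """Watch pods and program the BPF hash map (cgroup ID -> commands).

    bpf_map is the BCC table for a BPF_HASH declared in the eBPF program;
    find_cgroup_path is a node-specific helper (hypothetical here) that
    resolves a container's cgroup directory from the pod status.
    """
    config.load_incluster_config()  # the watcher runs inside the DaemonSet pod
    core = client.CoreV1Api()
    for event in watch.Watch().stream(core.list_pod_for_all_namespaces):
        pod = event["object"]
        ann = (pod.metadata.annotations or {}).get(ANNOTATION)
        if not ann:
            continue
        # The annotation carries container name, commands, and target size;
        # its exact format is the demo's own, so we pass it through as-is.
        path = find_cgroup_path(pod, ann)
        bpf_map[ctypes.c_uint64(cgroup_id(path))] = ctypes.create_string_buffer(
            ann.encode(), 64)  # the value layout is hypothetical too
```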
So, what did we see today? We saw this use case, as Pablo described, where make spikes happen in the development environment; they need a much more responsive resize, and that's not always possible with what we currently have. And we saw that eBPF programs can help here, by almost instantaneously resizing the pod to your needs.

So, to recap. We feel that cloud native development environments are the future, because they're cost-effective, they give you a production-like environment, and all teams work on the same config, which saves on testing costs and production issues. We want the runtimes to be able to support in-place resize: if you're working with a runtime, or if you're a maintainer, please consider adding support for in-place resize to your runtime. We don't want the user to have to worry about whether this works with their runtime or not; we want it to just work. It shouldn't even be something that people think about; that's where we want to be. And try this feature out: when it makes it in, please turn on the feature gate, hammer on it, beat up on the feature. We want to find those corner-case issues; it's important to handle the use cases, but it's also important to gracefully handle the abuse cases. And lastly, try out eBPF. Look, I'm no eBPF expert, not even close, but we managed to cobble together this little improvement in a short time, and this improvement got us from being reactive to being proactive for this use case. It took us from, you know, tens or hundreds of seconds of response time to sub-second response time, and that little improvement is actually a paradigm shift. It's great when small improvements lead to amplified gains. So yeah, try out eBPF; it just might solve problems for you.

We have one more slide to go, and it's the one that everyone is waiting for; the title says it all. Please scan this QR code. It will take you to a place where you can leave feedback for us. Please tell us what you found useful and, more importantly, what we could do better; we would love to hear from you, we love the feedback. Thank you for being here, and thank you for being such a great audience. It's been a pleasure. With that, I will conclude this talk.