Yeah, so hello everyone, and thanks for coming to the talk and to the conference. I'm Martynas. We are both long-time Cilium contributors, and besides being passionate about Cilium and eBPF, we are also passionate about skiing. Today's talk is about some changes to Cilium's control plane, also known as the Cilium agent, and in particular about improving the resilience and fault tolerance of the agent, which is one of the most critical components in your Kubernetes clusters.

Let's start with the demo. We have a Kubernetes cluster running some client and some server pods, all of them communicating through a ClusterIP, and all of a sudden you get paged: something is broken, some apps are not working. You start troubleshooting, and I highly suggest you don't take the bottom-up approach, where you start disassembling BPF programs or dumping BPF maps, but start with something more reasonable, a top-down approach, where you first look at the Cilium dashboard. In this dashboard we see that some drops are happening, and they are happening because of "service backend not found", so something related to service load balancing. If we scroll down, we see the metric going up, and we see some agents reporting their state as degraded, so something is wrong. If we keep going through the dashboard, we see that some of the BPF maps responsible for services got full. This is quite a typical scenario; we see it happening with policy maps, for instance, and sometimes with service maps. The component responsible for writing to those BPF maps then starts to report errors: we see the error counter going up over time, and we see that there were recently lots of update operations, visible in the operation duration metric for service updates. Next, we use a bit more low-level tooling.
We dive into one of the Cilium nodes and dump the status, the health information, and we see that one reconciler, the one responsible for service map updates, is reporting a failure. Then let's look into the Cilium state. We'll be talking a lot in this talk about a new component called StateDB, and here we can see that the desired state is not being realized for the service maps: the BPF map is clearly full. This gives us a strong hint that we have too many services in the cluster, so either we need to increase the BPF map size or delete some services. For the demo, for simplicity, let's delete some services (it's pre-recorded, because a lot of typing is needed). Then we go back to the dashboard and, after waiting a bit, we see that the agents are back in an OK state, the maps are no longer full, and the drops are decreasing. The cluster is getting back to a healthy state.

Okay, let's go back to the talk. Before we dive into the details, a pro tip for the next speaker: charge your laptop. So, before we dive into how this was implemented, some historical context. Cilium started as a simple CNI with observability and security in mind, and one differentiating technology was BPF. With BPF we were very flexible in what we could do: we can take basically any network packet and apply any modification we want. After solving the CNI problems, we started looking into generic Kubernetes networking problems, and you probably remember this picture from a long time ago. What happened next is, basically, imagine a bunch of very curious and motivated engineers; you give them this powerful tool, BPF, and you also give them very interesting problems in the networking space. Of course, all of those problems eventually get solved. That's what happened with Cilium, and I would claim that the majority of
Kubernetes networking problems are solved today by Cilium. And the nice thing about it is that in your cluster you can just run a single component; you don't have multiple different components and the burden of making sure they interact nicely with each other. The downside is that the project became quite complex. But if you look under the hood of Cilium, the workflow is fairly simple. We have this agent running on some host, getting updates from the Kubernetes API server and from some other components like Cilium's cluster mesh, parsing those updates, and then writing to local BPF maps which are used by Cilium's BPF data plane. And of course many things can go wrong: we might lose connectivity to the API server, the API server might get overloaded, or even the syscalls to the BPF infrastructure might fail, for instance when the host runs out of memory for a short period of time. So I'll hand the microphone to Jussi, who will talk about how we deal with those failures.

All right, so we're zooming into the bottom-most picture here: how does the agent manage the reconciliation into the BPF maps? Let's take a simple example of the Kubernetes API server giving us an event about a Kubernetes service being created, which we then want to write into the services BPF map. One thing the agent could do here is just perform the map write directly in the event handler. But this doesn't really work out, since the map might be full and we might not be able to do the operation right then and there; we might need to retry it in the future. So we need something in between: we need to store the event in some data store and then handle the reconciliation separately. If we look at the architectural picture, this desired state divides things in two. On top we have a control plane which receives events from different sources: the Kubernetes API server, the cluster mesh API server, the REST API, and so on.
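The split just described (store the event as desired state, then reconcile it into the BPF map separately, with retries) can be sketched in a few lines of Go. Everything here, from `fakeBPFMap` to the `reconcile` helper, is illustrative rather than Cilium's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// desiredState is what the control plane computed and wants realized.
type desiredState struct {
	services map[string]string // service name -> backend address
}

// fakeBPFMap stands in for a kernel BPF map; updates can fail, e.g. when full.
type fakeBPFMap struct {
	entries  map[string]string
	failures int // number of updates that will still fail
}

func (m *fakeBPFMap) Update(key, value string) error {
	if m.failures > 0 {
		m.failures--
		return errors.New("update failed: map full")
	}
	m.entries[key] = value
	return nil
}

// reconcile tries to realize the desired state and returns the keys that
// still need a retry; a real reconciler would requeue these with backoff.
func reconcile(d *desiredState, m *fakeBPFMap) (retry []string) {
	for name, addr := range d.services {
		if err := m.Update(name, addr); err != nil {
			retry = append(retry, name)
		}
	}
	return retry
}

func main() {
	d := &desiredState{services: map[string]string{"frontend": "10.0.0.1"}}
	m := &fakeBPFMap{entries: map[string]string{}, failures: 1}

	fmt.Println("first pass, needs retry:", reconcile(d, m))  // the map was "full"
	fmt.Println("second pass, needs retry:", reconcile(d, m)) // succeeds
	fmt.Println("realized:", m.entries)
}
```

The point of the indirection is exactly what the talk describes: the event handler only updates the desired state, and a separate loop owns the fallible system calls.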
The control plane digests the events and computes the desired state. And then on the bottom we have the part that applies the desired state to the system. Many parts of the agent have been written in this pattern, and in many different ways. So what we've been exploring is: how do we unify this, and how do we get observability out of it? How do we know things are failing? How do we extract health? We're going to talk about how to implement the desired state in a way that gives us good tools for inspecting it. And later we'll talk about the reconciler; having a single implementation for it lets us extract health and metrics from it and get a better picture of what's going on in the system.

So jumping into the desired state: how would we implement this? It's an in-memory database. Instead of the traditional approach of writing your hash maps, locks, and subscriber patterns from scratch, we go for a database approach that does these things for you. It's transactional, which allows doing writes in a single transaction that can then be observed as a batch by the reader, improving throughput. Everything stored in StateDB lives in immutable trees, which allows for lock-less readers. So as we add more and more code on top of this, we're not worried about readers holding locks for too long, or doing crazy things in callbacks and stopping progress from happening. Everything is revisioned, which allows us to query things as they change over time. And there's a channel-based notification mechanism, which tells us when something changes and allows us to wake up the reader to do more work. Internally, it looks like this: you have a radix tree, so things are prefix-indexed. When you search for something, you do a byte-wise lookup, and on modifications you clone the changed part of the tree.
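A toy illustration of this copy-on-write behavior, using a copied Go map in place of the real immutable radix tree (all names here are made up for the sketch):

```go
package main

import "fmt"

// snapshot is an immutable version of a table: every write copies, so a
// reader holding an old snapshot is never affected by later writers.
// StateDB uses an immutable radix tree for this; a copied map is enough
// to show the idea.
type snapshot struct {
	revision uint64
	data     map[string]string
}

// insert returns a new snapshot with the key set; the receiver is untouched.
func (s *snapshot) insert(key, value string) *snapshot {
	next := &snapshot{revision: s.revision + 1, data: make(map[string]string, len(s.data)+1)}
	for k, v := range s.data {
		next.data[k] = v
	}
	next.data[key] = value
	return next
}

func main() {
	v1 := &snapshot{revision: 1, data: map[string]string{"svc-a": "10.0.0.1"}}
	reader := v1                         // a reader takes a snapshot...
	v2 := v1.insert("svc-b", "10.0.0.2") // ...while a writer inserts

	fmt.Println("reader sees", len(reader.data), "entries at revision", reader.revision)
	fmt.Println("writer sees", len(v2.data), "entries at revision", v2.revision)
}
```

A radix tree makes the copy cheap by sharing the unchanged subtrees between versions, which a plain map copy obviously does not.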
So when you're inserting something, existing readers are never affected. I'm going to give a quick example of what the code looks like to use this from Go. To create a table, you first define your data type, and this can be anything from simple structs to something more complicated. Then you define your index. You define two methods telling StateDB how to go from your custom data type to the internal database key, and keys here are essentially just byte slices. The second thing you define is how to go from a search key, which again can be any data type you want, any type of query you want, into a database key. Finally, you can create a table, giving it a name, the primary index, and optionally any number of secondary indices. With the table created, we can do queries. For example, if we have a table of services and we've indexed it by namespace, we can do a query using the namespace index and get all services in the default namespace. We get back an iterator (which hopefully soon, with the range-over-func proposal, gets even nicer), and for each result we get our service object and a revision, so we can use the revision to see if something changed. We also get a watch channel which, when closed, tells us that the query we did has been invalidated: something changed in this index and we need to query again. With that, we have a common API for building all sorts of useful tools that are not dependent on the data type itself, and one of them we already saw in the demo: the reconciler.

So I'm going to jump into the reconciliation topic now. The reconciler is a reusable utility: you point it at a StateDB table and you tell it what to do with the objects. For example, if we have a backend object, we need to define the update, delete, and prune operations for it, for the reconciler to work.
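Before going further into the reconciler, the table-and-index idea above can be sketched in self-contained Go. The `index` type and `list` helper below are simplified stand-ins, not the real `cilium/statedb` API:

```go
package main

import "fmt"

// Service is the custom data type stored in the table.
type Service struct {
	Name      string
	Namespace string
}

// index knows how to turn an object, or a search key, into the internal
// database key, which is just a byte slice.
type index[Obj any, Key any] struct {
	fromObject func(Obj) []byte
	fromKey    func(Key) []byte
}

// byNamespace is a secondary index over the Namespace field.
var byNamespace = index[Service, string]{
	fromObject: func(s Service) []byte { return []byte(s.Namespace) },
	fromKey:    func(ns string) []byte { return []byte(ns) },
}

// list returns every object whose index key matches the query key. The real
// thing walks a radix tree and also hands back a revision and a watch channel.
func list[Obj any, Key any](objs []Obj, idx index[Obj, Key], key Key) []Obj {
	var out []Obj
	want := string(idx.fromKey(key))
	for _, o := range objs {
		if string(idx.fromObject(o)) == want {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	table := []Service{
		{Name: "frontend", Namespace: "default"},
		{Name: "backend", Namespace: "default"},
		{Name: "metrics", Namespace: "kube-system"},
	}
	for _, s := range list(table, byNamespace, "default") {
		fmt.Println(s.Name)
	}
}
```

Because everything bottoms out in byte-slice keys, tools like the reconciler can be written once, generically, without knowing anything about the stored type.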
An update operation could be simply a BPF map update, for example, and a delete just a map delete; or these could be more complicated multi-step operations, or a network call, and so on. Let's look at how failures would now be handled. Say our BPF map update fails: we get an ENOSPC error from the system. This is propagated upwards, the update method fails, and the reconciler schedules a retry at a later time. It also updates metrics and health. Like we saw in the demo at the beginning, the error count was going up, something was failing, and the agent health was reported as degraded, because the reconciler could not reconcile all the objects. Finally, the failing status is written back into StateDB, where it can be inspected; we saw that quickly in the demo, and here it is again. Through cilium-dbg we can look at any of these tables, dump them, and watch them over time; that's also something revisioning gives us, the ability to watch for new things to arrive. In principle, this is very similar to what you get with Kubernetes as well: very similar concepts and APIs, just internal to the agent. You have the same flexibility for adding new components, and the same guarantee that you won't interfere with other components in the system.

All right, we're getting to the summary. What we showed here was an example of the infrastructure: we saw StateDB and the ability to inspect it, we saw the reconciler in action, and we saw agent health, which is also a new component. In 1.16 we're going to have the first use cases for this: a lot of work has been done around device detection and node addresses, and a lot of issues where the agent wasn't reacting to device and IP address changes are being resolved with these tools.
And yeah, new features are highly encouraged to start leveraging this to get better observability and resilience. All right, thank you. At the bottom there's a link if you're interested in the details of how this is implemented; you can look at the repository, and there's a fairly complete example in there. All right, any questions?

Good morning, and thank you for the presentation. A question about StateDB: it's an in-memory database, so I'm interested to know how you guarantee consistency or availability of the data you store in it.

So it's really meant to replace the hand-rolled stores of state within the agent; it's a replacement for hash maps protected by mutexes and the notification mechanisms around them, so we have the same requirements there. It's all in memory, so if the agent restarts, it rebuilds the whole state from scratch. We basically persist the state in BPF maps, and each time the Cilium agent restarts it restores the state from the BPF maps, the Kubernetes API server, and other inputs. So it's not meant to be a durable store.

How is the memory usage compared to previous versions?

It's going to be very similar. If you look at, for example, the Kubernetes objects we receive, we're always unmarshaling a new object, so we always have a new copy. Where a difference might come in is if there are a lot of readers holding on to old snapshots of the database; that's where we might be holding on to more data. But it is comparable.

What about consistency of StateDB between Cilium agents? How is that done?

That's all through the Kubernetes API server or the kvstore; none of those mechanisms change. We're still eventually consistent there: we receive events from outside and then reconcile. There's no requirement for cross-agent transactions and things like that, although you could plug Raft into it and so on, which is basically how Nomad works; it's very similar to that.
Inside the agent, we didn't emphasize it, but this also allows us to establish communication patterns between different components and sub-components of Cilium. Components can communicate through StateDB: we have a bunch of reconciler loops that are just watching some table for updates. Let's take the service example: for services, we showed that the input is the Kubernetes API server, but there might also be, say, some feedback loop coming from Envoy. That feedback loop can write into the database, and the service manager will see the updates and react to them. Right now we have a bunch of components communicating in quite different ways: locks, Go channels, callbacks, etc. So this could establish a common communication mechanism, like a big message bus for the Cilium agent. Thanks a lot.
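The watch-channel pattern that makes this message-bus style possible can be sketched as follows; this is a single-goroutine toy that omits the locking and transactions a real table needs:

```go
package main

import "fmt"

// table holds a value plus the watch channel handed out to readers.
type table struct {
	value string
	watch chan struct{}
}

func newTable(v string) *table {
	return &table{value: v, watch: make(chan struct{})}
}

// write replaces the value and wakes every watcher by closing the channel.
func (t *table) write(v string) {
	t.value = v
	close(t.watch)
	t.watch = make(chan struct{})
}

// read returns the current value and a channel that closes on the next write.
func (t *table) read() (string, <-chan struct{}) {
	return t.value, t.watch
}

func main() {
	t := newTable("v1")
	val, watch := t.read()
	fmt.Println("component B read:", val)

	t.write("v2") // component A publishes an update through the table

	<-watch // the old watch channel is now closed: time to re-query
	val, _ = t.read()
	fmt.Println("component B read:", val)
}
```

Because closing a channel wakes every goroutine blocked on it, one write can fan out to any number of watching components without them knowing about each other.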