Thank you very much, Liz. So, as Liz mentioned... well, hey, everyone. My name is Maxime Visano, and today I have the immense pleasure of being alongside the one and only Hemanth Malla. We decided to stand in front of you today to talk about our experience with Cilium over the past few years. We have a lot of content, so I'm not sure we'll have time to answer questions from the stage, but feel free to reach out to us afterwards or on the community Slack channel; you can see our handles on this slide.

Hemanth and I are both engineers working for a company called Datadog, where we operate Cilium on a daily basis. In case you don't know Datadog, we are a SaaS observability company with about 5,000 employees across the globe. Our infrastructure is entirely cloud-based, and we are spread across several cloud providers. The reason is really driven by our business: our customers want us to serve them from wherever they operate themselves. As this slide attempts to reflect, most of our workloads are scheduled onto Kubernetes clusters, and as you can see, at Datadog we share a common interest in hexagons and eBPF bees. In terms of numbers, we run several hundred Kubernetes clusters, over tens of thousands of nodes, and several hundreds of thousands of pods.

Without too much surprise, I guess, since we're here today, we chose Cilium for our pod networking implementation. We don't use all of the features that Cilium has to offer, though. First and foremost, when we started looking into it, our primary goal was to solve the IP address management and container network interface needs that we had. We went straight for the native routing implementation; we evaluated the overlay one, but it didn't fit our requirements well. Secondly, we wanted to benefit from the kube-proxy-less approach to load balance our ClusterIP-based services, and to pick up all the eBPF-related perks we would get along the way. And from a security perspective, we were also very interested in Cilium's implementation of network policies.

Time-wise, as you can see here, our journey with Kubernetes started around five years ago. We didn't go straight to Cilium: we first evaluated the respective cloud service providers' offerings, but quickly realized they would not go very far with our upcoming needs. So we shifted to Cilium about a year in, and a year after that we started implementing network policies. It's now been about two years since we decommissioned our last remaining non-Cilium-operated cluster.

So, enough background. Getting started with Cilium is amazingly simple; we've seen that over the past few presentations, and you are only a one-liner away from getting network packets flowing for your pods. But on day two, whether your goal is to keep the lights on or to expand your estate, there are many knobs, settings and parameters that can be tuned to keep those packets flowing. Which is why today Hemanth and I wanted to retrospect and reflect on the various ones we found throughout our journey. Please bear in mind that everything we are going to tell you is really in the light of the needs, the challenges and the experience we've had over the years.
But it doesn't necessarily mean that it will or does apply to you: you have your own setup and your own requirements. So we are really going to focus on the knobs themselves, and not necessarily on the values we've set them to. Also, most of the challenges we are going to present are really there to highlight how efficient, and how much of a synergy, working with the Cilium community can be when it comes to overcoming these issues; most of them are sorted by now. To start us off, I'll let Hemanth walk you through what we discovered around IPAM.

Thanks, Maxime. So, as Maxime mentioned, at Datadog we use Cilium for IP address management. For those of you who already run Cilium this might be very familiar, but as a quick overview of what that architecture looks like: Cilium primarily has two components. The Cilium agent runs as a DaemonSet in your Kubernetes cluster, on every single node, and is responsible for things like installing BPF programs, allocating IP addresses to your pods, and so on. The other component, the Cilium operator, runs as a Deployment in your cluster and is the only component allowed to talk to your cloud provider APIs and perform operations like creating network interfaces or allocating IP addresses to those interfaces. The Cilium operator and the Cilium agents talk to each other through a CRD called CiliumNode.

One of the things we really like about the operator model is that it allows us to standardize a lot of things across cloud providers. Compared to the cloud providers' own CNI plugins, we can centralize all the rate limiting in the Cilium operator itself, and it also gives us improved observability. It also lets you use common abstractions like pre-allocate, min-allocate and max-allocate to maintain a certain pool of IPs available on every single node.

One of the challenges we ran into initially was how to trade off IP wastage against allocation speed. As Maxime mentioned, we run a flat network model, which means we have a tight IP address space, so we cannot afford to waste IPs on any of our nodes. We therefore run all of our nodes with pre-allocate set to one. The trade-off is that on a node where more than one pod gets scheduled, IP allocation latencies go up. Imagine a node with, let's say, 30 pods scheduled on it, and the node has to be replaced because of a cloud provider maintenance window or something like that. In that case the operator would take a long time to allocate all 30 IPs. So we introduced a feature called surge allocation: the operator is aware of the entire pending backlog on the node and goes ahead and allocates all the necessary IPs in one go. This improved allocation speed by a lot, and it has been available since Cilium 1.11.
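To make that pre-allocate versus surge trade-off concrete, here is a small, purely illustrative sketch of the two ways of computing how many IPs a node still needs; the function and numbers are made up, and this is not Cilium's actual implementation:

```go
package main

import "fmt"

// ipAllocationNeed contrasts a plain pre-allocate watermark with a
// "surge"-style calculation that also covers the pending pod backlog.
// Illustrative only; Cilium's real deficit calculation differs.
func ipAllocationNeed(available, inUse, preAllocate, pendingPods int) (watermarkOnly, withSurge int) {
	free := available - inUse

	// Watermark only: keep `preAllocate` free IPs on the node.
	if free < preAllocate {
		watermarkOnly = preAllocate - free
	}

	// Surge style: additionally cover every pod already waiting for an IP,
	// so the whole backlog can be satisfied in a single allocation round.
	if needed := pendingPods + preAllocate; free < needed {
		withSurge = needed - free
	}
	return watermarkOnly, withSurge
}

func main() {
	// A freshly replaced node with 30 pods waiting and pre-allocate set to 1.
	w, s := ipAllocationNeed(0, 0, 1, 30)
	fmt.Println(w, s) // Prints: 1 31. The watermark alone trickles IPs out one round at a time.
}
```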
The next challenge we ran into was the network address unit limitation in Amazon VPCs. The default limit on total network address units in a VPC is 64K. You can ask Amazon to bump that to 256K, but you cannot go beyond that. To solve for that, Amazon introduced a new feature called prefix delegation. With prefix delegation, instead of allocating /32 IP addresses on your ENIs, you can allocate /28 prefixes. This effectively allows you to allocate 16 times more IPs on every single interface, depending on the instance type you're running on, and it also allows you to have more pod density in your VPCs. At that point in time Cilium did not have support for it, so we worked with the community, and it has been available since version 1.10. And I'll let Maxime talk about how we manage rollouts.

Thank you very much. So as I mentioned earlier, one of the greatest things about Cilium is that it's super easy to get started, and its deployment process is really something the community takes seriously. You have various ways to deploy it, whether you use the CLI, as was shown earlier, or you leverage the Helm charts. And whether you're deploying it for the first time or upgrading large production clusters, there is a ton of documentation at your disposal, as well as mechanisms to prevent you from breaking everything when you do so.

Once again, the aviation industry has a ton of interesting procedures to take inspiration from, which is why, as pilots do before takeoff, the first thing we choose to do before deploying Cilium is to run the pre-flight checks. For those who are not aware, the pre-flight checks mostly consist of a set of subcommands of the Cilium CLI which can perform various verifications before you attempt to upgrade Cilium on your clusters. It doesn't matter much for an initial deployment, but when you are upgrading sensitive production clusters, it can be very helpful.

There is one check in particular that we are very fond of, which is the validate-cnp one. What this check does is look at all the Cilium network policies present on your cluster, iterate over them, and ensure that the version of Cilium you are attempting to upgrade to is not going to misbehave with the current state of these policies. In our case it was particularly useful when we attempted to upgrade from Cilium 1.8 to 1.10, I think, as we had many surprises highlighted by this check: many of the CNPs configured on our clusters were not actually providing the permissions we believed they were. The main reason is that prior to Cilium 1.10, only a very limited set of validations was performed on the CiliumNetworkPolicy syntax, and this left us with hundreds of broken policies across our fleet.

First, we discovered that we had multiple indentation mistakes. All of those policies were still being accepted by the Cilium agents and were technically interpreted correctly, but they were not actually matching anything. Secondly, we discovered that some users were misusing the keys needed to target particular entities or endpoints, and as a result those policies had no effect at all.

These cases got caught by the validate-cnp check, but it is not a silver bullet: some things do not get caught, and that's not really a bug but rather by design. For instance, in our case we had some users leveraging features of the CiliumNetworkPolicy spec that were not yet implemented in Cilium 1.8, such as the example here, ingressDeny. What happened is that when we started rolling out Cilium 1.10, the agents in the upgraded version interpreted those rules correctly, and we started dropping packets for those users.
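Going back to the indentation mistakes for a second, here is a toy illustration of why such a policy can be accepted yet match nothing: the mis-indented block simply gets dropped during parsing. The struct and the policy snippet below are made up for illustration and are not the real Cilium API types:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Toy policy types, loosely shaped like a CiliumNetworkPolicy.
type toyPolicy struct {
	Spec struct {
		EndpointSelector struct {
			MatchLabels map[string]string `yaml:"matchLabels"`
		} `yaml:"endpointSelector"`
		Ingress []struct {
			FromEndpoints []struct {
				MatchLabels map[string]string `yaml:"matchLabels"`
			} `yaml:"fromEndpoints"`
		} `yaml:"ingress"`
	} `yaml:"spec"`
}

// The ingress block is accidentally nested under endpointSelector instead of
// sitting directly under spec, so it is silently ignored when unmarshalled.
const misIndented = `
spec:
  endpointSelector:
    matchLabels:
      app: backend
    ingress:
    - fromEndpoints:
      - matchLabels:
          app: frontend
`

func main() {
	var p toyPolicy
	if err := yaml.Unmarshal([]byte(misIndented), &p); err != nil {
		panic(err) // no error: unknown fields are dropped by default
	}
	fmt.Printf("selector=%v ingressRules=%d\n",
		p.Spec.EndpointSelector.MatchLabels, len(p.Spec.Ingress))
	// Output: selector=map[app:backend] ingressRules=0
	// The policy parses fine, but none of the intended ingress rules exist.
}
```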
So how do we use all of this? As I mentioned earlier, there are various ways to deploy Cilium; ourselves, we chose to leverage the Helm charts, so that we don't have to worry too much about all the glue needed to deploy and roll everything out. Enabling the pre-flight checks is very simple: it's basically a value you set to true and you're pretty much good to go. But as I said, this is not a silver bullet, and we operate hundreds of clusters, so sticking to just those two things would probably still let us break everything quite easily if we didn't look any further.

This is where the connectivity tests come into play. As with the pre-flight checks, the connectivity test is a set of subcommands available from the Cilium CLI. What this feature does is spin up test pods that simulate the connectivity scenarios you want to assess: whether pod A can talk to pod B, whether pod A can reach a ClusterIP service, or an external IP address. In our case, we decided to leverage the Helm rollout process and include those tests as part of our Helm chart definition, using the post-upgrade hook. That way, whenever we roll Cilium out across our clusters, these tests are run and we get feedback before rolling out to many more clusters and breaking everything. (I'll show a minimal sketch of what such a check can look like in a moment.)

If you do those two things, you are likely to cover a very good percentage of the ways things can break, or at least be aware that something is broken as you roll Cilium out. But that's the theory. In practice, when operating at this scale, you're always going to have surprises, and that one particular cluster that is an outlier and doesn't behave exactly like the rest of the fleet.

Here is an example of such a situation that happened to us quite a while ago. What we decided to do that day was to change the value of a flag on many of our AWS-based clusters. It was a flag on the Cilium operator which instructs the operator, whenever it starts, to pull the list of all the instance types available in the particular region it is running in; it does so to know how many ENIs can be attached to a given instance type. For various reasons, we wanted to disable that behaviour. We rolled it out and it worked great, except on one cluster. On that particular cluster, new pods were not able to come up properly on any node that didn't already have an IP available in its pool. We started being fed logs from the Cilium agents saying they were waiting for IPs to become available and suggesting we check what was happening with the Cilium operator. On the operator side, though, we couldn't see anything odd in the logs.
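Before getting into how we debugged that, here is the minimal sketch I promised of the kind of probe a post-upgrade hook job can run. This is not Cilium's connectivity test implementation; the target address, the PROBE_TARGET override and the timeout are made-up stand-ins for whatever pod or service you actually want to verify reachability to:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Default to the in-cluster Kubernetes API service; override with a
	// hypothetical PROBE_TARGET environment variable for other scenarios.
	target := "kubernetes.default.svc.cluster.local:443"
	if v := os.Getenv("PROBE_TARGET"); v != "" {
		target = v
	}

	conn, err := net.DialTimeout("tcp", target, 5*time.Second)
	if err != nil {
		fmt.Fprintf(os.Stderr, "connectivity check to %s failed: %v\n", target, err)
		os.Exit(1) // a non-zero exit fails the hook, and therefore the rollout
	}
	conn.Close()
	fmt.Printf("connectivity check to %s succeeded\n", target)
}
```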
So, back to that cluster: this is where starting to leverage the debugging capabilities at your disposal can be helpful. You have various ways to debug what is happening within your Cilium agent and operator pods. You can, for instance, use the cilium-bugtool command, which is very efficient and provides a lot of information about what's happening inside the Go processes, as well as a lot of Cilium-oriented details that can be very useful if you are requesting help from the Cilium community. Otherwise, you can enable the pprof flag on the Cilium agent or the operator and use the standard go tool pprof tooling.

Luckily for us, Datadog is an observability company, and one feature of our product allows us to store and explore this kind of profiling data over time. This can be particularly useful, for example, when you are trying to assess the performance of a particular code path from one version to the next. This is called continuous profiling. It comes at a price, though: continuous profiling involves some overhead, and in our case we decided not to enable it by default on all of our Cilium agents and operators. To work around that, we came up with an idea, or a hack, call it what you wish: we have a dedicated debug DaemonSet that is configured to be pinned by a node selector, so whenever we want to troubleshoot a particular node, we just set that label and the Cilium agent on it gets restarted with the right configuration.

So why did we have this particular issue on that one cluster? The root cause, in the end, was that this cluster used a specific instance type that was only present there, and the version of the Cilium operator we were running was not aware of the ENI spec of that instance type. It entered a deadlock, not fulfilling any other requests and staying stuck in that limbo state. If you're interested in learning more, there is a PR linked here.

So I've talked about how you can dig into problems and why you should not hesitate to investigate. But before you even know there is a problem, you need sufficient health monitoring and observability to figure it out. I'll let Hemanth talk to you about that.

Thanks, Maxime. Similar to how we monitor the health of the services we build, it's also very important to monitor the health of Cilium's control plane, its data plane, and all the external dependencies Cilium has. For those of you who run Cilium in the cloud, the most important thing you can monitor is rate limiting by your cloud providers. Here's a screenshot from one of our incidents where the Cilium operator was getting rate limited so aggressively that almost 100% of our API calls were being rejected by AWS. We took a closer look at how many calls we were actually making and realized it was around 1,000 API calls per second. But this is not supposed to happen, because the Cilium operator has a client-side rate limiting feature and we had a QPS and a burst value configured, so there's no way we should have been making 1,000 API calls per second. It turns out there was a bug in the Cilium operator which resulted in the client-side rate limiting being completely bypassed, and that's what let the operator make around 1,000 API calls per second. The fix for this has been available since 1.11.
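As a minimal sketch of what client-side rate limiting with a QPS and a burst value looks like, here is an illustrative Go example using golang.org/x/time/rate; the numbers and the callCloudAPI stand-in are made up, and this is not the operator's actual implementation:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow 4 requests per second on average, with a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(4.0), 20)

	for i := 0; i < 100; i++ {
		// Every call must go through Wait; bypassing this step is exactly the
		// kind of bug that let the operator make ~1000 calls per second.
		if err := limiter.Wait(context.Background()); err != nil {
			panic(err)
		}
		callCloudAPI(i)
	}
}

// callCloudAPI is a stand-in for a real cloud provider API call.
func callCloudAPI(i int) {
	fmt.Println(time.Now().Format("15:04:05.000"), "API call", i)
}
```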
At Datadog, we also run Cilium in KV store mode with etcd; it's recommended to use etcd in clusters of a large size. This time there was another bug, but we were actually upgrading from a version which had the bug to a version where the bug was fixed, and sometimes that can be tricky too. The bug was in the KV store package, and the version we were upgrading to started to actually enforce its rate limits. What this ended up causing is that the Cilium operator was unable to garbage collect Cilium identities, and as a result it was taking a long time for new pods to get their own Cilium identities. After this incident, we started closely monitoring all of our interactions with the KV store so that we can stay on top of what's happening in our clusters.

Sometimes monitoring simple signals, like agents in CrashLoopBackOff or the operator restarting, can also be really helpful. In one example, the agent restart metric allowed us to catch a deadlock in the KV store package where a Cilium agent was trying to write to a channel and, after 128 entries, couldn't write any more. It entered a deadlocked state, the agent started failing its health checks, and that ended up restarting the Cilium agent pods.

Now, the data path. In our experience, the Cilium data path has been quite stable for the most part, but we did run into a few issues, and in this section we'll talk about a couple of examples and how you can avoid them with the right configuration and monitoring.

This incident started with one of our users reaching out to say they were not able to access their ClusterIP service. When you create a Kubernetes Service of type ClusterIP, Cilium stores the corresponding backends in an eBPF map called backends_v2, and there is a related feature called graceful termination where Cilium watches for pods entering a terminating state and proactively removes them from the backend list so they don't get any new traffic. But there was a corner case: if somebody deletes the Kubernetes Service while one of the pods is in a terminating state, Cilium would leak those backends. And here's the thing: it's actually really expensive to count the total number of entries in an eBPF map, so to work around that, Cilium maintains an in-memory count of additions and deletions to the BPF map. When we used the Cilium CLI to check how many backend entries were present, Cilium told us there were 3,000 entries, but when we used bpftool to check the total number of entries in the BPF map, we found we were actually exhausting the map. Once we reported this, upstream was really quick to fix it, and also included a test to make sure it does not happen again.

That was really a corner case, but after the incident we did a post-mortem and realized we had no monitoring on our BPF map usage whatsoever. So we looked at whether upstream had anything to help us monitor this, and it turns out there is a metric called bpf_map_pressure, but it wasn't enabled by default in the version we were running at the time.
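To illustrate the in-memory accounting idea, and why it can drift from reality, here is a small sketch; the mapAccounting type and the numbers are made up and this is not Cilium's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// mapAccounting keeps an in-memory entry count for a BPF map, since counting
// entries in the kernel is expensive. The count only stays accurate if every
// add and delete goes through it, which is exactly what broke in the
// leaked-backends incident, and what LRU eviction makes impossible in general.
type mapAccounting struct {
	maxEntries int64
	entries    atomic.Int64
}

func (m *mapAccounting) add()    { m.entries.Add(1) }
func (m *mapAccounting) delete() { m.entries.Add(-1) }

// pressure returns the fill ratio that a map pressure style metric would report.
func (m *mapAccounting) pressure() float64 {
	return float64(m.entries.Load()) / float64(m.maxEntries)
}

func main() {
	acct := &mapAccounting{maxEntries: 65536}
	for i := 0; i < 3000; i++ {
		acct.add()
	}
	// Prints 0.046: looks healthy, even though the real kernel map can be far
	// fuller if entries leaked without the bookkeeping ever noticing.
	fmt.Printf("pressure: %.3f\n", acct.pressure())
}
```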
This bpf_map_pressure metric is actually quite interesting, because it does not cover all types of BPF maps. In particular, when you are using BPF maps of type LRU, there's a limitation with the Linux kernel: if you're trying to add a new entry to the map and the kernel evicts an existing entry to make room for it, there's no way for user space to know about it, so Cilium cannot maintain an accurate in-memory count of how many entries are there. Until very recently, that is. There is a commit that hasn't landed in any Cilium release yet: on kernels 5.6 and above, Cilium started using a batch API to efficiently calculate the total size of the BPF map, and with this we should also get support for maps like the connection tracking map. And you would think the community would stop here, right? Because we have everything we need. But no: Cilium actually started working with the kernel community as well, to try to come up with an API or some mechanism to expose that information back to user space. This is the kind of work that gets us really excited about working on Cilium; it's just one example, but there are several where the Cilium community has been working with the kernel community to introduce features that benefit container networking.

There's one flag that can help you with right-sizing your BPF maps, called bpf-map-dynamic-size-ratio, with which the Cilium agent will look at the total amount of memory available on your node and size your BPF maps proportionally (I'll show a tiny sketch of the idea in a moment). There are a few more gotchas to be aware of with BPF maps. One of them: if your users create a network policy of type allow-all, you can easily fill up the policy BPF map, depending on the total number of identities present in your cluster. And if you find yourself in a situation where you have to resize a BPF map, there is currently a limitation, or a bug, which results in packet drops due to missed tail calls. It's being addressed upstream, and I think the fix should be backported to 1.12.

Talking about metrics, another metric we find really helpful is the agent's controllers-failing metric. The Cilium agent itself is made up of several independent controllers, each tasked with just one thing, and every time any of those controllers fails, Cilium bumps this metric. For a long time we were not able to find out which controller was failing, but, as with everything, things keep changing very fast in Cilium: there's a new PR that exposes this information, so in the future you should be able to know exactly which controller is failing. And to get a holistic picture of how your BPF-related features are working, there's also a metric called bpf_map_ops_total that allows you to monitor how your BPF map operations are performing.
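And here is the tiny sketch I promised of the idea behind bpf-map-dynamic-size-ratio. The ratio, bounds and bytes-per-entry figure are made up for illustration; Cilium's real calculation and defaults differ:

```go
package main

import "fmt"

// dynamicMapSize sketches sizing a BPF map proportionally to node memory,
// which is conceptually what the bpf-map-dynamic-size-ratio setting does.
func dynamicMapSize(totalMemBytes uint64, ratio float64, bytesPerEntry, min, max uint64) uint64 {
	entries := uint64(float64(totalMemBytes) * ratio / float64(bytesPerEntry))
	if entries < min {
		return min
	}
	if entries > max {
		return max
	}
	return entries
}

func main() {
	// A 16 GiB node, dedicating 0.25% of memory to this map at ~64 bytes/entry.
	size := dynamicMapSize(16<<30, 0.0025, 64, 65536, 1<<24)
	fmt.Println("map size:", size) // 671088 entries on this hypothetical node
}
```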
If we were to recommend one single tool to add to your toolbox, it would be bpftrace. bpftrace is a high-level tracing language for Linux systems that uses eBPF to help you debug different aspects of the kernel, and I'll talk about a couple of examples. In this one, one of our users was reporting packet drops, and it was happening because the Cilium identity was getting corrupted. A Cilium identity is made up of two components, a cluster-local identity and a cluster ID, which together form a global identity that can be used to enforce network policies when you mesh your clusters together. At Datadog, our cluster provisioning system randomly assigns cluster IDs, so sometimes the cluster IDs can be big. What Cilium does is serialize this identity information into the kernel's skb->mark and restore it somewhere else in the data path. There's a completely unrelated feature, to support node ports in ENI mode, which was also using the same skb->mark. So what would happen is that sometimes the eighth bit in the mark would be wiped out, and that ended up corrupting the Cilium identity. The most important part here is that we were able to use bpftrace to trace the kernel's skb->mark value and figure out that the value was actually being changed. There's also another talk we gave at the last CiliumCon that uses bpftrace to do a similar kind of debugging.

I'd like to summarize this section with another pull request, by Laurent here, which really emphasizes the power of eBPF: we were actually able to work around a kernel bug by repurposing one of Cilium's eBPF programs. That talk is also available on YouTube, so check it out if you're interested. I don't think we have time for the additional knobs.

So, what did we learn from running Cilium for the last few years? Sometimes, when you look at Cilium from a distance, it can look like a black box, but the community has been building a lot of tools to make it easier, so don't hesitate to dive deep. Depending on the nature of your applications, some issues can take a while to appear, so make sure you invest in testing and the right kind of observability, and in things like canary rollouts, so that you can catch issues before they land in production. We really believe that Cilium is your best bet to leverage eBPF for container networking. And I'd like to emphasize that your infrastructure might be unique: the kind of features you're interested in, or the kind of bugs you're running into, might not be what the rest of the community is running into. So make sure you engage with the community and let them know what's working well for you and what's not. And thank you. If you're interested in working on weird and fun issues like these, we're always hiring; you can reach out to either of us. Thanks.

Okay, I think while our next speaker gets mic'd up, we might have time for one quick question, maybe. I see someone approaching the mic.

Awesome. Is that cluster-ID-greater-than-128 bug fixed? Yeah, good question. And where do we go for more information? Yes, so there's an upstream issue that discusses this in a lot more detail. Currently we've worked around it by making sure we don't use the node port feature together with cluster IDs that large. So it's actually not addressed upstream yet; in order to address it, a related feature needs to be reimplemented using eBPF. It's still open, but I can talk to you more about it after this.