Hello everyone, welcome to our talk. It's great to be here in Amsterdam today. We'll be talking about tales from an eBPF program's murder mystery. I call it a murder mystery because it's catchy, but there might be a liar or two in there; we'll figure that out soon. This is the story of how one of the most important eBPF programs in our Kubernetes clusters randomly vanished, how we investigated the issue, and how we got to the bottom of it.

A quick introduction: I'm Hemant, I work on the compute datapath team at Datadog, and I'm joined by Will, who works as a security engineer on the cloud workload security team. If you haven't heard of Datadog, we're a cloud monitoring and observability company. Here are a few quick facts about us, but the most important one for this talk is that we run hundreds of Kubernetes clusters with thousands of nodes in them, all of them run Cilium, and we run on all major cloud providers. Datadog does a lot of things, but two of them matter here. First, Datadog has a component called the Datadog agent, which runs as a DaemonSet in Kubernetes clusters and collects logs, metrics, traces and other telemetry to power many products. Second, within the Datadog agent we have a product called Cloud Workload Security (CWS) that detects runtime threats using eBPF.

Here's a quick outline of the talk. We'll first cover how our users noticed the issue, then some quick background on Linux traffic control and how Cilium and the Datadog agent use it, and then we'll get into the core of the investigation and the lessons we learned along the way.

It all started with an incident, of course. We manage an internal compute platform at Datadog on top of which all Datadog applications run, and one of our internal users started reporting connectivity issues to us. Our crime scene was more complicated than usual, because the issues being reported were extremely short-lived: by the time we could get there and collect evidence, the issue would vanish entirely. Users were reporting connectivity issues for some pods, the containers seemed to be constantly crashing, readiness probes were failing, and there were some unexpected network policy denials. The whole issue was really hard to reproduce because it kept resolving itself automatically.

The first thing we generally do with Cilium whenever there's a connectivity issue is look at cilium monitor logs. cilium monitor is like real-time CCTV camera footage of the network inside your Kubernetes cluster; you could also use Hubble for the same purpose. When we took a closer look at the cilium monitor drop logs, we noticed packet drops for traffic going from an identity called 93739 to identity 1. Identity 1 is the reserved host identity, and 93739 is just one of our endpoints. The dropped packets were SYN-ACK packets, i.e. responses to some other request, and because the destination is the reserved host identity, these were actually response packets to health checks from the kubelet.
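For reference, this is roughly how you'd stream those drop events from a running Cilium agent; the kube-system namespace and the cilium DaemonSet name are common defaults and may differ in your cluster:

```sh
# Stream only packet-drop events from the Cilium agent on a node
kubectl -n kube-system exec -it ds/cilium -- cilium monitor --type drop

# Resolve what a numeric security identity maps to
# (identity 1 is the reserved "host" identity)
kubectl -n kube-system exec -it ds/cilium -- cilium identity get 93739
```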
So this is why our health checks were failing and our pods were constantly crashing. A quick background on how this particular flow works: to allow traffic from the kubelet to endpoints on the same host, every endpoint has a default policy in place that allows packets to go from the kubelet to the endpoint itself, so that pods can respond to health checks. Whenever Cilium sees this ingress connection, it also updates a connection tracking entry, so that when the response packet is sent from the endpoint back to the kubelet, it is automatically allowed based on connection tracking. In our case, it looked like those connection tracking entries were not being updated.

A quick background on how Cilium is set up in our clusters: every pod created on a node gets its own network namespace and its own virtual ethernet (veth) pair. The pod's network namespace is connected to the host network namespace through this veth pair, and by the nature of a veth pair, every packet sent on one end is immediately transmitted to the other end. Cilium also installs a few route table entries and IP rules to make sure packets can get in and out of the pod and the host. Cilium implements a lot of its features using eBPF, and one of the most important BPF programs is bpf_lxc, which has two sections, one for ingress and one for egress. Cilium uses traffic control to invoke these BPF programs on those interfaces. I'll hand it over to Will to talk more about traffic control.

Thanks, Hemant. So we've been saying a lot today that Cilium uses eBPF to monitor traffic. Technically speaking, there are dozens of ways to use eBPF to monitor, and potentially modify and mangle, network packets, and traffic control is one of them. The reason we need to dive into it is that you need to understand the basic concepts of traffic control in order to understand the murder mystery.

Traffic control is a pretty complex subsystem of the Linux kernel that is usually used to shape network traffic on ingress and egress. In other words, given a specific interface, you can monitor each and every packet coming in and out of it. It works with queuing disciplines (qdiscs). There are a lot of different qdiscs, but the one we care about here is clsact, which was introduced specifically for eBPF use cases. It has two main hook points, ingress and egress, and on these hook points you can attach eBPF programs that we call direct-action TC filters. They're called direct-action because the return value of the program decides the fate of each packet. There are several possible return codes, but the main ones are these three: TC_ACT_OK, which means the packet is allowed through; TC_ACT_SHOT, which means the packet is dropped no matter where it was going; and TC_ACT_UNSPEC, which means the filter doesn't want to make a decision yet and lets the next filter decide.

Filters are identified by handles, which are either hardcoded or numbers chosen by the kernel and allocated at runtime, and you can also specify priority levels for your filters. The priority defines the order of execution of the different filters: the lowest priority is executed first, and at a given priority level, the program that was loaded last is the first one to be triggered.
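As a rough illustration of those concepts, here's what attaching a direct-action filter with an explicit priority and handle looks like with the tc command; the interface name, object file and section names are made up for the example:

```sh
# Create the clsact qdisc, which exposes the ingress and egress hook points
tc qdisc add dev veth0 clsact

# Attach an eBPF program as a direct-action ("da") filter on the ingress hook,
# pinning both the priority (pref 1) and the handle (handle 1)
tc filter add dev veth0 ingress pref 1 handle 1 bpf da obj prog.o sec ingress

# The same priority/handle rules apply on the egress hook
tc filter add dev veth0 egress pref 1 handle 1 bpf da obj prog.o sec egress
```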
So with this background, let's see how Cilium uses and configures its own TC programs. There are a few things you need to know. The first is that Cilium's TC filters always answer either TC_ACT_OK or TC_ACT_SHOT; in other words, one is the way it drops packets, and the other makes sure a packet is allowed through, either toward a workload or out of the node. The second is that Cilium hardcodes both the handle and the priority to 1. As a side note, this is expected: Cilium owns the network datapath, so it makes sense for it to use a priority level and a handle that guarantee it will be the first thing called when a packet arrives on an interface.

But because of this, and because we wanted to use TC as well to implement our own network policy use case, we had to work around these parameters and figure out a way to introduce our TC filters while making sure they would still be triggered. So first of all, we decided to hardcode priority 1 as well, because we wanted to rely on the execution-ordering rule that says the latest inserted program at a given priority level is the first one triggered. The rationale was that Cilium would set up the pod's networking first, and then we would load our own instrumentation on the interfaces we care about. We did not hardcode any handle, because we didn't want to claim a specific number: Cilium is only one CNI out there, and other products could be using other handles as well. And we made sure to always return TC_ACT_UNSPEC, so that Cilium would make the ultimate decision on every network packet. In theory, that's pretty much it. We also added a periodic check to make sure our filters were still loaded; again, we knew other people might be using TC, so once we instrument an interface, we check that we're still there after a certain period of time. On paper, everything should work fine. Except it didn't.

At this point we'd been to a lot of crime scenes, and we started to see some patterns. What we realized is that the murderer, whoever that was, was only interested in new pods; pods that were already running were never touched. All the reports we were getting from our users were about new pods, and as I mentioned before, these pods were completely vanishing by the time we could get to them, so we really needed a reproducer. Because we understood this was only happening for new pods, we created a test workload, removed the readiness checks on it, and ran it in a loop until we hit the issue. Soon enough we had a reproducer.

Once we had that pod and the host ready to investigate, we wanted to answer the question: is Cilium actually setting up the pod network correctly? We could do this in a few ways, but the Cilium version we were using at the time installed BPF programs through the TC subsystem using the tc binary that ships in the cilium-agent image itself, and we could use the same tc command to inspect whether the pod's interface had the necessary BPF programs.
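Concretely, that inspection looks something like this; the host-side veth name here is illustrative:

```sh
# List the TC filters attached to the pod's host-side veth interface.
# With Cilium you expect to see bpf_lxc sections on both hooks.
tc filter show dev lxc1234 ingress
tc filter show dev lxc1234 egress
```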
And here we could see that the egress programs were completely missing for this specific pod, while the ingress programs were totally fine. We were completely confused at this point, and we wanted to understand why.

BPF has a rich ecosystem of tools at our disposal, so we decided to use bpftrace to try to understand exactly what was happening. bpftrace comes out of the box with a lot of tools you can use immediately, and we started with execsnoop and exitsnoop. As I said before, Cilium was using the tc binary to install its BPF programs, so execsnoop let us trace all the programs being started on the host. We ran execsnoop with a grep for "tc filter", and we saw that tc filter replace commands were indeed being executed for both from-container and to-container, which correspond to the egress and ingress sections. We also ran exitsnoop, which tells us the exit codes of those commands, and they were all completing successfully, no issues at all. That confused us even more.

So we reached out to the community and created an issue upstream, because we were thinking that maybe the tc binary baked into the cilium-agent image differed from whatever was running on the host, and maybe it was some weird kernel thing. Within a day we got a response from Paul and Daniel. They had seen the presence of another eBPF program on the host, called classifier ingress security, correlated with this kind of connectivity issue in the past, and they asked us to look into it. But we still needed proof of why exactly this was happening.

The answer to that was writing a little more BPF, through bpftrace again, but this time with a custom bpftrace program. We put hook points, kprobes and tracepoints, on TC filter replace, add and destroy, and also on qdisc create and destroy, and we made sure to log the PID, the probe name, and the user stack at the point where each probe fired.
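A minimal sketch of that kind of program follows; note that the exact kernel functions to hook vary by kernel version, so the tc_new_tfilter/tc_del_tfilter symbols here are an assumption you'd want to verify against your own kernel's kallsyms:

```sh
# Trace who adds/replaces and deletes TC filters, with the caller's user stack
sudo bpftrace -e '
kprobe:tc_new_tfilter,  /* tc filter add / tc filter replace via netlink */
kprobe:tc_del_tfilter   /* tc filter del */
{
    printf("%s pid=%d comm=%s\n", probe, pid, comm);
    printf("%s\n", ustack);
}'
```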
Once we ran this, we started to get some really, really interesting answers. At the beginning you could see from-container and to-container being successfully installed by PIDs ending in 395 and 398, and then there were classifier ingress security BPF programs being installed by PID 2659, and that same PID was destroying bpf_lxc's to-container, and sometimes even from-container. So what is 2659? It's the Datadog agent. So was it really us all along? When we looked at the stack trace, it told us we were calling something called "flush inactive probes"; for some reason the Datadog agent thought these programs were inactive. At the same time, we also realized that Datadog's own CWS network policy coverage was missing, so something was broken there too. If only the Cilium agent's programs were being deleted, why was the Datadog agent also impacted? To learn more about that, I'll let Will talk about it.

At this point we also wanted to understand what other netlink messages were being sent, because the tc binary builds its netlink messages itself, while the cloud workload security part of the Datadog agent was using a netlink library to interact with the kernel. We wanted to capture all the netlink messages being sent to the kernel, and we realized there's a kernel module called nlmon that lets you monitor all netlink traffic. We built the kernel module, loaded it, and could then run tcpdump on the virtual nlmon interface to capture all the netlink messages. But the dissector parsing those netlink captures didn't give us all the information we wanted, so we wrote a custom tool that captures these messages, and Will will talk about that.

Yeah, exactly. We were still not really confident about what was going on, because it was clear that Datadog was deleting the program, but with all the guardrails that we had put in place and the supposedly perfect plan we had in mind, everything should have been working fine. So why did we end up deleting Cilium's filter? We built another tool to get a bit more context, because the scripts Hemant was talking about were great for identifying the culprit program, but didn't really provide the context of the different events that eventually led to the deletion. So we wrote tcprobe. It's an open source project, you can find it on GitHub, and it outputs more context about the different TC operations that happen on a host. On the right you have the output of the tool, and on the left a graphical representation of what's going on.

The race condition happens at startup, when a new pod starts and a new interface is set up, and both Cilium and Datadog instrument it. So let's go event by event and see how things unfold into a murder. First, a new interface comes online, and Cilium creates a new clsact qdisc to instrument this specific interface. So far so good; this is expected. Then Cilium moves on to installing its own TC programs; this one specifically is on ingress, and as expected it has priority 1 and handle 1. Still fine. Then we pick up on the new interface: we have eBPF programs that detect when new veth pair interfaces are registered, so we get a notification from the kernel when a new interface is ready to be instrumented, and that's how we decide to add our own TC programs. Here you can see our own classifier ingress security program, and it's using handle 2; again, we didn't hardcode any handle, so this handle was provided by the kernel.

The race occurs when we end up being faster than Cilium at instrumenting the interface that Cilium created. The bad news is that because we did not hardcode any handle, we end up getting handle 1: ours is the first filter at this priority level, so the kernel just grants us handle 1. The issue is that Cilium didn't prepare for this, because it uses handle 1 as well, but hardcoded, and because one of its use cases is atomically swapping its filters, it replaces the filter at handle 1 by default, which means it atomically swapped out our filter. And to their point, we didn't prepare for this either: we have that periodic check to make sure our programs are still loaded, and by default, if they're not, we clean up everything the kernel gave us. In this specific case we still believed handle 1 was ours, so when we cleaned up behind ourselves, we ended up deleting Cilium's filter.
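To make the mechanics concrete, here is a hypothetical replay of the losing sequence using plain tc commands; the interface, object files and section names are illustrative:

```sh
tc qdisc add dev lxc1234 clsact

# CWS wins the race: no handle is given, so the kernel assigns the first
# free handle at pref 1 -- which happens to be handle 1
tc filter add dev lxc1234 ingress pref 1 bpf da obj cws.o sec ingress_security

# Cilium arrives second and replaces pref 1 handle 1, atomically swapping
# out the CWS filter instead of being installed alongside it
tc filter replace dev lxc1234 ingress pref 1 handle 1 bpf da obj bpf_lxc.o sec from-container

# Later, the CWS periodic check still believes handle 1 is its own filter
# and cleans it up -- deleting Cilium's program
tc filter del dev lxc1234 ingress pref 1 handle 1 bpf
```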
So that's pretty much the gist of it: the race condition is that in some cases we ended up being faster than Cilium at instrumenting interfaces, and that's what explains why it was not always happening and why it was so hard to debug. Alright, Hemant, take it away.

Thank you. So how do we fix this issue today? Cilium, as of 1.12, supports custom priorities for its TC filters, and CWS rolled out an update that bumped its default priority from 1 to 10, so that things wouldn't break for people using the Datadog agent alongside Cilium. If you want to use both of them together today, you'd redeploy Cilium with a priority greater than 10.

So what did we learn from this incident? We learned that TC filter ownership is very racy by design, and a growing number of products are leveraging eBPF these days, so without explicit coordination this issue is only going to get worse over time. Every player using TC eBPF needs to follow the same kind of rules. Always return TC_ACT_UNSPEC or TC_ACT_SHOT, but never TC_ACT_OK, because TC_ACT_OK ends the filter chain and the packet moves on immediately, so any filters after yours never run; monitoring products especially should not do this. Never hardcode a handle of 1, because you can't know what other products are using. And for power users, make sure the priority and handle are configurable, so that they can mix and match different components. Deleting the clsact qdisc is also very racy; there is no safe way to do it, so don't.

That's the fix for right now, but how do we fix it the right way? The right way is BPF links. The kernel has had support for BPF links for a long time, but the traffic control subsystem does not yet; there's work happening upstream to add it, and I think it'll be merged sometime around May 2023. There's also a great talk by Daniel Borkmann that gets into a lot more detail about BPF links for traffic control and how we could use them; do check that out.

In conclusion, this was actually not a murder, it was an accident, because there were no clear guidelines established on how different products should use these mechanisms together. This was a complex incident that took us several weeks to get to the bottom of, and we're really thankful to the community for all the help. I'd also like to thank my team members Jared, Eric, Laurent and Maxim for all their help. And if you're interested in working on weird and fun networking issues like this, we're always hiring; you can reach out to either of us by email or Twitter. Thank you.