Hello, we're very happy to be here with you today. We'd rather be live with you, and hopefully we'll be able to be in Los Angeles, right? So we're Laurent and Eric, and we're going to share stories about low-level problems we've faced running Kubernetes. We shared a few stories in the past about control-plane challenges, but today we're going to focus on what happens on nodes. In case you don't know Datadog, we're a monitoring company, but today we're not going to talk about the product. We're going to talk about the infrastructure behind the product, and it's pretty large: we're running tens of thousands of hosts and dozens of clusters with up to 4,000 nodes, and that comes with multiple challenges, which is what we're going to talk about today. We have a few stories that are not completely related, but they all happen locally on nodes, and that's what we're going to focus on.

So let's start with this first story about disappearing containers. It started with this simple graph here: when we migrated an application from a virtual machine to Kubernetes, we noticed that the CPU profile was very different, which was a bit worrying, right? So we looked at the pods running this application and we discovered that several of them were in CreateContainerError, actually about 10% of them. They were not ready and not processing traffic. We found very easy workarounds, like deleting the pod or restarting the kubelet, but that wasn't very satisfying.

The first thing we did was try to understand what this error was. We looked at the kubelet logs, and basically the kubelet is getting an error from the runtime, containerd in our case, because it seems to be trying to create a container that already exists according to the runtime. And it's doing this a lot, as you can see here, with 62 attempts, right? It keeps trying and getting the exact same result. So of course we wondered: who is right and who is wrong? The kubelet believes it needs to create a container. What about containerd? If we ask containerd, this container is actually running completely fine.

The next step was to try to reproduce the problem in a test environment, and we had noticed that it seemed to be related to kubelet restarts. So we wrote a very simple script that restarted the kubelet frequently, and after only an hour, as you can see, all the pods in the deployment were in the same error. So we were able to reproduce it very easily, right?

So what is this error exactly? We went to the kubelet code and looked for it, and we found that this error comes from the runtime manager, in its sync function. It seems that the kubelet wants to create this container and fails. How is this function called? It's called during pod sync, after computing the pod actions, which is basically the reconciliation that needs to happen between the pod spec and what's in the runtime. And in this case, it seems the result of this computation is wrong. How could that be? We looked at the logs, but we couldn't find much even when we increased the verbosity level. So we did a very simple thing: we added print statements to the code. And it turned out to be very effective, because as you can see at the bottom of the screen, a container, container 15, that is completely fine in the first output is missing in the second one, even though the container is still running. So it seemed that containerStatuses, which holds the statuses of all the containers in the pod, is actually wrong.
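To give an idea of what those temporary print statements looked like, here is a rough, self-contained sketch in Go. This is not the actual kubelet code; the containerStatus type below is a simplified stand-in for the kubelet's internal container status, and the function name is made up for illustration.

```go
package main

import "fmt"

// containerStatus is a simplified stand-in for the kubelet's internal
// container status type (the real one lives in pkg/kubelet/container).
type containerStatus struct {
	Name  string
	ID    string
	State string
}

// dumpStatuses is the kind of temporary print statement we sprinkled around
// the pod-action computation: dump every container the kubelet believes
// exists for the pod, so that consecutive dumps can be compared.
func dumpStatuses(pod string, statuses []containerStatus) {
	fmt.Printf("DEBUG pod=%s statuses=%d\n", pod, len(statuses))
	for _, s := range statuses {
		// A container that is "running" in one dump and missing from the
		// next, while still running in containerd, was our smoking gun.
		fmt.Printf("  name=%s id=%s state=%s\n", s.Name, s.ID, s.State)
	}
}

func main() {
	dumpStatuses("my-app-pod", []containerStatus{{Name: "app", ID: "15", State: "running"}})
}
```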
So we wanted to understand where this container status was coming from and how it could be wrong. Here is another view of the main components involved in this data path in the kubelet. Everything starts with the sync loop iteration, which is the main sync loop in the kubelet. This sync loop receives events on channels, and these can be events from the API server, such as a spec change, events from the pod lifecycle event generator (the PLEG), which are events from the runtime, and other types of events. When there is a change to a pod, it then calls the pod workers, which get the status from the pod cache and call the syncPod function, the function that was failing when it called the runtime earlier. So it seemed that when the status is retrieved from the pod cache, it's wrong, which triggers the problem I described.

So what is the pod cache? The pod cache is a cache of all the pods in the kubelet. And how is it maintained? The pod cache is maintained by the pod lifecycle event generator, the PLEG, which is the component responsible for observing what's happening in the runtime, updating the cache, and generating the events that will then be received by the sync loop iteration. And it seemed that this cache gets corrupted somehow. So here's what we know: the data in the pod cache is wrong; the pod cache is managed by the PLEG; and there's nothing in the PLEG that seems to indicate a situation where we could get corruption. So what's happening? At that point, we were wondering: what if something else is modifying the content of the cache, in particular the container statuses field, which is what's wrong here?

So we looked at the kubelet code, and after a lot of digging we found this function, which sorts the container statuses. What's important here is that this function sorts in place, modifying the slice. So we now have other code in the kubelet that is modifying the pod cache; it's not only the PLEG anymore. We looked at where this function is called, and here you have the sequence of functions involved. What is important is that syncPod and the sync loop iteration, two functions we saw before, both end up calling this sort operation. So let's go back to the kubelet diagram and the functions that can actually modify the pod cache. As we've seen before, the PLEG can modify the pod cache, but so can the function generateAPIPodStatus, which can be called in two different cases: when there is a pod update, and when there is a PLEG event saying that a container died. And the problem is that the handlers for these events run in separate goroutines, so they can race, which can lead to cache corruption. Here is a summary, right? You can have concurrency between syncPod, if there's a pod update from the API server or a container dies according to the PLEG, and the PLEG relist, if something happened in the runtime.

So we had a theory and we wanted to validate it. What we did is, instead of sorting the slice in place, we made a copy of it and did all the operations on the copy. And look, everything keeps running now, even if we restart the kubelet every few seconds. So we created a PR upstream, and we're pretty happy because we found and fixed the issue. I also want to credit Naya for doing all the work on this investigation and diving very deep inside the kubelet.
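To make this class of bug and the fix concrete, here is a minimal, self-contained Go sketch, not the actual kubelet code: sorting a cached slice in place mutates data that other goroutines still hold references to, while copying first leaves the cache untouched.

```go
package main

import (
	"fmt"
	"sort"
)

// status is a stand-in for the per-container status entries stored in the
// pod cache.
type status struct{ name string }

// Buggy pattern: sorting the cached slice in place. Every caller holding a
// reference to the cache sees the mutation, and concurrent callers can race
// and corrupt the cached data.
func sortInPlace(cached []*status) []*status {
	sort.Slice(cached, func(i, j int) bool { return cached[i].name < cached[j].name })
	return cached
}

// Fixed pattern: copy the slice first and sort the copy, leaving the cache
// to its single intended writer (the PLEG).
func sortCopy(cached []*status) []*status {
	out := make([]*status, len(cached))
	copy(out, cached)
	sort.Slice(out, func(i, j int) bool { return out[i].name < out[j].name })
	return out
}

func main() {
	cache := []*status{{"c"}, {"a"}, {"b"}}
	_ = sortCopy(cache)
	fmt.Println(cache[0].name) // still "c": the cache was not touched
	_ = sortInPlace(cache)
	fmt.Println(cache[0].name) // now "a": the cache was silently reordered
}
```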
So the key takeaways from this are that the kubelet internals are complex, but they're accessible if you spend the time. Since the internals are complex, we're not 100% sure that our fix is the best one, or that the data path I described is the full picture, but what we know is that the fix definitely works and solved our problem. The PR has not been merged upstream yet, but we hope it will be soon, or an equivalent PR fixing the problem. We were a bit lucky on this one, because we caught the problem in staging, so production was completely fine, which was good news. So this first story was about the kubelet, and now Eric is going to dive even deeper, into something that happens in the runtime itself.

Yeah, thanks very much. So this one is titled Random Failures, which, as we'll see, is a double pun on the actual problem. The problem we saw was that occasionally we had a database crash, and it was preceded by a rather puzzling log: the database couldn't open /dev/urandom, and it was getting an "operation not permitted" error. And this is happening only occasionally. I mean, clearly the database works most of the time, and then all of a sudden it can't open /dev/urandom. So what's happening? If we look inside the container, everything is fine. The file permissions on /dev/urandom are perfectly open: read and write for everyone. We don't really see why we should be having a problem opening /dev/urandom.

So we write a straightforward, very stupid reproducer, very blunt. It just sits in a loop and keeps opening /dev/urandom, and errors if it fails to do so. And very quickly we see that we're able to reproduce the problem: we get the "operation not permitted" message, and what's more interesting, maybe, is that this happens every 10 or so seconds. It's extremely regular. If we search a little bit more for logs that the kubelet might be emitting for this pod, we realize that in fact every 10 seconds the CPU manager component is emitting a "reconciled state" log message. And if we look at the details of that message, we see that it actually includes a configuration for the CPU set.

So I need to explain a little bit more what the CPU set is. In this situation, our pod is in the Guaranteed QoS class, meaning that it has requests and limits which are all specified and all equal for all the resources it requires. It also makes an integer request for CPUs, six in this example. And finally, we use a kubelet configuration where the CPU manager policy is static. The static CPU manager policy is there to allow requesting and setting aside a specific set of CPUs for a container, and it does so using the cpuset control group.

Now, if we look at the flow of configuration here through the container runtime: we have the kubelet, which talks to containerd over gRPC, telling containerd what containers need to be launched and what configuration or characteristics they need to have. containerd in turn fork-execs containerd shims, which are per container, and communicates with them over gRPC as well. And the containerd shim fork-execs runc to actually start containers, and potentially to update them again over time; for instance, it will update the cgroup files of the containers. So if we follow the smoking gun through that flow and look at containerd, we see that containerd also, every 10 seconds, is reflecting an update to the container resources. Not a big surprise at this point. We follow further down and we trace the shim.
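(As an aside, before following the trail further down: the blunt reproducer mentioned above was essentially a loop of this shape. This is a sketch in Go; the original may well have been a shell one-liner, and the sleep interval is an assumption.)

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Keep opening /dev/urandom in a loop and report any failure. Run inside the
// affected container; when the cgroup device race hits, the open fails with
// "operation not permitted" roughly every 10 seconds.
func main() {
	for {
		f, err := os.Open("/dev/urandom")
		if err != nil {
			fmt.Printf("%s open failed: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			f.Close()
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```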
Tracing the shim, we see that every 10 seconds it launches runc, and in particular launches runc with an update instruction. Tracing the runc execution itself, we see something rather surprising, on the other hand. We see that the update to the control group devices section starts with a deny on all devices, followed by individual allows for the devices we actually should be allowed to access. And so we found a race, in effect. There's a window where, for a certain amount of time, all devices are denied, and progressively some are re-allowed, until finally everything that should be allowed is allowed. But there's a definite window of opportunity here to attempt to open a device and find that we're not allowed to do so, because of the default deny rule. So we found the issue. And in fact we were in luck, because the problem had actually been fixed upstream, so we really only needed to upgrade runc to resolve the issue. The way runc resolved the problem is by emulating the changes that would be applied on update, checking them against what is in place, and only reflecting the differences between the desired state and the current state.

So this was an interesting problem. We re-learned the flow of configuration through the runtime, from the kubelet all the way down to the container. We learned about cgroup device updates as well, and possibly a little bit about the cpuset. All credit for this investigation goes to Benjamin Pinot, who delved into this a little while back. Thank you very much to him. So, key takeaways: it's really good to know the container runtime flow, how the different components interact, and to understand it. It's also important to realize that here we had several parameters influencing the overall behavior: a pod with Guaranteed QoS, an integer CPU request, and the CPU manager configured as static. If any of those three parameters had been different, we would not have encountered this issue. And now I'll let Laurent dive into an even deeper investigation, in the network layer of Kubernetes.

Now we're going to talk about how pods can become zombies. This problem started with containers remaining in status ContainerCreating for a long time. And the error message was fun, in our sense of fun, which is: well, very weird. Here the kubelet is telling us that it can't create the pod sandbox, so it can't set up networking for the pod, because it's failing to add an IP address: this address is already in use. Which is weird, because it never happened before, and suddenly it's starting to happen. So, "address already in use". Let's see if this address is bound on the host. It's not. Can we add it to an interface? We definitely can. So nothing obvious there. Of course, it could also be an IP address from a container, so let's look at the network namespaces and the IP addresses in them. None of the network namespaces have this IP address. So we've done the obvious, and now we need to take a step back to understand exactly what components are involved here.

Let's look at our CNI plugin in this cluster. We use the Lyft CNI plugin, which basically allows us to attach IP addresses from the underlying cloud provider, AWS, directly to pods, and not have an overlay. It starts with an IPAM plugin responsible for the AWS API calls. And then there is an ipvlan plugin that creates the main pod interface and gives it an IP.
There's another interface that is only used for communication with the host, and it doesn't really matter here. What is interesting is what's happening with ipvlan. So now that we know that on this host we're using ipvlan, let's see if we can create a new ipvlan device with this offending IP. And as you can see at the top of the slide, if we try, we actually can't, and we get a very clear error message saying that this IP address is already assigned to an ipvlan device. What I want to mention here is that this very clear message doesn't make it to the kubelet, because of the netlink library used by the plugin, which only surfaces the error code and not the error message. So we only get "address in use", and we don't have the full, detailed message, which would have been helpful.

So now we need to understand why the CNI plugin is using this IP. If we look at the registry of IPs used by the plugin, this IP is known, which means it has been used and assigned to a pod in the past, but it has been released. Somehow it's been released, but not completely, right? Because the IP address is still bound to an ipvlan device. So we looked at how this plugin handles ipvlan deletion. And as you can see here, there is code to delete the interface, but we never run it, because we use chaining, and in the case of chaining the deletion call is never made. So we never delete the device. However, it usually works anyway, because the runtime, at the end of the deletion of a pod, will delete the network namespace, which takes care of garbage-collecting everything.

So here's what we know so far: the IP was allocated to a pod; the pod was deleted and the IP was marked free; the ipvlan interface was not deleted when the runtime called network deletion; and so we can't reuse the IP. What's important about network namespace deletion is that when you delete the namespace, the kernel will tell you "okay, I can delete it", but it won't actually delete the namespace until every single process running in it is done running, right? But what's important here is that everything looks fine from the CNI and from the runtime, right? It's just that the kernel hasn't finished deleting the namespace.

Now that we have this piece of information, can we reproduce? What we did is we took a pod and we ran a process inside its network namespace from the command line. And as you can see here, the network namespace of this process is indeed the one of the pod, right? If we then delete the pod, everything seems to work: the CNI marks the IP as released. However, we can't reuse it and add it to an ipvlan device. And what's interesting is that if we enter the network namespace of the process we created, we can still see the ipvlan interface. So we have a very good idea of what's happening here.

So let's see if we can find the process holding this network namespace. What we did here is we looked at all the processes and their network namespaces, entered them, and looked for the IP. For our own test process, we were able to find it. However, we were not able to find it for the IP address that was already in use, right? So we had an idea, but we couldn't trace the problem back to an actual process. So we took a step back, because we had a good idea of what was happening, but we were not able to pinpoint exactly what was the culprit. We had a few additional data points: we knew that the problem started on one node and slowly propagated to others, and that regularly the problem was completely disappearing and then starting again.
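(For reference, the scan over processes and their network namespaces that we just described can be sketched like this. It is a simplified illustration, not our exact tooling; checking for the offending IP inside each namespace was then done separately, for instance with nsenter and ip addr.)

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// List every process on the host together with the network namespace it
// holds, identified by the inode in /proc/<pid>/ns/net. A namespace that
// still contains the leaked ipvlan interface but maps to no expected
// process is the kind of culprit we were hunting for.
func main() {
	byNS := map[string][]string{}
	pids, _ := filepath.Glob("/proc/[0-9]*")
	for _, p := range pids {
		ns, err := os.Readlink(filepath.Join(p, "ns", "net"))
		if err != nil {
			continue // process exited, or we lack permissions
		}
		byNS[ns] = append(byNS[ns], filepath.Base(p))
	}
	for ns, procs := range byNS {
		fmt.Printf("%s held by pids %v\n", ns, procs)
	}
}
```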
So what are we looking for? We're looking for a process that runs on all nodes and that is regularly stopped, started, or redeployed. What kind of process is that? Maybe a DaemonSet, right? So we looked at the DaemonSets, and we have a few of them, but there's one that does a lot of things, and it's the Datadog Agent. We were actually able to confirm that when the Datadog Agent was redeployed, the error went away. So we reached out to the team, and they told us: well, we actually have a feature that instruments connections from inside the network namespaces of containers. We looked into it with them, and it turned out this new feature was using a long-lived connection on a netlink socket to get this information, and of course this was holding the network namespaces open. To fix that, we just had to use short-lived connections instead of long-lived ones, so the network namespace could be released by the kernel.

Key takeaways: the interactions between all these components are complex. It's also very difficult, at the kernel level, to understand what's happening with networking, because a lot of information just can't be found, so you need to guess a lot. And I must say that we were lucky, because we caught this in staging, and also because this error was made very visible by the CNI issue. Without the CNI issue, we would have silently leaked network namespaces, with unknown consequences, right? And that's it for this network debugging problem. I love networking, but I think the story Eric is going to share next, which goes even deeper, into the kernel, is even more interesting.

Yeah, thanks so much. So, who ate my capabilities? This is the story that gave its title to the presentation. Okay, a word about Linux capabilities. Capabilities are a feature of the Linux kernel that allow us to give specific privileges to processes that are not normally privileged, that don't run as root, for instance. In the example I'm showing here, we have a small program that just displays the capabilities it has and decodes them for us. Initially it doesn't have any, and I then assign a capability to it through the extended attributes of the file on disk: here CAP_NET_BIND_SERVICE, which means this process would be allowed to bind ports under 1024, a privilege normally reserved for root. And once it has this capability in the file's extended attributes, if I run the program, I see that it now has this privilege.

So we wanted to use this for the API server, the kube-apiserver, which we run as a pod in certain circumstances. Quite simply, in our Dockerfile, we set the capability on the file. And then, when we run the pod, the strange thing is that it doesn't work. We are unable to bind port 443; we get a "permission denied" error, just as if we didn't have the privilege. If we look at the actual process characteristics, we see that indeed it does not have the capability: the permitted and effective capabilities of the process are all zero, all the bits are cleared, even though we should have 400, the value of the CAP_NET_BIND_SERVICE bit. And yet, if we look at the capabilities on the file system, they're there; it's what we built in the Dockerfile. So again, what is happening? If we think about the usual suspects, there's the "no new privileges" flag: we check, and that isn't set.
We're not using file systems mounted with nosuid. Our kernels are booted with file capabilities allowed. One interesting observation here is that, since the capabilities are all cleared and yet the program was launched by the kernel, it means that the file capabilities literally don't apply, according to the kernel. Another thing: we build the image on our laptops, and interestingly, there it works. That prompts us to look at the layers of the image that we build, and here we realize that the extended attributes of the file are actually different. In the laptop case, we see that the capabilities are encoded using version two, while in the image built by our continuous integration they use version three. We also see that in version three there's more information.

So what are these attributes exactly? In fact, we now realize that there's an extra option on the getcap command which we didn't notice initially, -n, which causes the extra information in the version-three capabilities to be displayed. And it is, in fact, a user ID: the root ID. So what exactly is this root ID? Well, in our CI's Docker configuration, we actually run with user namespaces. User namespaces are a great feature, very applicable in this particular context where we want to run untrusted code. What user namespaces do is remap user IDs and group IDs. And in this situation, the kernel needs to ensure, from a security perspective, that if a program run in one user namespace grants a certain set of capabilities to a file, you can't simply take that file and run it in another user namespace and acquire those same privileges; that would be an immediate security issue. So what the kernel does is persist the root user ID of the user namespace in the extended file attributes, so that it can ensure that if the file is run in another user namespace, the capabilities will not be applied unless the user IDs match.

Now, as it happens, this is irrelevant for container image builds, because clearly we want to build images in one place and run them anywhere else we wish. So we really don't want a V3 capability here; we just want the V2, without the root ID. In this case, the fact that the Docker/Moby build actually persists this information, this user ID, is a bug. So we developed a fix, the PR was merged upstream, and the fix should be in Docker 21. Key takeaways here: well, read the manuals. The capabilities manual page is really excellent; it contains a lot of detail and explanation and allows you to reason about what is happening with capabilities in great detail. Also read the getcap man page: this is the one where we initially missed the -n option, which would have been really useful. And finally, reproducible builds: yes, by all means, but only up to bugs.

So there we have it. Just to conclude, then: Kubernetes does an awful lot of heavy lifting for us, along with all the associated software around it, the container runtimes, the container network interfaces. However, the bigger your organization grows, the more strange issues you are going to encounter, and these will challenge you to go deep in your understanding of Kubernetes and the overall ecosystem. But it is extremely interesting to do so. We may have a strange definition of fun, but we definitely like digging into these kinds of issues.
We think this debugging also teaches us an awful lot about the platform, and that overall it allows us to give our users a better experience, because we're able to help them better when we know the platform well. It's not easy, definitely, but it is feasible, in particular because here we are dealing with open source, so we're able to dig through absolutely everything. It is definitely worth it, and we find it very rewarding. So thank you for listening and watching, and we hope to see you one day in person. Again, thank you very much.