Okay, all right — so welcome everyone to Lessons Learned Running Open Infrastructure on Bare Metal Kubernetes Clusters in Production. My name is Alan Meadows. I'm one of the lead architects for AT&T's Network Cloud, and I also work in the OpenStack-Helm and Airship communities — I work on OpenStack-Helm and some other things within AT&T.

This is going to be a technical presentation without any real stringent ordering to what we're going to cover. There are basically two sections: we're going to talk about some Docker experiences and some Kubernetes experiences, and we'll just go through them one by one.

Before we begin — we're going to talk a lot about negative experiences with Docker and Kubernetes, but we want to start by prefacing that we love Docker and we love Kubernetes. They've transformed the way we deliver software. Containers have effectively become the new unit of software delivery for us, and that's been great. We have no plans to change course and do anything different; despite some of the challenges we've faced, it's still the right path. And really the core message we want people to leave with is that Docker and Kubernetes are usually doing things that make a lot of sense, especially as they release new versions and bring new features. It's just that sometimes those things don't make sense for us — people who are trying to run open-infrastructure workloads, low-level things like Open vSwitch and libvirt, inside of Kubernetes. I think it's fair to say we're pretty atypical users of this stuff and often go counter to a lot of the recommendations about running pets and stateful workloads. Yep, and so hopefully you'll find some of our challenges interesting and maybe even helpful to the things you're doing.

So, Docker. Should we? Okay. We've had quite a lot of experience over the last two years running infrastructure that lags quite a long way behind the current upstream versions, partly because we wanted to build up a deep understanding of how these individual pieces interact with each other. Around the end of last year we found we were starting to experience issues with Docker 1.13 in some of our sites, where we were seeing state get corrupted as Docker itself started to get out of sync with containerd. We ended up in a situation where we were performing far too many manual operations to clean up sites and recover things, and we were unable to restart Docker to make minor configuration changes without heavily impacting our workloads — which is what pushed us towards upgrading. Sorry, I think these slides are out of order a little bit.

A lot of those challenges with 1.13 meant that the next upgrade we did was to Docker 17 CE, on Kubernetes 1.11.6. The upgrade itself was actually pretty trivial — we were worried it might be more complex, but effectively a package upgrade was all it took, it went rather smoothly, and we didn't really interrupt our workloads. However, about a day after we did the upgrade, the hosts began melting down — they started spiraling with load spikes and we had to reboot them to stabilize them.
So an upgrade that appeared to be straightforward turned out not to be. And our takeaway from this is that Docker upgrades in production are not as easy as we might first think. They don't necessarily behave the same way in production as they do in labs. Our philosophy is that these upgrades should be predictable, but they're not, and so the way we will achieve them going forward is by reprovisioning machines with the target versions. And that predictability becomes really hard when you start running things that interact heavily with the kernel and other bits of infrastructure — whether that's Open vSwitch or libvirt — because the surface area becomes much larger than it typically is.

Okay. As we touched on, one of the things that really started to affect us was keeping containerd and the Docker API in sync on infrastructure that had been running for a long time. We started to see container state corruption become fairly widespread in some sites that had been up for about a year or so. This manifested itself in Pod Lifecycle Event Generator — PLEG — errors coming out of the kubelet, where we would see huge load appearing on our machines as the kubelet continuously tried to poll the Docker API and essentially beat it into a corner. And through this we found that Kubernetes itself is not very good at detecting a sick Docker and backing off — it is rather relentless in the way it will hammer it. It's not good at detecting the problem and it's really bad at coping with it, so when you start to see these issues you get node flapping and all sorts of other churn in the cluster that we try to avoid. One of the lessons we take from this is that projects like the node-problem-detector are, we think, really critical in the long term for running things in a reliable manner.

Typically, most of the issues we saw we got around by restarting Docker and, failing that, rebooting the host — apart from a few occasions when we had to get very engaged because there was no appetite for disrupting the other workloads running on those machines. And it's because of that that, although we think Docker is a really great tool for development and building images — what it offers there is second to none — when you're actually running serious workloads, talking directly to containerd or CRI-O is probably a shrewder choice, simply because it removes the Docker API from the picture.

So, moving into the Kubernetes section. The first issue we want to talk about is a ConfigMap mount mystery. And the core message here is that transient errors in a system like Kubernetes, which can automatically self-heal, are okay — but persistent errors are a menace. They either mean humans need to get involved, or we need to write software to find these things and take some action to move them forward. One of the things we saw in Kubernetes 1.10: we do a lot of volume mounts inside of containers — the OpenStack-Helm project and other things use these pretty voluminously to mount scripts and other files into containers.
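To make that pattern concrete, here is a minimal sketch of a ConfigMap-backed script mount. The names and image are illustrative, not taken from the actual OpenStack-Helm charts:

```bash
# Illustrative only: deliver a script via a ConfigMap and mount it into a pod.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-bin
data:
  hello.sh: |
    #!/bin/sh
    echo "hello from a ConfigMap-mounted script"
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-configmap-mount
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    command: ["/tmp/bin/hello.sh"]
    volumeMounts:
    - name: bin
      mountPath: /tmp/bin
  volumes:
  - name: bin
    configMap:
      name: demo-bin
      defaultMode: 0555   # scripts need to be executable
EOF
```

The failure mode described next is what happens when a mount like this silently comes up wrong inside the container.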
And occasionally we would see containers that failed to find scripts that had been mounted into them in exactly that way. There was really no way to resolve the issue other than literally terminating the pod and allowing the sandbox to respawn — otherwise it would be perpetually stuck in that state, and Kubernetes would never automatically fix it.

We saw similar persistent issues with persistent volume claims — we view a ConfigMap mounted into a container as essentially just another volume, with a few extra behaviors. One of the things we found most scary when we encountered it: we use PVCs to back RBD volumes for MariaDB, RabbitMQ, PostgreSQL and a whole host of other things. The way Kubernetes manages these devices is that it provisions the volume and the kubelet makes an attempt to mount it — and if the kubelet can't mount it, it assumes you're dealing with a fresh volume and creates a brand-new file system, ready for use. That's great most of the time and kind of terrifying the rest of the time, because you sometimes end up in a situation where — say a host has gone down hard and you've got some fairly minor file system corruption — the kubelet will quite happily go around reformatting devices. Which made us pretty glad to have both backups and highly available databases. We found a rather unpleasant way around this, which we've now codified as a MOP within the platform: we scale down the workload so the volume isn't going to be mounted anywhere, take our Ceph configuration files and load them onto a host, manually map the RBD volume onto that host, run an fsck on it, and at that stage we can unmap the device, free it up, and get the workload back up and running.

Another issue we ran into, especially in the early days of OpenStack-Helm, was — The ephemeral cloud. Truly an ephemeral cloud, in the sense that initially a lot of the Neutron agents effectively had private network namespaces within their pods. This did a couple of different things. One: the network namespaces that OpenStack created were created inside the pods, so those namespaces stayed private to those pods. So when investigating a network issue, an operator would end up having to exec into dozens of pods on a particular host to troubleshoot standard Neutron flows — can I ping the router gateway, can I ping between the high-availability networks? It became a brand-new workflow they had to learn, and it was troublesome and extremely complex. And probably the worst thing: when the pods were cycled due to rolling updates — which could be something as simple as changing a configuration item for those Neutron services, or releasing a new image — it resulted in a real impact to tenant traffic, because when the pods were torn down, all the network namespaces that had been created inside them for OpenStack were torn down too. That was causing a big problem.

So one of the things we did was leverage bidirectional mount propagation, so we can take the host's view of the network namespaces and make it visible from the pod, and vice versa. Effectively, all the network namespace operations the pods perform become visible on the physical node. This also let people jump back to doing what they did in the past to troubleshoot Neutron.
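A minimal sketch of that arrangement — sharing the host's /run/netns into a privileged agent pod with bidirectional propagation — might look like this. The pod and image names are illustrative, not the actual Neutron agent charts:

```bash
# Illustrative fragment: a privileged agent pod that shares the host's
# /run/netns with Bidirectional propagation, so namespaces created with
# `ip netns` inside the container appear on the host (and vice versa)
# and survive the pod being cycled.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: netns-demo
spec:
  hostNetwork: true
  containers:
  - name: agent
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true            # Bidirectional propagation requires privileged
    volumeMounts:
    - name: run-netns
      mountPath: /run/netns
      mountPropagation: Bidirectional
  volumes:
  - name: run-netns
    hostPath:
      path: /run/netns
      type: DirectoryOrCreate
EOF
```

With that in place, `ip netns list` on the host shows the same namespaces the agent created inside the pod.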
So all of the DHCP namespaces, all of the Neutron router namespaces — they're all there in one place on one physical node, and operators can hop between them pretty easily. And I think this is quite a good example of how, as we've gone along this journey, Kubernetes has been either moving too fast for us or not quite fast enough: it was 1.10 before we were able to start doing this. One trade-off is that you can only use bidirectional mount propagation with privileged containers, so as well as providing an easy fix for one thing, it meant we had to go through and lock down our pods much more aggressively than we had previously in terms of what permissions we gave them.

Another challenge we've had is process reaping. Container processes need to run as children of something that will reap them — either the pod's pause container or the host's init system — particularly if you're running something earlier than Kubernetes 1.10. If you're not doing that, defunct processes sprawl, they never get reaped, and at some point your host melts down. This is exacerbated if those processes come from cron jobs and other things that run periodically, or if you're just doing a lot of releases and churning containers. You can run into the kernel's pid_max limit on the host, or the host can simply fall over because it has run out of resources. Some of the guidance we now use internally came out of incidents like this; one of the early ones was a set of health checks that were leaving zombies around, and after, what was it, three or four days of them being introduced, we started to see workloads dropping. Yeah — and in the process of trying to make things better with stricter, additional liveness probes, we ended up doing the reverse and hurting the system.

An interesting thing to point out is that this reaping behavior is exactly what you want for most things. But sometimes it's not what you want. We don't want reaping to occur when we're running something like libvirtd as a pod — we want those processes to escape the reaping process, because when you cycle the libvirtd pod you don't want that to tear down all of the QEMU processes it has spawned. We want to be able to make small changes to libvirtd, release new versions, and so on, and have it behave exactly as it would on a bare-metal system.

So we split this out. In general, for process reaping, we found that for things like our Neutron pods, setting shareProcessNamespace to true is the easiest fix (there's a minimal sketch of this just below); most container runtimes do this by default, with the exception of Docker. As a shorter-term measure we also ran things as child processes of bash within a container, which allowed us to reap processes — but because bash isn't an init system, signal propagation down to the children couldn't be relied on, so it has some crudities.

Oh, this thing. There's a lot of PTSD with a lot of these issues, so I apologize — I told Alan this morning that I might just be on stage crying. This was a fun weekend. So, to set the stage: with libvirt, within AT&T and by default in OpenStack-Helm, we use Ceph for Cinder volumes.
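The shareProcessNamespace fix referenced above is a one-line pod-spec change; a minimal sketch, with illustrative names rather than the real Neutron charts:

```bash
# Illustrative only: with shareProcessNamespace, the pod's pause container
# becomes PID 1 for every container in the pod and reaps any zombie
# processes the application containers leave behind.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-demo
spec:
  shareProcessNamespace: true
  containers:
  - name: agent
    image: busybox
    command: ["sleep", "3600"]
EOF
```

Inside the container, `ps` then shows the pause process as PID 1 rather than the container's own command.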
So: when libvirtd starts up and boots a VM, it spawns a QEMU process, which opens the libvirt — sorry, the ceph.conf — and holds that file handle open. This created problems for us when we tried to restart the libvirt pods: they were essentially getting wedged. We went down a bit of a rabbit hole here, where I and one of my colleagues were looking at bugs that had affected Mesosphere three or four years ago with similar symptoms, and eventually we ended up with a rather crude and pragmatic approach: mounting these locations from the host file system and copying the appropriate files over to the same paths on the host as they occupy in the container, which allowed us to skirt around the issue. And I think this is another great example of the sort of thing we set out to talk about — it's a weird, band-aid-like approach, but it's effective, because our use case is strange by the standards of normal Kubernetes use. We have these processes holding file handles open, and we don't want them terminated when we terminate the parent pods. We want to be able to remove the containers and the pods but leave the processes behind, which is pretty odd. And that leads us into how Kubernetes tried to close the door on this.

Yeah, yes. So — cgroup fun. This is an issue we had with — I forget the Kubernetes version — running on kernel 4.15 with systemd 229. A dangling cgroup was created whenever a pod had a volume referencing a Secret in its spec. It doesn't have to be mounted anywhere — the volumeMounts don't even need to be present in the pod spec — and cron jobs really accentuated the problem because they create a whole lot of churn. What are the ramifications of this? It effectively leaks cgroups, and the leaked cgroups end up causing huge CPU spikes and memory load for the kubelet, especially when it's scraped periodically, because each scrape walks all of the cgroups, collecting CPU and memory usage for every cgroup on the system. And this is another one of those examples where it takes time for the problem to fully manifest itself.

In investigating this, you'd hope it would be easy to figure out what was going on and what was causing the leaks — and really, the answer is that it wasn't. It only happened with certain combinations: it wasn't exactly the version of systemd, it wasn't exactly the kernel, it was the combination of the two that caused the problem, which is why it was really hard to arrive at the root cause. What's up on the screen is just an example of how, when you find these issues, upgrading the kernel or upgrading systemd is not always something you can put into practice immediately to solve the problem. So we end up with a lot of these temporary things running recurrently in the system to clean the leaks up until we can move forward with the kernel version or the systemd version (a rough sketch of one such cleanup job is shown below).

So one other thing we found — propagated by a couple of things, but in general — was that actually trying to upgrade Kubernetes from one release to another without impacting the workloads was sometimes pretty challenging, because we would encounter changes in the way that Kubernetes manages things.
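A rough sketch of what one of those recurring cleanup jobs can look like — this is illustrative, assuming a cgroup-v1 layout with the cgroupfs driver's kubepods hierarchy, and is not the exact job we ran in production:

```bash
#!/bin/bash
# Sketch only: walk the kubepods cgroup hierarchy depth-first and try to
# remove anything with no tasks left. rmdir refuses to remove a cgroup
# that still has tasks or child cgroups, so live pods are left alone.
# Run periodically (cron or a systemd timer) until the kernel/systemd
# combination can be upgraded.
set -u
for root in /sys/fs/cgroup/memory/kubepods /sys/fs/cgroup/cpu,cpuacct/kubepods; do
  [ -d "$root" ] || continue
  find "$root" -depth -mindepth 1 -type d | while read -r cg; do
    if [ -z "$(cat "$cg/tasks" 2>/dev/null)" ]; then
      rmdir "$cg" 2>/dev/null && echo "removed leaked cgroup: $cg"
    fi
  done
done
```

With the systemd cgroup driver the paths live under kubepods.slice instead, so the roots above would need adjusting.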
Upgrades are especially relevant for us when we're running VNF workloads, where we're in an environment where we can't drop a single packet, while simultaneously dealing with huge pages, CPU pinning and other things that the kubelet isn't expecting and that aren't being gated on in Kubernetes releases. So things break for us that pretty much only break for us, and we've had to implement really rigorous testing when upgrading from one Kubernetes version to another, including some fairly rudimentary regression testing. The other thing we would quite often find is that if you upgrade the kubelet, it leaves existing pods alone — and then on restarting the kubelet, Docker or even the host, things don't come back quite the way we'd expect. That was something we had to deal with quite a few times.

Looking at an example of this: at some point we jumped from, what, Kubernetes 1.8 to 1.10. Starting in 1.9 and getting more developed in 1.10, Kubernetes began managing hugepage cgroups as part of the preparation for the preliminary hugepage support that came in 1.11. This broke libvirt for us, in terms of allowing Nova VMs to successfully allocate huge pages — suddenly the libvirtd processes and their children, the QEMU processes, were being denied by cgroup permissions. It also broke our ability to survive restarts, because Kubernetes started being much more aggressive about tearing down cgroups, and it would take all the child processes underneath them too. If you're running a pod in the host PID namespace that forks things, that's typically exactly what you'd want — but it really came back to hurt us.

In the short term, to tackle the first problem, we started editing the cgroups in our startup script within the pod, which let us adjust the values so we could boot VMs again. Longer term — and I'm kind of embarrassed with this on screen — we took a much more aggressive approach and broke completely out of the Kubernetes-defined cgroups that the kubelet creates for processes. We created our own cgroups for CPU, memory and huge pages, and then, inside the pod, instead of launching libvirtd directly, we would first cgexec into that new set of cgroups and use systemd-run to start libvirtd as a transient unit on the host (a rough sketch follows below). That gave us back the behavior we'd seen under 1.8: we can run hugepage-enabled VMs easily, and we can restart our pods at will without impacting those workloads.

Another issue we experienced is time stealing — and, more generally, negotiating CPU pinning with Kubernetes. Our compute nodes are Kubernetes nodes themselves: the nova-compute agent, libvirt and everything else we've covered are containerized on those hosts, and the nodes running VNF Nova workloads are actual members of the Kubernetes cluster. So this is again one of those examples where Kubernetes probably doesn't expect you to have another resource manager on the node — something like Nova attempting to do CPU pinning — and we run VNF workloads that expect dedicated access to their cores.
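The libvirtd launch wrapper mentioned above looks roughly like the following — a sketch only, with illustrative cgroup names and flags rather than the exact production script, and assuming cgroup-tools (cgexec) plus access to the host's systemd from the privileged pod:

```bash
#!/bin/bash
# Sketch: break out of the kubelet-created cgroups before starting libvirtd,
# so that cycling the pod does not tear down the qemu processes it spawned.
set -e

# Create our own cgroups, parallel to (not underneath) the kubepods hierarchy.
for ctrl in cpu,cpuacct cpuset memory hugetlb; do
  mkdir -p "/sys/fs/cgroup/${ctrl}/vnf"
done
# A cpuset cgroup is unusable until cpus and mems are populated; inherit the root's.
cat /sys/fs/cgroup/cpuset/cpuset.cpus > /sys/fs/cgroup/cpuset/vnf/cpuset.cpus
cat /sys/fs/cgroup/cpuset/cpuset.mems > /sys/fs/cgroup/cpuset/vnf/cpuset.mems

# Enter the new cgroups, then hand libvirtd to systemd as a transient unit
# on the host rather than running it as a plain child of the container entrypoint.
exec cgexec -g cpu,cpuacct,cpuset,memory,hugetlb:/vnf \
     systemd-run --scope --unit=libvirtd-vnf /usr/sbin/libvirtd --listen
```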
Obviously, when those workloads don't actually get dedicated cores, performance suffers — and that's really bad when you're doing things like voice. Our testing tenants started to report mysterious drops in call tests and other things like that, and we started to investigate what was going on. By default, Kubernetes workloads are scheduled onto any core, even cores that are part of the isolcpus set, which is the standard way people protect certain CPUs for VM workloads. So on Nova compute nodes that are also Kubernetes nodes, we needed a way to restrict Kubernetes from using the isolcpus cores. Given an isolcpus set that looks something like what's up on the screen, we needed a way to launch the kubelet with a custom cgroup that would effectively impose that limitation, because this functionality wasn't baked in at the time. The approach we took was to create a new custom cgroup holding the inverse of isolcpus — effectively telling the kubelet it could use the opposite set of CPUs from the ones we were dedicating to VNF workloads. We needed to build a new systemd unit that performed this cgroup creation before the kubelet launched, and finally we needed to write the script that sets up that new kube-whitelist cgroup and pass it as a new parameter to the kubelet. We don't have an example of that script, do we? No, we don't. If anyone is curious, we can certainly share it — I think it's got some pretty impressive use of sed. I think sed is called three or four times in that script, chained together into something daunting.

Another issue that plagued us a little was slow state-change propagation inside Kubernetes. We leverage init containers quite heavily, and they do all kinds of things for us. One of the most important is dependency checking: it allows us to roll things out in an ordered fashion without needing a master orchestrator applying them in any special order. We can throw everything at the wall, let Kubernetes sort it out, and treat crash loops as something to pay attention to rather than something that just occurs naturally. What we started to notice was that dependency checking would take longer and longer and longer. Even though we'd see that a resource like a job had already completed ten or fifteen minutes earlier, we would look at an init container that depended on it and it would still be sitting there waiting for it to complete — there was a disconnect going on. Something else we saw, which we probably spent longer than we should have scratching our heads over, was that you would get different responses from different things: a kubectl describe on a pod would show a container having started and exited, but kubectl get would show it still in Pending. So depending on where you queried, you'd see very different answers being fed back. One of the reasons we believe this occurred is that we had doubled the number of pods per node — we run pretty highly stacked nodes — but at the end of the day we hadn't doubled everything else that needed to scale with it. And one of the things we did that greatly relieved this problem was raising the kube-api-burst and kube-api-qps parameters.
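On the kubelet, that change looks roughly like this — a sketch assuming a kubeadm-style service unit that honors KUBELET_EXTRA_ARGS, with illustrative values rather than a recommendation (the defaults in that era were on the order of 5 QPS with a burst of 10):

```bash
# Sketch: raise the kubelet's API client rate limits via a systemd drop-in.
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-api-qps.conf <<'EOF'
[Service]
Environment="KUBELET_EXTRA_ARGS=--kube-api-qps=30 --kube-api-burst=60"
EOF
systemctl daemon-reload && systemctl restart kubelet
```

The controller-manager and scheduler have equivalent flags, and on densely packed nodes they usually need to move together.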
And that really allowed Kubernetes to keep up much better with all the state changes that were occurring, and it made the issue go away.

Okay. Oh, this. When was this? It was the first of June or July — I think it was definitely the first of a month. It was a rainy day. It was bleak. It started so well, and then this happened. We have several cron jobs within our system that do routine maintenance tasks, whether that's checking Ceph OSDs or cleaning up dead heat-engine records. And we discovered this rather nice quirk, which is pretty scary when it happens: if you create a cron job whose container explicitly calls a path that doesn't exist, Docker freaks out. You start getting OCI runtime errors coming back, Kubernetes does its thing and has another go, and within a couple of minutes you start to see this sort of thing building up from that manifest on the previous slide. Within a couple of hours, nodes become very unresponsive, and about ten minutes after that, phones start ringing. This is an area where it's pretty hard to come up with a clean solution, other than emphasizing the value of testing everything before putting it out there — because in many ways this is a perfectly valid description of a job, and without knowledge of the image there's no way you could write an admission controller to catch this sort of thing. So you're left in a scenario where things can creep out that have really catastrophic and unintentional consequences. I think it was a very hurried bit of kubectl delete that intervened there.

So another lesson we learned is that Kubernetes Services are not load balancers. Many people are probably already familiar with what they provide: a Kubernetes Service gives you a really convenient and powerful way to retrieve a list of pods that sit behind some service, and it's very tempting to use it to actually distribute all the incoming requests to a particular workload. Under the hood, iptables is being used to distribute that traffic among all the backend pods that are ready to receive it. And then a node goes down — or a network partition occurs, or any number of other things — and the problem is that iptables is not really that intelligent: it sprays the traffic over all of those endpoints, some of those endpoints can be dead, and what you end up with is requests to Kubernetes Services that just go into a black hole.

So, just briefly, to set up how this works — that looks a little weird; there we go. Like we said, you create a Kubernetes Service and it gets a magical, mythical cluster IP that's effectively a series of iptables rules on the host, and those rules DNAT traffic to the real pod IPs actually running the workload, in a probabilistic manner. A lot of things go into whether a pod is in that list of endpoints. Is the node online? Has the controller-manager detected a down node? There are lots of complex timers involved in detecting whether those pods are online or offline. Is the pod in a Ready state? Effectively, it's not a foolproof system and it doesn't react immediately in real time. And why does this matter to us? It matters because we need to look at what a typical OpenStack service request looks like.
When you do something even as simple as listing Heat stacks, this is roughly what occurs — it's a simplified version, but something like this. You first request a Keystone token: that goes and talks to MariaDB, emits an event to RabbitMQ for CADF notifications, hits memcached, and then you get your token back. The request then goes to the Heat API service, which does a very similar set of things, and it tries to talk to heat-engine via RabbitMQ — and eventually, through this Byzantine labyrinth, you get your response back. It only takes one bad actor in this chain and you start seeing failed responses. So, as we're saying, as these interdependencies grow, so does the likelihood that a single node failure or network partition or anything like that will interrupt part of this flow. And the most important thing is that to the outside observer — the person doing the heat stack-list — the request failed. They don't care that we completed 90% of it; if we can't get the last mile, it doesn't really matter. And it's pretty hard to debug, because when you first go into the cluster and have a look, everything looks okay, and you end up having to dig much deeper than you would like to see what's going wrong.

So — oh, go ahead. So what's the solution to this? From our perspective, Kubernetes Services are really useful for looking up available endpoints, but you shouldn't really be using cluster IPs — at least with iptables — for services that cannot tolerate dropped requests. We're starting to explore using ingress controllers, either NGINX or HAProxy, with templates enhanced to actually do validity checking, timeouts and retries. So if we do start to see failures within our infrastructure, we might see some slightly slower responses, but they will at least always eventually succeed.

Another lesson is that self-hosting log management is hard. The CNCF recommends Fluentd, which is great, but if you're actually building Kubernetes clusters and you have logs that matter at the time you're building the cluster, your logging framework might not be up yet. And how do you monitor the bootstrapping of new Kubernetes clusters when your logging and monitoring platform isn't in place yet? You get vicious cycles where the log management system itself — Fluentd, Elasticsearch — can introduce enough load to hurt the cluster. And finally, a really consistent challenge is that OpenStack does not handle its logging system disappearing out from under it very well. So you potentially end up with tight coupling between OpenStack and Fluentd if you choose to go down the path of that integration, and that's one more dependency and one more thing that can fail and hurt your OpenStack services. It's partly because of this that we're considering going back to more primitive approaches. Yeah.

Okay. So this — I think Alan can probably talk about this example better than I can, but I think it really exemplifies why Kubernetes is hard, in the sense that there are a lot of moving parts — probably a lot fewer moving parts than some other traditional infrastructure management like OpenStack — but you might feel that you've reached a certain level of security and then suddenly, later on, discover that there's a backdoor.
Yeah, and this might be an issue many people are familiar with, as a CVE was created for it a while back, but it's also one of those things that was open for quite some time. It's just a configuration parameter that you need to set on your kubelets, but most people weren't configuring it that way because that wasn't the default. And so this is an example of effectively being able to run privileged commands with a simple REST request: in the black boxes we have here, we have, in sum total, described how to talk to the kubelet and run a command as root on the host — and that's quite a big deal. I think it speaks to the fact that Kubernetes has a fairly small number of components, but it's still hard at the end of the day, and it's really easy to miss things like turning off anonymous auth, because when you assume the default is safe you tend to move on. And I think, from my perspective, that's why it became a CVE.

So, with 50 seconds left — any questions? I think there's a microphone.

Hi, that was a really good talk. Just a quick question. When you talk about corrupted Docker, are you talking about containers running in your production workload that have become corrupted — is that correct? Yeah, or various state files for Docker on the file system of the host that have become corrupted. So can you talk about the range of behaviors that you've seen from a corrupted Docker instance or container — something that meant it didn't fail the health check — and the tools that Kubernetes gives you to validate that it's okay? Thankfully, I can't think of an instance where we've actually seen that. We've seen many and varied reasons for an inability to run a workload, but I'm unaware of a time when we've seen workloads running but not running as they should. Yeah, most of them are startup failures. And as we mentioned, we don't really care about the individual containers themselves, so it's just a matter of literally purging them and letting them be re-instantiated from ground zero. Thanks.

If this question's hard, we can just say we're out of time. Yes, let's hear it first. After all this work, do you think it's really worth it to run the OpenStack control plane on Kubernetes, given that the OpenStack control plane was never designed to be cloud native? Yes, without question. I think a lot of the challenges we've faced have not been unique — we've described some of the Kubernetes-specific ones here, but in general it forces you to build a very resilient deployment and deal with the same problems you should be dealing with anyway if you're deploying via Ansible or managing via Puppet; it just puts those things right up front and center. And then the other advantages you get — the ability to perform reconfiguration, to check the state of things — really pay off. Okay, thank you. I think that's our cue to get off. Okay, thank you.