Hi, thank you for tuning in. I am Sunita Moth, and I work as a director of AI model systems at NeaMap. Today we are going to talk about a peculiar kind of crime scene investigation: KCSI, whose focus is "who killed my pod", and who exactly did it.

Every intriguing investigation begins with a disaster, and ours is no different. It started when we kubectl-applied our pixie-dust magical app, which we thought was well tested in the local environment and on bare metal, and as soon as we deployed it, all hell broke loose. We started to see containers being killed left and right with OOMKilled as the reason and exit code 137, and the pods then went into CrashLoopBackOff, which is no joy to look at when you are getting things out into production. That sparked a massive hunt for what exactly happened and who killed the pod.

So let's start with the investigation. We have a particular container process that is being killed repeatedly, potentially because it is a memory hogger. We know that the kill is forceful and brutal: the process is killed with SIGKILL as the signal. So either somebody with more power has killed the process against its wishes, or it is a runtime error.

Okay, so we know some context; let's look at the details of how containers run on a host machine and who is in charge of the container process lifecycle on that host. We obviously have a host, either bare metal or a virtual instance, and on top of it we have the container runtime. The container runtime is comprised of two levels: the lower-level runtime and the higher-level runtime. The lower level deals with interacting with the host using system calls and making use of kernel and OS features; the higher-level runtime deals with the higher-level APIs that interface with users, image management, and similar aspects. runc and containerd are two very popular examples of these runtimes, and there are obviously many others. So we have the container runtime, and your process, which we call the container, runs on top of it; the layers underneath, the runtime and the host, guarantee the isolation of your container processes so they are not stepping on each other's toes. This is what happens at the host level.

Now, when we put Kubernetes in the mix, which is a container orchestrator across many hosts rather than just one, we need another process that is responsible for managing container processes on the host in its own way and also coordinating with the rest of the cluster, which is essentially what Kubernetes, or one part of it I should say, has become very big at doing. The process that handles this management at the host level is called the kubelet: it is a systemd-managed process that manages your running processes, and the way it does this is by talking to the container runtime provisioned on your node. The runtime in turn talks to the image registry and deals with the container process lifecycle. cAdvisor, which is part of the kubelet, is the way to grab container resource utilization and stats. So the kubelet talks to the container runtime to facilitate the lifecycle of container processes, and it also carries cAdvisor's metric collection. There are other parts in that figure that I am not going to touch on much, because they are not the focus of this discussion.
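As a quick illustration of what that incident looked like from the Kubernetes side, here is a minimal sketch (the pod name is a placeholder, not our actual workload); the OOMKilled reason and exit code 137 are what you would look for:

    $ kubectl get pods
    # the affected pod cycles through CrashLoopBackOff with a growing RESTARTS count
    $ kubectl describe pod <pod-name>
    # under the container's "Last State" you would expect something like:
    #   Terminated, Reason: OOMKilled, Exit Code: 137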
Now, in current versions, all the way up to Kubernetes 1.20, the way the kubelet talks to the container runtime is via a thin layer called the dockershim, which is a means to mitigate the overhead of talking to Docker. This is all going to change a lot from Kubernetes 1.20 through 1.22, but for what we are talking about it is not so relevant.

Okay, coming back to our actual investigation. We know that our containers are being forcefully killed, but not through manual intervention. Based on what we have discussed, it is either the kubelet, which is in charge of managing the process from a Kubernetes viewpoint, or the OS kernel, which is in charge of running processes on the host. One of those two is killing the process; it doesn't look like there are other suspects we would have to deal with.

So let's talk about container processes and how their isolation is guaranteed at the OS level. Namespaces and cgroups are the two important mechanisms that allow for isolation of the container process. Namespaces are responsible for isolation of the file system, networking, and other aspects. cgroups, on the other hand, are a kernel feature that allows for hierarchical management of resources and resource requirements; memory and other resources can be managed via cgroups. To elaborate on that, I think a bit of a demo would help, and that is what we will look at now.

Okay, so what I am going to do now is start a container process, and the container process I am going to start is essentially IPython in a container built from a numpy-dev image. All I am doing here is saying: give me IPython as a process from the numpy-dev image and run it. Now I want to look at the stats for the footprint this container is running with, and as you can see it is using about 35 MB of memory and some CPU. Next I am going to start the same container again under a different name, with 40 MB as the memory limit. Now you can see I have two processes running, and they have been running for a little while.

Let's talk cgroups here. cgroups are managed through a file system, like this, and you can see all the kinds of resources that cgroups manage. If I look into memory, I see all the aspects of memory that are managed by cgroups; of interest are kernel memory, the memory limits, and a few other things. If you look closely here, you see a folder called docker. Under memory we have docker, and you can see there are IDs that correspond to the containers we have just run, 4437... and 65c6...: one is the short ID and the other is the full ID of a container. Now let's have a look at what the memory limits for the containers are. I am just going to cat the memory limit file, saying: give me the one for the full ID of my container with the 40 MB limit, and you can see that when I do this I get about 40 MB. If I do the same for the other container, I see a really large number: this is actually the maximum int64 value rounded down to a multiple of 4 KB, the Linux page size. It is a way of saying that this container is running without a memory limit, that is, with unlimited memory.
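A minimal sketch of that demo, assuming a cgroup v1 host and using a placeholder for the numpy-dev image name (the container names echo the ones used later in the talk):

    $ docker run -dit --name kcsi-limitless <numpy-dev-image> ipython
    $ docker run -dit --name kcsi-40mb --memory=40m <numpy-dev-image> ipython
    $ docker stats --no-stream          # per-container memory and CPU footprint
    $ ls /sys/fs/cgroup/memory/docker/  # one directory per container, named by full ID
    $ cat /sys/fs/cgroup/memory/docker/<full-container-id>/memory.limit_in_bytes
    # ~41943040 (40 MB) for the limited container,
    # 9223372036854771712 (max int64 rounded to the 4 KB page size) for the unlimited one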
Now let's do another thing: let's bring up another container with 20 MB of memory, which is obviously not enough for what we need to run, and start it. If we look at what happened, we see that the 20 MB container was just killed. What we want to do now is inspect this container: all I am doing is a docker inspect on it, looking only at the State section, and you can see it says the process was OOM killed, with an exit code of 137. We could do the same thing with the 40 MB container and we would see that it was not OOM killed and is still running. So we have just simulated the OOM kill scenario in our local environment.

Now I want to show you something interesting, and this is about the swap warning we were getting earlier. It was saying that the kernel does not support swap limit capabilities and memory is limited without swap. That means if I ask for a container with only 20 MB of memory, my container really is limited to 20 MB, because swap, which is the capability of the system to go beyond what is available in physical RAM and use a little more than that capacity, is effectively disabled there, so we can't use more than what we have. But this is my Mac now, and I will run the same commands as before, starting limitless-kcsi and 20mb-kcsi, and you can see that when I run this, the 20 MB container is actually still running; it hasn't been killed. I can look at the stats and it says it is using about 90 MB, because it is making use of swap for the extra memory it needs. Now let's do it differently: let's make it swapless, which I do by explicitly specifying a swap limit equal to the memory limit, thereby disabling swap. The moment I do that, you see that my strict container is killed, and we can inspect it and see it was killed for exactly the same reason as before.

Okay, so far we were looking at the docker hierarchy. If we are in a Kubernetes context, that is, if the node is a member of a Kubernetes cluster, then we would also see a kubepods hierarchy at the same level as docker. In this example there is actually one pod running here, in the same way as we had a docker container running, and if we look into that pod we see two containers running inside it. One of them is the pause container, which is the container that gets created first to set up the networking and the right namespaces and environment, so everything is set up beforehand, and then the actual process container is created. If this pod had two application containers in it, we would see three entries here, for three different containers, including one for the pause container. So that was about kubepods.

Now I quickly want to touch on the process itself. We said a container is a Linux process, so how do I know the process ID of my container? We do a docker inspect and ask for the process ID of the container whose ID belongs to this particular limitless-kcsi container; in other words, we ask for the process ID via docker inspect using the container ID, and then we look at the process listing, and you can see that 3505 is the process ID at the host level: the host sees this PID 3505, and that is our Python process.
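A rough sketch of those inspect commands, reusing the demo container names and the placeholder image (outputs elided):

    $ docker inspect -f '{{json .State}}' kcsi-20mb
    # for the killed container, expect "OOMKilled": true and "ExitCode": 137
    $ docker run -dit --name kcsi-20mb-strict --memory=20m --memory-swap=20m <numpy-dev-image> ipython
    # setting --memory-swap equal to --memory disables swap for the container
    $ docker inspect -f '{{.State.Pid}}' kcsi-limitless
    # host-level PID of the container's main process, e.g. 3505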
Now, there is an interesting pseudo file system that Linux manages, and this is basically under /proc (proc is for process). We can look under /proc/3505 and see all the properties of the process; some of the entries are streams and pipes that other parts of the system use to communicate with the process. What I particularly want to talk about are the limits: you can see all the limits of the process listed here, of various kinds. I also particularly want to talk about the OOM score: this is how the system tracks the OOM score for this process, and in the event of a memory crunch, the scores are used to decide which process should get killed. There are actually three files under /proc related to OOM: one is the score itself, which records the score, and the other two are adjustment files, that is, how much you want to offset the score; there are two of them because of different Linux versions, but that is not so relevant for what we are talking about. So we have looked at how cgroups work and allow for container process isolation, and we have also looked at how OOM scores are managed and how the kernel knows which process to kill.

Okay, slightly changing gear and talking more about how we ensure quality of service and resource requirements. When a pod runs on Kubernetes, the user can specify resource requirements, and this happens in three different fashions. One: you don't specify any limits, and in that case the quality of service class is BestEffort. The second is Burstable, where you give a range (a request and a higher limit), and based on that range Kubernetes will try to place the pod as best it can; these pods get the Burstable quality of service class. The third one is Guaranteed: if you give the same memory request as the limit, then the pod gets the Guaranteed quality of service class from the Kubernetes viewpoint. The Guaranteed ones are the best kind of pods: they are the least likely to get killed in the event of an OOM, and they are also good pods in the sense that they are easier for Kubernetes to place, because there is no variability or fuzziness in what resources they require. Conversely, the BestEffort pods are the most likely to get killed, because they are the most unpredictable from a resource management viewpoint.

So we have talked about resource requirements and quality of service and how likely a pod is to get killed. Now, in terms of resources, there are two broad kinds: compressible and incompressible. Compressible means that if you breach your limits you are likely to get throttled, but it won't be a deadly event. Incompressible, on the other hand, means that if you breach your limits, the cgroup machinery will kill your process. Memory is an incompressible resource, and CPU is a compressible resource. So, based on what we have discussed, the question here is: can Kubernetes overcommit, or does it allow you to overcommit, and how about the Linux kernel? This is a very important question, and it is not one we are posing for the first time; it has been asked for many years. From a kernel viewpoint, the common view is that overcommit is a feature, not a bug, although there are contrary opinions arguing that overcommit is a bug. So can Kubernetes overcommit? Absolutely. If we allow pods to run with limits, it is possible that two pods run with a big range between requests and limits
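To make the OOM bookkeeping and the QoS classes concrete, here is a minimal sketch; the PID is the one from the earlier demo, and the pod name, image, and resource values are illustrative placeholders rather than our real workload:

    # OOM bookkeeping for a process
    $ cat /proc/3505/oom_score        # score used to choose an OOM victim
    $ cat /proc/3505/oom_score_adj    # adjustment file (newer interface)
    $ cat /proc/3505/oom_adj          # adjustment file (legacy interface)
    $ cat /proc/3505/limits           # all the per-process limits

    # guaranteed-pod.yaml -- requests equal to limits for every resource gives Guaranteed QoS
    apiVersion: v1
    kind: Pod
    metadata:
      name: kcsi-guaranteed          # hypothetical name
    spec:
      containers:
      - name: app
        image: <numpy-dev-image>     # placeholder image
        resources:
          requests:
            memory: "31Gi"
            cpu: "2"
          limits:
            memory: "31Gi"
            cpu: "2"

    $ kubectl apply -f guaranteed-pod.yaml
    $ kubectl get pod kcsi-guaranteed -o jsonpath='{.status.qosClass}'   # Guaranteed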
and collectively they breach the total available resources. And mind you, swap is disabled on Kubernetes nodes, for good reasons, so Kubernetes can overcommit, but when it overcommits, it overcommits without any backing to fall back on. The same thing applies to CPU: Kubernetes can overcommit from both a CPU and a memory viewpoint. Operating systems are configured to overcommit by default as well, but they do have swap space to fall back on, so they can use a little more than the physical memory when it is exhausted.

Okay, so going back to our actual crime scene. When we profiled locally, we never really saw the resource requirement go beyond six gig; it was always in the 6-gig landscape. We tried many different combinations of quality of service, trying to mitigate the issue while we investigated how to fix it, and finally we settled on this really crazy 31 gig as the memory requirement for the container, leaving the rest of the node for the system to run on. But we would still see OOM kills; we would still see the Guaranteed pods being killed, even though we never noticed any spikes or any case where the limits were breached, and at no point was the node found to be under duress: it had enough memory left, and from the monitoring viewpoint it was never really reaching its own limit. So, reviewing our board: we know the container behaves reasonably; we are still getting kills; the quality of service is configured to the best of our ability and still there is a kill; it is likely an overcommit-related issue; and the suspect is now either the kernel or the kubelet.

Then we had a bit of a lucky break. We found a critical event where we noticed a particular spike in memory usage: it would go all the way to a little over 28 gig, still within the limit of 31 gig. It was a lot less frequent than the actual kill events and, like I said, within the bounds of the memory limit, so we still did not know why the containers were getting killed, but we did have an indication that sometimes our processes spike their memory usage for some unknown reason.

Okay, we talked about OOM scores before; how does the OOM workflow get triggered? Linux manages the kmsg kernel messaging stream, where all these events are put, and a kmsg parser then parses it and translates it into events. The two events we are interested in are the OOM event and the OOM-kill event. The OOM event indicates that a memory surge is happening, that there is actually an OOM situation; the other one says: okay, we have an OOM, we need to recover, let's start the killing, which is the OOM kill. Then there are watchers that watch for these events and take appropriate action: the kubelet has such a watch implemented, and so does some processing in the Linux kernel.

Okay, so there is nothing in the application log that indicates any problem with a memory surge, but we know there is a surge, very rarely. We don't see any OOM event of any kind on the kube-system pods or processes, or in any of the event logs; as far as Kubernetes is concerned, the node is all healthy and it's all Hakuna Matata. So based on that, we conclude that it is not actually the kubelet who is killing the process. Let's look at the kernel logs then, and this is where we landed: an excerpt from our kernel log.
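For reference, this is roughly how you can pull those kernel messages off a node (a sketch, not our exact runbook; the journalctl variant assumes a systemd-based host):

    $ dmesg -T | grep -iE 'oom|killed process'
    # or, on systemd hosts
    $ journalctl -k | grep -iE 'oom|memory cgroup out of memory'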
We can see that the OOM killer was invoked, and it was invoked because the cgroup ran out of memory. You can see right here that oom_kill was invoked, and then the kubepods pod, which is the application pod, was killed as a result of hitting the memory limit. The actual usage was around 31.4 to 31.45 gig; there is a difference of only about 1.5 thousand kB, so it was still just short of the memory limit, but almost reaching it. You can also see that swap usage is zero, and at the OS level the swap limit is unlimited, so memory usage is effectively unlimited from the system's viewpoint. The kernel is using a little bit of memory, but that is not really application usage, and it has its own limit, as we were seeing earlier.

Okay, so let's talk about some of the cgroup memory metrics here. RSS is the resident set size, the portion of memory occupied by the process that is held in RAM; it represents the anonymous and swap cache memory. RSS huge, on the other hand, is the anonymous transparent-huge-page memory, which corresponds to huge pages: when a 4 KB page size is not enough for the sort of requirement you have, for example large memory allocations, the page size becomes a lot bigger than 4 KB, and in that case RSS huge indicates the anonymous huge-page memory. We also know that anonymous memory, often abbreviated as anon, is memory mapped with no device or file backing, which also means the memory on the heap and the stack.

Coming back to the second part of the log: you can see here that the pod's actual usage is stated, and within that pod there are two containers. One is the pause container, which is using very little memory, so it is not the focus right now; the other is the application container, and you can see it is using active anonymous memory of 31.38 gig, the cache is small, the RSS is also 31.38 gig, and the huge-page memory is 29.9 gig. So, all in all, only very little memory in use is not on huge pages, and we are still very much within the limit of the container, which was 31.45 gig as stated. So this is where we are: the process is almost reaching the limit, the OOM killer is invoked, and the process is killed, while there is still enough memory available on the node; the node is not under duress. And if you look at total_vm, you can see that we have a really ridiculous amount of virtual memory allocated, 62.59 gig, so it does look like we have some kind of unused memory allocation going on. But coming back to why the process was killed: we know it was because it was reaching the memory limit, even though it was just a little bit shy of it. To be honest, I am not sure why it was still getting killed when it was just shy of the limit, but it was being killed while just shy of it. And I must mention: if you haven't seen Ian Lewis's blog post on the almighty pause container and you want to know more about what the pause container is, I highly recommend reading through it.

So, as I mentioned, we were quite confused about why the process was getting killed despite being shy of the limit, but given it is only a few kB shy, we are just going to accept that it gets killed there. Now, the problem we have is that when we get an OOM kill, the behaviour is exactly like that: it is a runtime error.
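For completeness, the per-cgroup counters mentioned above can also be read directly on the node; a minimal sketch, assuming cgroup v1 (the kubepods path is illustrative and varies with the cgroup driver, QoS class, and pod UID):

    $ grep -E '^(rss|rss_huge|cache|swap) ' \
        /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.stat
    $ cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.usage_in_bytes
    $ cat /sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.limit_in_bytes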
You can't recover from it gracefully, and you can't even fail safely, and that is exactly what we don't want. What we want is a safe recovery and to handle OOM in a more predictable fashion. So we looked at disabling overcommit and failing more gracefully at the application level rather than being abruptly killed. There are two settings at the VM level: vm.overcommit_memory, where the value 2 means overcommit is disabled, and another setting, the overcommit ratio, which is only used when you disable overcommit. Together they are a way of saying: I don't want my system to overcommit, only commit up to the total allowed memory. Up until kOps 1.18 there was no way to provide this setting: if you provided it via a user-data script on the node provisioning side, it would essentially get overwritten. From kOps 1.18 onwards there is a sysctl parameters property that kOps allows you to use to configure your clusters, but even with that, explicitly setting the overcommit setting, vm.overcommit_memory = 2, is disabled; there is no way to systematically provide this setting and make sure your nodes actually disable overcommit. So really the only choice we had was to update the nodes on the fly, after they had come up and joined the Kubernetes cluster, to override this overcommit setting. So we essentially ran a script to update all the nodes with this setting. There is probably a reason why it is disallowed, but we don't know the details, and that is why we applied it selectively, only to the particular fleet of nodes where this particular process runs.

So, reviewing our board: we know who the killer was, it was the kernel all along, and we also know that our supposedly well-behaved application was not truly well behaved. Disabling overcommit definitely mitigated the issue: the process ran for about nine hours without any interruption, where previously the kills would happen every 15 to 20 minutes, so very frequently. This was a good mitigation, but obviously it is not a solution: we had an application hogging more memory than it should. We then reviewed the application, brought the memory footprint down, and added a chunking option to further ameliorate the problem.

So, all in all, in summary: we had a container that was killed because it was reaching its memory limit, and it was killed by the OS kernel. Disabling overcommit mitigated the issue, but the actual fixes were to reduce the memory footprint and to guarantee the quality of service and resource requirements of the pod, to have a more reliable experience going forward. That is all I had. Thank you very much. I hope this was an interesting listen into our experience and that there was some learning in it for you. If there are any questions, I am happy to take them. Thank you.
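As a rough sketch of what that on-the-fly node update could look like (not our exact script; the ratio value and file name are illustrative assumptions):

    $ sudo sysctl -w vm.overcommit_memory=2   # 2 = do not overcommit
    $ sudo sysctl -w vm.overcommit_ratio=100  # illustrative value, only consulted when overcommit is disabled
    # persist across reboots
    $ printf 'vm.overcommit_memory=2\nvm.overcommit_ratio=100\n' | \
        sudo tee /etc/sysctl.d/99-no-overcommit.conf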