Welcome everyone. My name is Anmol Krishan Sachdeva. Hope you all are enjoying QCon. Today I'll be talking about managing the container lifecycle correctly.

Before moving forward, let me introduce myself. I'm a site reliability engineer, currently working at the OLX Group. I'm an international tech speaker, a distinguished guest lecturer, and I have represented India at reputed international hackathons. I love doing research in the field of deep learning and computational neuroscience and have eight-plus publications to my name. I describe myself as an all-stack developer, that is, a person who is capable of designing and developing solutions for platforms like web, mobile, desktop, and embedded systems. Apart from that, I also like mentoring people.

Let's have a look at the OLX Group. OLX Group is a global product and tech group consisting of 20-plus firms. We are an online buying, selling, and exchange platform serving approximately 350 million people per month, and we are present in around 45 countries across five continents. We have more than 10 million online listings every single month and billions of visits per day. We also have hundreds of thousands of cache reads per second, and all of this is backed by the hundreds of microservices that we run in our Kubernetes clusters. That's fantastic. Let's have a look at the infrastructure landscape. Most of the tools and platforms that we use at OLX belong to the CNCF landscape; I'll not drill down into the details of this.

Let's have a look at the agenda of today's talk. Today's talk is divided into six segments. The first segment is about UNIX processes and init systems; we'll also be talking about zombies and orphans in this section. The next section is about the managed lifecycle. Here we'll be talking about the pod and container lifecycle along with Linux signal handling. The third section talks about resiliency and high availability through health checks and probes. Specifically, we'll be talking about the liveness probe, the readiness probe, and the startup probe. The fourth segment is about lifecycle hooks and graceful termination of pods; we'll go into deep detail of how graceful termination happens. The fifth section is all about init containers and their detailed workings. The sixth segment is a comparison between init containers, the startup probe, and the postStart hook. So let's get started.

In this section, we'll be discussing processes and init systems. We'll also be talking about zombies and orphans and how to deal with them inside containers. Just to give you an overview of what UNIX processes are: a UNIX process is basically an instance of a running application, and processes are ordered in the form of a tree. Each process can spawn several child processes. We can see on the left-hand side that there's a process at the top, which is called the init process, or the PID 1 process. It is a process started by the kernel at boot time, and it takes care of spinning up the rest of the system processes. We can see that PID 1, the init process, is the main process. It has two children: PID 2 running sshd and PID 3 running nginx. PID 2 has in turn created another process, PID 4, which is running bash.

Let's have a look at zombies. So what are zombies? Suppose a process, PID 4 in our case, has terminated. Once the process has terminated, it is referred to as a defunct, or zombie, process. What a zombie process basically means is a process which has terminated but has not been waited for by its parent. Now, what does "waited for" mean? It means that the parent actually waits for the child to return its exit code, or exit status, so that the resources the zombie entry is holding can be released. This process is often called reaping. Basically, the parent triggers the waitpid system call here. The flow is that the child process, when terminating, generates a SIGCHLD signal; the SIGCHLD signal is sent to the parent process, and the parent then calls the waitpid system call. Once this waitpid system call is triggered, the reaping happens.
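To make this concrete, here is a minimal sketch of my own (not from the slides) showing how a zombie appears when the parent has not yet called waitpid, and how reaping makes it disappear:

```python
import os
import subprocess
import time

# Fork a child that exits immediately.
pid = os.fork()
if pid == 0:
    os._exit(0)  # child terminates right away

time.sleep(1)  # parent has not called waitpid yet, so the child is now a zombie
out = subprocess.run(["ps", "-o", "pid,stat,comm", "-p", str(pid)],
                     capture_output=True, text=True)
print(out.stdout)  # the STAT column shows Z (defunct)

os.waitpid(pid, 0)  # parent reaps the child; the zombie entry is released
```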
So, in a nutshell, zombies are the processes that have terminated but have not yet been waited for by their parent processes. Now, what happens if a process loses its parent? Consider, in our case, PID 4, which is a child process of PID 2, but PID 2 has somehow been terminated. From now onwards, PID 4 will be called an orphan process, because it doesn't have any parent. In Unix systems, PID 1 is responsible for reparenting such children to itself. So PID 4 will now become a child of PID 1, because PID 1 will reparent PID 4. With this, we wrap up our overview of zombies and orphans.

Now let's see how zombies are harmful. For each zombie process there's an entry in the process table, and zombie processes keep holding kernel resources, though in a minimal fashion. If the number of zombie processes is high, then the creation of new processes may not be possible, because there may be resource starvation. Having zombies inside containers also poses some challenges. Generally, one main application process runs per container and it is treated as PID 1: whatever we specify in the entrypoint of the container is treated as PID 1. Now, say we have coded an application to solve a specific purpose. It is not meant to fulfill init-system functionalities. What if there are many zombies getting created? Our process will not be able to reap them. Also, if we are using some third-party managed Docker containers, we are not sure whether their main process actually behaves as an init process, whether it has the functionality of an init system or not. This also poses some challenges. So there comes a need for having a proper init system in our containers. Now, sometimes people use bash, but the thing is that bash is able to perform some of the reaping functions but is not able to handle signals properly: it does not pass the signals it receives from the operating system on to the child processes.

Now let's talk about some sophisticated init systems. Upstart and systemd are two options, but these are heavyweight systems. Then we have tini and dumb-init, which are lightweight systems. We'll be looking at tini in this talk. tini is an init system which is open source, suitable for Docker containers, and also suitable for production environments. It's simple and lightweight. It also reaps zombies and does the signal forwarding properly, and adding or removing tini doesn't have any negative impact. So let's not wait, and get started with setting up tini.

Setting up tini is pretty straightforward. On the left-hand side you see a code snippet: a Dockerfile. The first four lines are the contents of a Dockerfile which contains the command to run a Python program inside the container. Lines 7 to 13 are the ones relevant for us, so we'll focus on them. We just need to specify a version of tini and the remote URL from where we can fetch the release binary. Then we just set the permissions of this binary, and then we provide the entrypoint. The entrypoint can be provided as tini followed by two hyphens. Then we just need to supply a regular command like we did before.
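As a reference, here is a minimal sketch of such a Dockerfile, following the pattern from tini's own documentation; the base image, script name, and tini version here are placeholder assumptions of mine:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY defunct.py .

# Fetch a pinned tini release and make the binary executable
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

# tini runs as PID 1; everything after "--" is the regular command
ENTRYPOINT ["/tini", "--"]
CMD ["python", "defunct.py"]
```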
Now, this defunct.py, the program that I want to run inside the container, has its contents on the right-hand side. This program actually shows how zombies are created and how orphans are reaped, and we'll have a quick look at it in the demo.

So let's look at the configuration of the pod resources for both cases. The first one, the tini-disabled YAML, is for the version which doesn't contain tini; the second one is for the version which contains tini. The same simple Python script that I showed will be running in both of them. The first one should show defunct processes, and the second one should not show defunct processes, because the parent should be able to reap them. Let me run both of them. So we see at the top that the two containers are running, and since these are short-lived containers, they will just be going into CrashLoopBackOff and then restarting themselves. Let me print the logs for both of them so you will see the difference. The first one actually shows the process as a defunct process, and it is also labeling PID 8 as a zombie process. The second one also shows that PID 8 would become a zombie, but it actually reaps the zombie process, so we are not getting any defunct label like we had in the first case. So using tini actually reaps the zombie processes. This is a simple example which shows the same.

So let's discuss some more details on tini. Before moving forward, this is the script which I used for creating the zombies; you can have a look here. Okay, some more details. tini needs to run as PID 1 in order to reap zombies, and if it is not able to run as PID 1, it also has a provision to be run as a subreaper. So what is a subreaper? A subreaper is any process which is not running as PID 1 but can still perform the reaping function. tini can do it pretty simply: we just need to pass another argument, which is -s. So it looks like tini -s followed by the double hyphen. And the special thing about tini is that it exits with the child's exit code, and remapping is also possible: if you want to remap some exit code to something else, you can do that with tini. This wraps up the section on zombies, orphans, and init systems.

Let's move on with the managed lifecycle. Till now we have understood how processes work in containers and how to manage them with init systems. Now we'll be discussing the lifecycle of containers and pods. There are five distinct phases of a pod's lifecycle: Pending, Running, Unknown, Succeeded, and Failed. The Pending state means that the pod has been created through the API and is waiting for some node to get scheduled on. The Running state means that the pod is operational and is running fine. The Unknown state is a rare occurrence, and it means that the Kubernetes cluster has some internal problem due to which it is unable to communicate with the pod. Then come the last two states, which are often seen when we use cron jobs. One is Succeeded, which means that the pod has finished normally. The other one is Failed, which means the pod has crashed. Kubernetes also watches the state of all the containers, and containers have three states in their lifecycle.
The first one is Waiting, the second one is Running, and the third one is Terminated. The Waiting state means that the container is performing some operations which are required before startup; it may be pulling images or applying secrets. Running means that it is running without any issues, and Terminated means it has either suffered some failure or succeeded. To know the exact reason for the termination of a container, one may use the kubectl describe pod command.

Now let's talk about something really important: the graceful termination process. When talking about an application's performance and behavior, one thing to consider is whether the application handles the termination process gracefully or not. Handling the termination process gracefully means doing whatever cleanup is required. Before terminating, there may be a need to clean up files, clean up resources, release some connections, or make transactional commits, and if this is not done, it may impact the performance of the application, and it may impact the users as well. Forced terminations are a big threat. Forced terminations can lead to degraded performance, or sometimes even an outage, because the application may have ended up in an improper state due to the forceful termination.

Now let's talk about the two important Linux signals that form part of the pod termination process: SIGTERM and SIGKILL. SIGTERM can be considered a gentle poke to the container to cause termination of its processes. It doesn't cause any immediate termination, and the signal can be handled or even ignored. On the other hand, SIGKILL is a hard kill. It's analogous to the kill -9 command that we use to kill processes. It cannot be handled, and it's like cutting the power to the machine.

Now let's talk about the termination lifecycle. First, the grace period is set; the default is 30 seconds. Here the pod enters the Terminating state and stops getting any sort of traffic. Next is the execution of the preStop hook, if it exists, and we will cover the details of this in a bit. Then comes the SIGTERM signal, which is sent to PID 1 of each container that is in the pod. Here comes the role of the init system and the signal handling that we talked about earlier today: if one is using an init system, then it can be ensured that proper signal forwarding is happening. However, it still depends on the application whether it can handle the signal or not. We'll discuss this, and how to deal with such situations, shortly. Then comes the fourth stage, which is when the grace period ends and SIGKILL is actually issued. Then the API server deletes the pod's API object, and finally the pod terminates.

Let's analyze the termination lifecycle through a time-series graph. Suppose the pod enters the Terminating state here and stops receiving any sort of traffic. The grace period is set, and the preStop hook starts executing. Say the hook finishes executing by this point; then the SIGTERM signal is issued to all the containers in the pod. Once the grace period ends, SIGKILL is issued and the pod forcibly shuts down. This is the complete termination lifecycle.
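Since it is up to the application to act on SIGTERM, here is a minimal sketch of my own (not from the talk's slides) of what handling it gracefully can look like in a simple Python process:

```python
import signal
import sys
import time

def handle_sigterm(signum, frame):
    # Kubernetes sent SIGTERM: finish the cleanup work before the
    # grace period (30 seconds by default) runs out and SIGKILL arrives.
    print("SIGTERM received, cleaning up...")
    # e.g. close connections, flush buffers, commit transactions
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    time.sleep(1)  # the application's real work would happen here
```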
All right, so now is the time to learn about achieving resiliency and high availability through the use of health checks and probes. Let us understand the health check pattern first, and its need. Kubernetes should know the state of the pod so that it can decide whether to send requests to the pod or not. All of this becomes easy if the container exposes some APIs for different kinds of health checks.

Kubernetes containers are self-healing entities. There is a component called the kubelet, which runs on each node and is responsible for bringing up the containers and keeping them running. The kubelet even restarts the containers in case there is any crash. This is done by running generic health checks against the containers, and it is called the process health check. A container's main process can crash for numerous reasons: segfaults can happen, or some unknown bug may be there. In such situations the health checks help a lot.

Let us look at some more problems in detail. What if an application stops working without its main process crashing? It's not weird, and it's common. Deadlocks, memory leaks, infinite loops, thrashing, and many other causes may be there. Applications should be able to handle some of the mentioned problems by using some complex logic of their own. However, there needs to be a more sophisticated, reliable, and easier way to tackle such problems, and other services should not be sending requests to crashed applications. So let's discover some of the effective ways to tackle such problems.

Probes offer solutions to such problems. A probe is basically a diagnostic performed by the kubelet on the containers in a periodic fashion. It helps in achieving resiliency and also helps in better load balancing and routing of traffic, since the pods which are not ready to receive traffic, maybe because their containers are unhealthy, will either have their containers restarted or will have traffic withheld from them. This also ensures timely responses to requests.

Now let's look at some technicalities of probes. Probing is basically possible by calling handlers implemented by the containers, and there are three types of handlers: exec, TCP socket, and HTTP GET. Exec is the one which executes some code and expects exit code zero. The TCP socket check is performed via a TCP connection attempt against the specified port. An HTTP GET request is also possible on a specified IP and port combination along with a path; it expects a response code in the range of 200 to 399. The resultant states can be success, failure, or unknown.

Now, coming to the probes, there are three types of probes that we will be covering: one is the liveness probe, one is the readiness probe, and the other one is the startup probe. Talking about the liveness probe, it helps in identifying whether the container is alive or dead. In case a failure is observed by the liveness probe, the kubelet kills the container and then restarts it. Whether the container will actually restart or not depends on the restart policy of the container, which can be Always, Never, or OnFailure. We will see the implementation of the liveness probe shortly.

Before moving forward, let me give you some tips. The first one is to always define a liveness probe for pods running in production. It is really important. The second is to have the application expose a health check API in the format of, say, /health or something like that. The health check API should not require any kind of authentication, else the probe will always fail; this is a point that should be noted. Then, keep it light on computational resources: don't put much complex logic in the liveness handler, because probe CPU time is part of the container's CPU time quota.
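As a quick, illustrative sketch of the three handler types (the command, port, and path here are placeholder assumptions of mine, not from the slides):

```yaml
# Variant 1: exec handler; runs a command and treats exit code 0 as healthy
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]

# Variant 2: TCP handler; succeeds if a TCP connection to the port opens
# livenessProbe:
#   tcpSocket:
#     port: 5000

# Variant 3: HTTP handler; succeeds on a response code from 200 to 399
# livenessProbe:
#   httpGet:
#     path: /health
#     port: 5000
```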
Before moving forward with the demo, I would first like to cover the concept of the readiness probe. A readiness probe signals whether a container is ready to accept new connections or not. Say that during startup some warm-up procedure is to be followed, and this may take some time; the container can actually delay traffic being sent to the pod using the readiness probe. Another use case can be to stop sending requests to the pod when the container is overloaded. It must be noted that until all of a pod's containers are ready, the pod isn't treated as ready. Unlike the liveness probe, on a readiness probe failure the container isn't killed. It should also be noted that after receiving a SIGTERM signal, even if the readiness check passes, Kubernetes tries not to send new requests to the container.

Now let us understand how to use liveness and readiness probes in Kubernetes using the code. For demonstrating the usefulness of the liveness and readiness probes, I have built a small yet powerful application powered by Flask that will help in understanding the probes easily. On the left-hand side, you see a snippet from the pod resource manifest which shows the usage of the liveness and readiness probes. On the right-hand side, you see a snippet from the Flask application which shows some APIs and routes.

Let us look at the pod manifest first. Here, inside the container spec, you have the regular image, policy, and name. Then you see two new sections, starting at lines 10 and 17 of the snippet: livenessProbe and readinessProbe. Both probes in this example are using the HTTP GET probing mechanism. They have the path set to /health/live and /health/ready respectively. The port is 5000, because 5000 is the port on which the Flask application is running. There are three new terms which you can see: initialDelaySeconds, failureThreshold, and periodSeconds. initialDelaySeconds is basically the time by which we delay the probe. The probe will start two seconds after the start of the container in this case; likewise, the readiness probe will start two seconds after the container starts. Basically, there will be a delay of two seconds, and then the probes will start working. You can have different values for the liveness and readiness probes respectively. Then you have failureThreshold. failureThreshold specifies how many times the probe is allowed to fail; in my case, I have specified two in both. Then periodSeconds basically sets the periodicity, or the frequency at which the probe should hit the application again, which is two seconds in our case.

Coming to the application code: this is just a snippet, and the application code is a bit large. It actually has different routes. So /health/ready is one, and /health/stop-ready and /health/start-ready are there too. /health/ready basically tells whether the pod is ready or not: it just prints something and then returns 200 as the code if the pod is ready, and it prints that the pod is not ready and returns a 502 code if the pod is actually not ready. So I'm just implementing a naive logic here, in which, when I hit /health/stop-ready, it turns the pod_ready variable to one, and when /health/ready finds that pod_ready is not equal to zero, it treats the pod as not ready. Likewise, I set pod_ready back to zero in /health/start-ready.
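Putting those pieces together, here is a minimal sketch of the probe sections as described (the container name and image are placeholder assumptions of mine):

```yaml
containers:
- name: python-health
  image: python-flask-health:latest   # placeholder image name
  livenessProbe:
    httpGet:
      path: /health/live
      port: 5000
    initialDelaySeconds: 2   # wait 2s after container start
    failureThreshold: 2      # allow two consecutive failures
    periodSeconds: 2         # probe every 2s
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 5000
    initialDelaySeconds: 2
    failureThreshold: 2
    periodSeconds: 2
```

And a sketch of the naive Flask logic, reconstructed from the description above rather than copied from the actual demo code:

```python
from flask import Flask

app = Flask(__name__)
pod_ready = 0  # 0 means ready; anything else means not ready

@app.route("/")
def index():
    return "Hello from Python", 200

@app.route("/health/live")
def live():
    return "The pod is live", 200

@app.route("/health/ready")
def ready():
    if pod_ready != 0:
        return "The pod is not ready", 502
    return "The pod is ready", 200

@app.route("/health/stop-ready")
def stop_ready():
    global pod_ready
    pod_ready = 1   # makes the readiness probe start failing
    return "Ready has been stopped", 200

@app.route("/health/start-ready")
def start_ready():
    global pod_ready
    pod_ready = 0   # makes the readiness probe pass again
    return "Ready has been resumed", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```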
And we can play around with this in the demo, so let's have a look at it. For a quick demo, here's the file, the pod manifest file, which contains the livenessProbe and readinessProbe sections as explained in the slides. So let's apply this YAML and see how the pod reacts. You will see the pod coming up. There's only one pod, and python-health is its name. I'll just start tailing the logs. So you see "live" and "ready": these two are the APIs which are getting hit, and these hits are coming from the probes. You'll see live, ready, live, ready; this will keep on continuing. Now let me go to the browser and show you what the app looks like. I haven't started the port forwarding yet, so let me start the forwarding and forward it to 8989. Now that I have started the forwarding, let me hit 8989, and I should see some response: "Hello from Python". Now let me try and hit /health/live here, and it should hopefully tell me the pod is live, right? I'll hit it multiple times. Also, just so that you can see the response code, let me open the network console. It's giving me 200, again 200. Okay, now let's try with ready. So we see that here we are always getting 200 as the response.

Now, let me do one thing: let me stop the readiness. Before hitting enter, just see that here we have 1/1 as the ready state, 1/1 Running here, and here we have 200 as the response for ready. Now let me hit stop, and I get "ready has been stopped". Now let me hit ready, and it's giving me 502, as expected. Look at these purple-colored lines: these are basically the lines from when the readiness probe started failing, and the pod has gone into the 0/1 state. Now let me resume the readiness state for this pod by hitting the API again, so start-ready should do it. And then, if I hit ready again, I'm able to see the pod in the ready state. See, the purple lines have gone; the pod has become ready again. So basically what I did was fail the readiness probe so that traffic doesn't get routed to my pod.

Now we'll see what happens when I fail the liveness probe. I have done this, and it should actually kill the container, not the pod. You see, you were getting the purple lines for /health/live, and now the container must be serving its grace period; shortly we should see a restart happening here. And what we can do simultaneously is look at the describe output of the pod here. So you see "Liveness probe failed" with 502, and you can see the restart count as well, which is now one. So that's all for the demo of the liveness probe and readiness probe. Let's get back to the slides and start with the next section.

Before moving forward with the next section, we need to discuss the startup probe as well. The startup probe is basically a probe which indicates whether the application within the container has started or not. All the other probes are disabled until the startup probe succeeds, and it is mainly used with slow-starting containers. We use a decent failure threshold with the startup probe, approximately, say, 10 or 15, and it is meant to be executed at startup only, unlike the others, which run periodically. It may share the same probing mechanism as the liveness and readiness probes, and in the case of the HTTP GET method, all three of them can even use the same path. But the behavior of the three probes is different.
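A minimal sketch of what such a startup probe can look like; the path, port, and threshold values here are illustrative assumptions:

```yaml
startupProbe:
  httpGet:
    path: /health/live   # may share the liveness path
    port: 5000
  failureThreshold: 15   # generous threshold for a slow-starting app
  periodSeconds: 2       # allows up to ~30 seconds to finish starting
```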
Let's quickly have a look at the lifecycle hooks. Lifecycle hooks are required for managing the container lifecycle in a better manner, since signal handling is not the only thing we need to worry about. There are two lifecycle hooks available: one is postStart and one is preStop. The postStart hook is executed just after the container starts, and it runs in parallel with the main container process. It can be used to implement some warm-up logic, or to signal to an external listener that the application has started, and it can also be used to do some precondition checks. It must be noted that there are no guarantees of the postStart hook running, and also that it makes the container stay in the Waiting state until it has executed fully, keeping the pod in the Pending state. It may also happen that the hook executes fully even before the main process has fully started. And in case of any failure, no retries happen; the container restarts depending on its restart policy.

Now, talking about the preStop hook: it's a call that is sent to the container before it is terminated, and it triggers the graceful termination process. Basically, it is used to execute some graceful termination logic, either outside the application or by hitting some application endpoint which can trigger the graceful termination of the app. In the case of third-party managed containers, this also comes in handy. We at OLX are using the preStop hook heavily; to quote an example, we have a chat server powered by ejabberd, and it uses the preStop hook to clean up the Redis connections and entries on termination of the ejabberd pod.

Let's revisit the termination lifecycle graph and see where the preStop hook fits in. As you can see here, the preStop hook is called immediately as the grace period starts, and it ends before the SIGTERM signal is sent. So it can be used to handle graceful termination effectively, and even for cases where applications don't implicitly support graceful termination, it can be used to get graceful termination done for those applications.

Now let's quickly see how these hooks can be implemented in Kubernetes. Here is the snippet of the code. I'm actually adding a line in both hooks to the index.html of nginx, so I'm using the nginx image. Line 9 of the snippet starts the lifecycle section, and inside that we have postStart and preStop. Then we have a command specified for postStart and preStop respectively. In preStop, I'm additionally shutting down the nginx process after the hook's write executes.
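A minimal sketch of what that lifecycle section can look like; the exact messages, the 20-second sleep, and the nginx shutdown command are assumptions of mine based on the description of the demo:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    lifecycle:
      postStart:
        exec:
          # Runs just after the container starts; the sleep makes the
          # 20-second wait from the demo visible
          command: ["/bin/sh", "-c",
                    "sleep 20 && echo 'postStart executed' >> /usr/share/nginx/html/index.html"]
      preStop:
        exec:
          # Runs as termination begins, before SIGTERM is sent;
          # additionally shuts nginx down gracefully
          command: ["/bin/sh", "-c",
                    "echo 'preStop executed' >> /usr/share/nginx/html/index.html && nginx -s quit"]
```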
So let me apply the configuration now. This will take 20 seconds, as I have specified that in the postStart hook's command. Okay, so it has started running. Let me port-forward it and then try to open it here. So you see the postStart line. Now let me delete the pod, and we'll see the preStop line has started appearing here; the pod is now serving its graceful termination period, and it must also have shut down the nginx process.

All right, now is the time to discuss the init containers. Init containers are specialized containers that run before the application containers run, which also means that they are separate from the application containers and can run on separate images. These may contain some setup which is not present in the main application image. Multiple init containers can run inside a pod, and all must run successfully and sequentially, in the order specified in the manifest. Also, the init containers don't support any of the probes or hooks that we have seen so far. In case an init container encounters any failure, it depends on the pod's restart policy whether the init container is restarted or not; if it is set to Always, then it will always restart. Also, init containers can share the same volumes as the application containers, and altering an init container's code or image leads to a restart of the pod.

Let us now see how to implement init containers in Kubernetes. Here I'm using an nginx container, which is the application container, and a busybox container, which is the init container. Both are using the same emptyDir volume. The init container just modifies the index file of the nginx container. Let's see how it happens.
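A minimal sketch of such a manifest; the pod name, volume name, and mount paths are placeholder assumptions of mine:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: init-writer
    image: busybox
    # Write the page that nginx will later serve
    command: ["sh", "-c", "echo 'hello1 from init container' > /work-dir/index.html"]
    volumeMounts:
    - name: web-content
      mountPath: /work-dir
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: web-content
      mountPath: /usr/share/nginx/html
  volumes:
  - name: web-content
    emptyDir: {}   # shared between the init and application containers
```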
So let me apply the manifest. The pod has started initializing. You can see Init:0/1, which means there is one init container, out of which zero have been initialized. Now the pod is coming to the started state; it's initializing, basically, and then it's running. Let's go to nginx and see what it is displaying. It is displaying "hello1 from init container". This shows that the init container was able to modify the nginx index.html file.

Now that we have seen how the init container works, let's look at the usage of init containers. Init containers can be used for delaying the application container startup, or to perform precondition checks. They can also be used to run utilities or code that is not part of the application container. They can be used to seed data in a database before the application starts. They can even be run to configure things at startup, or to wait for something to become available: maybe a DB, maybe a service that needs to be available before our application starts. They can also perform database schema operations, prepare the schema, and be used to create user accounts. There are several use cases wherein we can use the init containers.

Let's get started with the scheduling and resources of init containers. Init containers and application containers coexist inside a pod. The effective request or limit for a pod's resource depends on what we specify for both the init containers and the application containers. The effective request/limit for a pod's resource is the higher of: the sum of the requests/limits for that resource across all the application containers that are present, and the effective init request/limit for that resource, where the effective init request/limit is the highest request/limit for that resource defined across all the init containers. For example, if two application containers request 200m and 300m of CPU while two init containers request 400m and 100m, the effective pod request is max(200m + 300m, 400m) = 500m. With this, we wrap up this section and move on with the face-off.

This is the comparison between the init container, the startup probe, and the postStart hook. First, let's look at the container each one is tied to. A postStart hook can be used inside the same container, as can a startup probe, but an init container requires a separate container: the application container is one thing, and the init container is another. Then the scope: the scope of the postStart hook is limited to a container, and likewise the scope of the startup probe is limited to a container, while the init container's scope is not restricted to a container but extends to the whole pod. So init containers are bound to the pod, not to some particular application container. Now, regarding the container image: an init container has the freedom to run the same or a separate image, but the postStart hook and the startup probe don't have this privilege; they run on the same image as the application, because they run inside the same container. Then the run guarantees: the postStart hook has no guarantee at all, while the other two must run successfully in order to proceed forward. Then, talking about failure thresholds and restarts: the startup probe can have a threshold specified, and it should be decent in number, a bit higher than what we specify for the liveness and readiness probes. For the postStart hook we don't have any threshold, but restarts happen depending on the pod's restart policy, and for init containers, restarts again happen depending on the pod's restart policy. And in case the postStart hook actually fails, the container is killed and restarts. Usage: the usage is almost similar, but I'll mention the distinct things here. postStart is generally used to signal external listeners that the application is about to start, or it can be used for precondition checks and maybe some warm-up queries. The init containers I have covered separately; they can be used for various initialization purposes. And startup probes are appropriate for slow-starting containers, where we should specify a fairly large failure threshold. Last is the count. Init containers can be multiple in number, so we can have, say, ten init containers depending on our needs, but there can be only one postStart hook, and we can choose between the mechanisms that are available for it. Likewise, there can be only one startup probe, and we can choose between the different probing mechanisms.

Thanks a lot for joining this talk. Now it's time for Q&A; we're running a bit late, with almost two minutes or so left. You can join me in Slack on this channel, 2Qcon101. Hope you enjoyed the conference, and hope you enjoyed my talk as well. See you in the Slack channel now.