Good morning, everyone. Am I audible? Awesome. Before I start, how many of you have used Kubernetes? How many of you have used Kubernetes in production? Okay, so we all share the same thing. Today I'll talk about pods. As we all know, the pod is one of the basic building blocks of Kubernetes. Just to introduce myself, my name is Chetan Shivshankar, and I work as a Kubernetes tech lead at Percona. I have around 14 years of experience, spanning build and release engineering, SRE, and DevOps. Right now I'm working at Percona on a product called DBaaS. It's based on operators for MySQL, PostgreSQL, and MongoDB, and the plan is to provide database-as-a-service for those three database variants. By the way, this is a working custom resource — as long as you have the CRD installed, it will work. When I was a kid, I used to have a magnifying glass, and I used to burn small pieces of paper by focusing the sunlight on them. Today, I hope to do something different and better. In this talk, we'll look at what a pod is in the context of containerd — nowadays there are so many runtime projects, so we'll keep it simple by going through just one of them. We will try to magnify from the high level of a Kubernetes cluster all the way down to the basic building blocks of pods and containers. How it all started: when I wanted to explore pods, the first thing I did, like every software engineer, was search Google for "what is a pod" and "what are the internals of a pod." I didn't get any satisfactory results, so then I went to github.com and searched for pod internals, but all I ended up with was the pod spec, which, to be honest, didn't give much information. That's why I started to learn and explore more about pods myself.
I thought I'd put up a presentation that gives a high-level overview rather than making you collate lots of separate material. So what is a Kubernetes cluster, in a very simplified way? You have a control plane, and you have a set of worker nodes which are constantly communicating with the control plane. On each worker node you have the kubelet, which is nothing but a simple process that constantly communicates with the API server. The kubelet ensures that a pod runs on whichever node it was scheduled to. So let's take the magnifying glass and see what actually happens on a worker node. A worker node could be a physical machine or a virtual machine — basically you have an operating system, you have the kubelet process, and you have a bunch of other processes running on that operating system, and the container runtime is one of them. In this case we'll take containerd as the example. The container runtime implements the Container Runtime Interface (CRI), which standardizes the communication between the kubelet and the runtime and makes the runtime pluggable: you can run containerd, or you can replace it with something like CRI-O. That's the advantage. This container runtime manages the containers, or essentially the pod — I'll come to exactly what a pod is, but for now, let's assume the container runtime manages containers. So the next question is: what is a container? A container is a mechanism to ship your application along with all its dependencies in one small shippable unit. With that, you don't need to worry about packaging all the dependencies separately. All you have to do is get the container and run it, and it will work as you expect.
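As a sketch of what that pluggability looks like in practice, here is an illustrative fragment of a containerd config.toml. The plugin and runtime names follow containerd 1.x conventions, but treat the exact keys and values as assumptions for illustration, not a drop-in configuration:

```toml
# Illustrative containerd configuration: the CRI plugin is what the
# kubelet talks to over gRPC, and runc is registered as the low-level
# OCI runtime that actually creates the containers.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"   # the pause container image

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
```

Swapping the runtime here is exactly the pluggability the CRI gives you — the kubelet side of the conversation does not change.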
But on a machine, a container is nothing but a simple process. It's just a process, like the kubelet or the container runtime. What gives a container its unique qualities is its own cgroups, namespaces, and file system. We'll talk about what those are, but this is the high-level overview of what a container is. So let's zoom into the container. When you want to use a virtual machine or a physical machine, the first thing you always need is resources — you say "I just need a couple of vCPUs" or "I need 10 vCPUs." You need a mechanism for your machine to get some resources. The second thing: assume somebody on the same machine is doing something, and all of a sudden your SSH session breaks while you're debugging a production cluster, just because whatever they did is affecting you. That's not ideal, right? You should have some sort of isolation — what you're doing in your container should be totally independent of what happens in another container. The other case could be: you're working on something, and all of a sudden your machine just stops working because somebody else did an rm -rf as root. Okay, you did it, but why is it affecting me? You should have proper isolation, right? That's where containers come in: you have a way to allocate and restrict system resources to containers, and that is managed through cgroups; you have a way to isolate resources from one container to another, and that's possible with namespaces; and each container has its own file system, so even if somebody does an rm -rf as root, it affects their container, not yours. You need a mechanism where they're completely separate. So let's talk about control groups.
So cgroups (control groups) are basically a Linux kernel feature managed through a pseudo file system. In general, on whichever Linux machine you go to, that pseudo file system is mounted at /sys/fs/cgroup, and it's hierarchical in nature. The simplest way to picture the hierarchy is buckets. Assume you have a big bucket with a capacity of 10 liters, and you start pouring the water into smaller buckets below it. No matter what you do, the capacity of the whole set is just 10 liters — even if you pour the full amount or half of it, the smaller buckets cannot exceed 10 liters, because you only have 10 in the top bucket. Similarly, if you come to the second level of buckets and one holds five liters, no matter what you do, you cannot pour more than five liters into the buckets below it. That's what hierarchical means here. cgroups also provide a mechanism to allocate resources: you can specify how many liters a given bucket can hold — place a bigger bucket or a smaller bucket — and you can do that with control groups. You can also say that certain processes should not access specific devices or specific resources; you can control that with cgroups too. Some of the resources that can be managed by cgroups are CPU, memory, block I/O, and process IDs. There are also things like RDMA, which I'll probably never use in my life, but they're managed by control groups as well. The second thing, like I told you, is that every container should have its own file system. How do we manage that? How can we make it possible? Any container image is basically a set of image layers — when you pull a container image, you're actually pulling a set of layers. So when you run a container, the runtime creates a thin read-write layer on top of all the stacked read-only layers.
As you can see, there is a set of four layers in blue. They're all read-only, so they serve as a standard reference. Assume I have an Ubuntu image and I'm running several containers based on it — I don't need to store the Ubuntu image layers four or five times. All I need is a single set of Ubuntu image layers, and the system allocates a thin read-write layer so that each container has its own write layer, and whatever write operations it does are completely exclusive to that container. We also make sure each container gets its own root file system. This is made possible with namespaces, together with storage drivers like overlayfs, Btrfs, and so on. Broadly, the mechanism is either a union/overlay file system or a snapshot-based system; either way, each container ends up having its own file system. How many of you have used BSD jails, Solaris zones, or chroot? I think you'd be very familiar with this: with chroot, you have a mechanism where a process gets its own root file system. Basically what it means is, you see two different colored nodes in the diagram, the white ones and the blue ones. The white ones are the file system as you actually see it on the node itself. The blue ones are a branch forked off the main tree, which is what the container sees. The container sees everything from /a/b, and all its children like c1 and c2, but on the parent node that's actually a path like /a/b/d. Similarly, a container rooted at /a/c is totally different from the container rooted at /a/b. So each has a different file system.
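The read-only-lower / writable-upper idea can be sketched with overlayfs directly. This is a hedged illustration — the mount step needs root and an overlayfs-capable kernel, so it's shown commented out, and all the paths are made up for the example:

```shell
# Build the directories a container runtime would manage for us:
# one read-only lower layer (the image), one writable upper layer
# (the container's thin read-write layer), and a merged view.
mkdir -p /tmp/layers/lower /tmp/layers/upper /tmp/layers/work /tmp/layers/merged
echo "from the image layer" > /tmp/layers/lower/base.txt

# Requires root; merges lower (read-only) with upper (read-write):
# mount -t overlay overlay \
#   -o lowerdir=/tmp/layers/lower,upperdir=/tmp/layers/upper,workdir=/tmp/layers/work \
#   /tmp/layers/merged
# Writes into merged/ land in upper/; lower/ is never modified.
cat /tmp/layers/lower/base.txt
```

This is why many containers can share one set of Ubuntu layers: the lower directories are never written to.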
So no matter what it does, the container in the blue subtree cannot see outside /a/b. It's restricted, so you have proper isolation. Then, what is a namespace? Again, a namespace is a Linux kernel feature. It provides a mechanism to isolate resources per process: I have a couple of processes, and each process needs to have its own view of a set of resources, right? For that, we use namespaces. It's kind of like simulating a virtual machine, where the container or process has its own stuff, and whatever happens outside is not reflected inside the container. One of the best-known ways to create a namespace is the unshare command. The types of namespaces supported are in the list: the mount namespace, which gives each process its own set of mounts; the UTS namespace, which gives it its own hostname; the System V IPC namespace, which makes it possible to have its own semaphores and shared memory segments; the network namespace; the PID namespace, where each process has its own subtree of processes underneath it; the user namespace, which gives each process its own set of user and group IDs; the cgroup namespace; and the time namespace. These are the namespaces supported by pretty much all modern Linux systems. So what actually happens when you create a pod? You have a Kubernetes client — kubectl, or a program of your own — and the first thing it does is connect to the control plane, specifically the API server. The control plane could be anything: a managed one like EKS or GKE, or a self-managed one built with kops or kubeadm.
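Each namespace a process belongs to shows up as a symlink under /proc — a minimal sketch you can run on any Linux host (the unshare line is commented out because it typically needs elevated privileges):

```shell
# List the namespaces of the current shell; each symlink target is an
# inode like "mnt:[4026531840]" -- two processes with the same inode
# are in the same namespace.
ls -l /proc/$$/ns

# Creating a new UTS namespace with its own hostname (needs privileges):
# unshare --uts sh -c 'hostname demo-host; hostname'
```

Those inode numbers are exactly what we'll compare later in the demo to prove which namespaces the pod's containers share.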
What it does is this actually talks to the KubeNet to ensure the pods runs on specific nodes. And then the KubeNet actually in this context talks with container D using CRI so that there's a communication happening between our container D and KubeNet. So the container D uses a process called shim. So shim is a very simple program. So basically the idea is to isolate all the container management stuff or decouple the things from actual container D binary to something independent of it. So the purpose is assume you have like 100 containers and if we're managing everything through one container D binary, right? Like I mean like one process. If there is something wrong and you need to restart container D process or whatever like you need to manage it, you're pretty much affecting all the 100 containers. So you need a mechanism to decouple it. And also there needs to be a process to handle the file descriptors, right? So shim does the job. So the container D actually doesn't create all the C groups and namespaces but it relies on something like a lower level binary like RunC. So RunC is the one which is like the real understated hero which actually kind of creates the container. It's manages the C groups and namespace and all those stuff. So with RunC, you get the containers. So if you're already wondering the talk is about pod, what are we talking so much about containers means? I'm getting there. So what is actually a pod? Pod is basically an abstract of one or more containers. So in general, if you run a pod right, you will have something like, in this case let's assume I'm running a pod with couple of containers, container one and container two. In general, the system will actually attach something like a pass container. So the role of pass container is very simple. All it does is it establishes a namespace and it provides a way for other processes to come and attach its own namespaces. 
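Speaking of runc: what the shim hands to runc is an OCI runtime spec config.json that describes exactly those namespaces and cgroup resources. Here is an illustrative, heavily trimmed fragment — the field names follow the OCI runtime spec, but the values are made up for this example:

```json
{
  "ociVersion": "1.0.2",
  "process": { "args": ["nginx", "-g", "daemon off;"], "cwd": "/" },
  "root": { "path": "rootfs" },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "ipc" },
      { "type": "uts" },
      { "type": "mount" }
    ],
    "resources": { "memory": { "limit": 209715200 } }
  }
}
```

Everything we've discussed so far — namespaces, cgroup limits, the root file system — appears as plain declarative fields that runc turns into kernel calls.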
So you have one mechanism to manage it, rather than each process managing everything itself: an easy interface where the other processes, the containers, come and attach to the pause container's namespaces. The pause process runs forever, and it also reaps zombie processes and so on, so you get additional benefits. When you run a pod with a couple of containers, container one, container two, and the pause container will actually share the same time namespace, and they will also share the same network namespace, the IPC namespace, the UTS namespace, the cgroup namespace, and the user namespace. That's the general case, and you have an option to choose whether you want to share the same PID namespace or keep it isolated. By default, the PID namespace is different between the containers of a pod, but you have the option to make it shared — we'll go through that in the demo. And for sure, each container has its own mount namespace; otherwise everything would share the same mounts, and it would be such a mess. So let's run the demo. For the demo, I'll take a very simple example. I'm running this pod called demo-pod, and all it has is a couple of containers: an Alpine container and an nginx container. In the Alpine container, all I'm doing is printing the message "looping" every five seconds, just to say "I'm alive, I'm alive, I'm alive." Let's ignore all the things which are commented out. By the way, is the screen visible, or should I make the font bigger? It's fine — awesome. So what I'll do is just create the pod. It's running on a specific node, so we'll go into that actual node and see what's happening. I'll log in as root. The first thing, like I told you: a container is nothing but a process.
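The demo pod described here would look roughly like this — a hedged reconstruction of the pod.yaml used on screen (the container names and exact commands are assumptions based on what's visible in the demo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  # Commented out for the first run; enabled later in the demo:
  # shareProcessNamespace: true
  containers:
    - name: demo-alpine
      image: alpine
      command: ["sh", "-c", "while true; do echo looping; sleep 5; done"]
      # resources:
      #   limits:
      #     memory: "200Mi"
    - name: demo-nginx
      image: nginx
      # resources:
      #   limits:
      #     memory: "200Mi"
```

The commented-out parts are the two changes we'll flip on later: sharing the PID namespace and adding memory limits.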
So the first thing: I ran nginx, right? And the nginx process is actually visible right here on the host node. The second thing is not Alpine itself, but the sleep command I'm running in the Alpine image. You can see a very familiar similarity: the parent of the nginx process is 27308, which is the same as the parent of the sleep process, also 27308. So let's see what that process is. PID 27308 is actually the containerd-shim. Like I told you, containerd spawns a process and ensures the containerd-shim takes care of all the required things. The shim spawns the pause container, and then the other two containers, nginx and Alpine, get attached to it. That's why you see three different processes with the same parent. Now let's check what's actually happening with the namespaces. As we know, any process in Linux has its entry in the /proc pseudo file system, so what I'll do is go into /proc/<pid>/ns for each container. You can see here the namespaces associated with these three processes, which together make up the pod. The cgroup namespace is the same for all of them. The IPC namespace is similarly the same for all of them: 854, 854, 854. The mount namespace is different for each one of them: 852 here, 856 here, 858 here — each is using a different mount namespace. And the network namespace is common again: 97, 97, 97. The PID namespace is different, because we left it at the default, which is not shared: this one is 857, and that one is 859. The rest of the namespaces are the same for all the containers: the time namespace, the user namespace, and the UTS namespace.
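The comparison I'm doing on screen boils down to comparing namespace inodes — a minimal sketch you can run against any two processes (here a background sleep stands in for a second container that shares our namespaces):

```shell
# Start a child process; it inherits our namespaces, so its namespace
# inodes match ours -- exactly what "shared namespace" means in /proc.
sleep 5 &
child=$!

# Same inode on both lines means same network namespace:
readlink "/proc/$$/ns/net" "/proc/$child/ns/net"

kill "$child" 2>/dev/null
```

In the pod, the pause container plays the role of the parent here: the nginx and Alpine processes join its namespaces, so their inodes match.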
Now we've got an overview of namespaces; let's see how it's reflected at the cgroup level. Like I told you, the cgroup pseudo file system is at /sys/fs/cgroup, and these are the resource controllers supported — I believe this is cgroup v1. You have things like blkio, cpu, devices, freezer, memory, and then hugetlb, cpuset, perf_event, pids — there are so many. One of the easiest ones to look at is memory, so we'll go into the memory controller. Here you can see there are many slices. By default, Kubernetes always creates the pod under something called the kubepods slice — that's why you see all these kubepods entries. Quality of service in Kubernetes provides three different classes: Guaranteed, Burstable, and BestEffort. For this pod, we have not specified any resources, so it comes under the BestEffort category. If I go into besteffort, I see a couple of pods. So what I can do is kubectl get pod -o json — basically I'm trying to get the UID. This is pod ae911694b…, which is the one that's actually our demo pod. I'll exit this. The way it works is that Kubernetes takes the pod UID and uses it for creating the cgroups. So I'll go into this pod directory, and inside it you have the containers. You see all the memory entries related to the pod, but you also have subdirectories, right? Those are for the containers running within it. Like I told you, it's a hierarchy: you have the bigger bucket, which is the pod, and the smaller buckets, which are the containers. There's also another, easier way to check the cgroup, but I'll show the complicated one.
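With the cgroupfs driver, the path the kubelet builds from the QoS class and pod UID looks roughly like this — a sketch with a made-up UID (the systemd driver uses .slice names instead, so treat the exact layout as an assumption):

```shell
# Hypothetical pod UID, as you'd get from:
#   kubectl get pod demo-pod -o jsonpath='{.metadata.uid}'
POD_UID="ae911694-0000-0000-0000-000000000000"

# BestEffort pod, memory controller, cgroup v1, cgroupfs driver:
echo "/sys/fs/cgroup/memory/kubepods/besteffort/pod${POD_UID}"
```

The container cgroups then sit as subdirectories under that pod directory — the smaller buckets inside the bigger one.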
So with ctr, the containerd client, I'll check what's there for nginx. You see, this is the container name. If I go into its cgroup directory, there are again a lot of entries related to cgroups. One thing that's pretty easy to check is cat memory.limit_in_bytes. As you can see, there's no cap for it — it's a very huge number. By default, if you don't allocate or specify limits, it's effectively "run until the node can't take it anymore." Similarly, you have other directories for the other containers — ideally, this one would be the pause container and this one the Alpine container. So this is how the cgroups are reflected on the host system. Now I'll make a small change in pod.yaml. Before that, I'll show one more thing: I'll exec into the Alpine container. If I do a ps -ef, all the processes I see are just related to the sleep — that's the only thing running here. I can't see nginx or the pause process or anything, because we have different PID namespaces. So now what I'll do is modify this pod.yaml to share the process namespace, and I'll also add memory limits for both the containers. Before that, I'll remove the pod that was there. Okay. Now I'll apply the new pod. It's scheduled on a node, so let's go into that node. Okay, we're in. Again, the easiest way: check the nginx process, get the PID; check the sleep process (because we're running sleep in Alpine, right?), get the PID. And let's see what actually happens now that we share the PID namespace — again, we'll check /proc/<pid>/ns.
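As an aside, that "very huge number" is the cgroup v1 sentinel for "no limit" — roughly the largest signed 64-bit value rounded down to a page boundary. A quick sketch of the arithmetic (the exact value can vary with page size, so treat it as an assumption for a typical 4 KiB-page system):

```shell
# cgroup v1 reports "unlimited" memory as LLONG_MAX rounded down to a
# page boundary; with 4 KiB pages that is 0x7FFFFFFFFFFFF000.
echo $((0x7FFFFFFFFFFFF000))   # 9223372036854771712
```

So when you see a 19-digit number in memory.limit_in_bytes, it simply means no limit was set.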
And just to make it simple, I'll also check PID 1. You can see here that these two containers are now sharing the same PID namespace — because we specified shareProcessNamespace: true in the pod spec, the shared PID namespace is actually reflected on the host. And we also added a memory limit, right — a maximum of 200 MB — so let's see how that works. A simple way: each process has a cgroup entry, so I'll look at /proc/30375/cgroup and hop through it. Where is the memory entry? Yes, I can take this path. Again, if I need to traverse, it's always from the cgroup parent: I go under the memory controller, then into the actual directory. If I do cat memory.limit_in_bytes, I see a different value now — not the huge one. If I want to know what it actually is: this is in bytes, so divide by 1024 twice, and it's 200 MB. If you remember, this was the value we used in the pod specification. We have pod.yaml, right? This was the value — 200 MB, and even for nginx we used 200 MB. This is how it got reflected. And since this is hierarchical, that's the memory value for the container. If I go one level back up and cat memory.limit_in_bytes there and calculate the actual value — divide by 1024 twice — it's 400 MB, because the pod level adds up the limits of both containers, and no matter what happens, the containers cannot exceed that. The other important thing is, we need to see how the namespaces are actually shared. We saw what happened when we exec'd into the Alpine container, right?
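The arithmetic from that walkthrough, as a quick sketch — converting the cgroup values back to mebibytes and checking the pod-level sum:

```shell
# Per-container limit from the pod spec: 200 MiB each, stored in bytes.
container_limit=$((200 * 1024 * 1024))
echo "$container_limit"                     # 209715200 bytes

# The pod-level cgroup (the bigger bucket) is the sum of both containers:
pod_limit=$((2 * container_limit))
echo "$((pod_limit / 1024 / 1024)) MiB"     # 400 MiB
```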
We were only able to see the sleep process — so let's see what happens when we share the namespace: kubectl exec -ti demo-pod -c demo-alpine -- ps -ef. You can see the difference, right? Now this container can also see the pause process, and it can also see the nginx process, because all of the containers in the pod are sharing the same PID namespace — we configured it to be that way. Okay, so that was the demo. Just to conclude: a pod is an abstraction over one or more containers. cgroups, namespaces, and the file system are the building blocks of containers. The containers of a pod will always have different mount namespaces, because that's how each can have its own file system. The containers of a pod may or may not share the PID namespace — it all depends on the configuration you use — and in general, all the rest of the namespaces are shared. And containerd spawns a shim process, which manages the containers using a low-level runtime like runc. That's it from my end. I hope you had a good time, and I'm open to questions if there are any. You can also reach me on LinkedIn or by email — if there are no questions now, I'm always reachable.