At this point though, it is my honor and it gives me great pleasure to introduce you to Derek Carr. He is another core engineer and a key Kubernetes contributor within the Red Hat engineering team, and he's going to walk us through some amazingly interesting content in the form of: what does it mean to be a pod? We're going to drill down into that pod, and we're actually going to visit the pod universe across the cosmos. So at this point, let's turn it over to Derek. Derek, over to you.

Thanks, Burr. Let me share my screen. Hopefully this is coming through. So what I wanted to do today was dive a little deeper into how pods actually work, right? We see a lot of innovation in the developer community wanting to run their apps, and as enterprises look to adopt OpenShift and Kubernetes, sometimes there are challenges in trying to figure out how to get it through your security team, and in understanding what the platform has to provide to support running your apps securely in production. So in today's demonstration, I want to do a very deep dive on basically all the network connections and the individual components that are used to actually bring up your pod.

A little bit about myself. Similar to Clayton and Jessica, I've been working on OpenShift for a very long time. I'm active in the upstream Kubernetes community, presently privileged to serve as a steering committee member, and I help co-chair both the Architecture and Node SIGs. In the past, I ran a resource management working group that helped drive features around things like GPUs and huge pages to allow people to build more exciting workloads to run on the platform. I'm a Distinguished Engineer and a member of our OpenShift architecture team. So I'm excited to talk through pods and Kubernetes today.

So typically things start with the app, right? The developer has an idea. They put some energy in, write up a little container, and they want to deploy it. The next question typically is: well, where do I run this thing? And you know, at Red Hat we're big believers in the hybrid cloud, but at the end of the day, when you choose to run in a hybrid cloud, you need an operations team, right? So you have to pair that developer with your operations team to provide a place to run your app. And because we're here talking about OpenShift today, we're going to assume that your operations team has created a cluster for you in some region of the hybrid cloud galaxy, and we're going to name that cluster "home" for your application.

And everyone dreams that their application is going to be super popular. As a result, you need a lot of ingress to drive traffic into your cluster so the world can access it. Within that Kubernetes cluster, that ingress connects to services that ultimately know how to route traffic from the outside world onto your nodes and into your pods. So your nodes end up looking like this, but as an end user, you can't see that. At the end of the day, those nodes are just executing a bunch of pods. We like to think of pods as these compartmentalized, self-contained units, and we depend on things like the kubelet and the Linux host to provide the right primitives to isolate one application from another, so you can drive density while maintaining performance for your workloads. But a lot of times, that's where the discussion ends, and we just take it for granted that the system works.
But when working with your operations team, it's often important for them to actually understand the guarantees and the security boundaries that come with running pods on the platform. So we're going to take a step back here, Burr, and ask: how do pods actually work?

What we're going to do in today's talk is run through a simple demonstration. A lot of people, when they interface with Kubernetes, type kubectl help, see kubectl run, and want to run a simple pod. In this case, we're going to run a busybox container and get a shell into that container, and we're going to talk through how that works. What I'll do here is kubectl run, and we'll start my pod. Right now I have an OpenShift cluster deployed in Amazon, and assuming everything goes well, I will have a pod really quickly. When I'm in this pod, everybody's used to the familiar containerization demos: you can type things like ps and see that I'm the only process running in this pod, and you can type things like df and see the files that are available to me. We're depending on the platform to provide the right isolation between the host resources and the resources that are visible to the container. So we like to ask: well, how does that work?

What I want to do, taking a step back here, is talk from the end user experience, in this case that kubectl end user client, about what happened before I got that pod. The first key network flow to understand is the client-to-control-plane communication. A default OpenShift distribution of Kubernetes puts a public load balancer in front of a set of machines that we label as running your control plane processes. That control plane is running a key process called the kube-apiserver, and it's listening on a port, serving over TLS, traditionally 6443. Your client ultimately connects to one of those API server instances through that load balancer. The client typically connects by providing a set of authentication credentials, so a CA, a cert, and a key, that identify who that client is and what rights they have against that API server. Now, there are a lot of configuration knobs that we expose within OpenShift that let you configure things like how TLS is secured, how encryption occurs, and basically control what type of certs you can deploy in the platform. But at the end of the day, that client is interacting with that API server to declare the state that it wants. So in today's demonstration, when I ran that busybox container, I said: I want a pod, run it for me, please. I don't care where you run it, but once it's available, give me a shell into that pod.

Now, parallel to the API server, which is what the end user interacts with, we have a set of worker machines that connect back to those control plane hosts to figure out what they should do. In a typical OpenShift deployment, you have a set of workers running RHEL CoreOS, which is our immutable operating system optimized for Kubernetes. Each RHEL CoreOS instance runs a kubelet process that acts as a client to the API server to ask: which pods should I run? That communication between the kubelet and the API server goes through a different load balancer, because it's internal traffic to the cluster; in this case, that's called out as our API internal load balancer.
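A minimal sketch of the demo flow just described, assuming the stock busybox image (the pod name and exact flags here are illustrative, not the precise invocation from the talk):

```bash
# Start a busybox pod and attach an interactive shell to it.
kubectl run busybox --image=busybox --restart=Never -it -- sh

# Inside the container, only the pod's own view is visible:
#   ps    -> just the container's processes
#   df    -> just the filesystems projected into the container

# The client-to-control-plane flow: the kubeconfig points at the public
# load balancer in front of the kube-apiserver (traditionally port 6443),
# along with the CA, cert, and key used to authenticate the client.
kubectl config view --minify
```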
And then the kubelet on that same host interacts with the container runtime, and the relationship between the kubelet and the container runtime is what ultimately controls getting your pod executing. On the right hand side here, you can see there's some host resource state that the kubelet sets up to manage the isolated view the pod has of the host. There's some state stored, which I'll show, within /var/lib/kubelet, associated with every pod running, on the literal host file system. The kubelet also interacts with a set of cgroup controllers that control the resources each pod gets at runtime. Today in Kubernetes and OpenShift, we support controlling things like access to CPU, being able to pin your workload to a particular CPU, and how much memory your workload can get; if your workload requires huge pages, we can control access to those large memory pages, as well as provide protection against things like PID exhaustion.

The cgroup hierarchy that pods get placed in is important. Every pod is associated with a quality of service class based on how it consumes CPU and memory, and based on the knobs the user requests, you get classified into a Guaranteed workload class, a BestEffort workload class, or a Burstable workload class. In today's demonstration, when we ran that kubectl run for busybox, I didn't specify any resources to apply to my pod, so the platform is just going to give me best-effort access to resources, and I'll talk a little bit about what that means afterwards.

At steady state, the kubelet is constantly looking at the API server and asking: what should I do? What should I do? And right now we have a node with no pods; nothing's actually happening. But that's no fun. Even when there's a node with no pods, a lot of the patterns within Kubernetes are based around the concept of a controller, and the kubelet is really nothing more than the node-local controller, trying to achieve the desired state the API server says it should have. So when a node has no pods, the kubelet keeps asking the container runtime to verify that no pods are actually running. That communication between the kubelet and the container runtime happens over a Unix domain socket. And that's important when we talk about what can happen if a container breakout ever occurred. Within our OpenShift distribution, we do a lot of things to protect access to the root file system. In particular, we always have SELinux enabled, so the actual end user container processes I run here get a different SELinux label than other pods running on that system, which controls what they can actually do with host resources. That's one of the techniques we use to protect access to the cluster as a whole.

So in general, we don't have any pods running right now, and the kubelet is just always asking the container runtime: is the state as I desire? If you were to pull traffic between the two systems, you'd see a constant interaction, about every two seconds: is it as I want it to be? Now, earlier, when I created that pod to say I want to run a busybox container, my client sent that pod definition to an API server.
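As a sketch of how those QoS classes get assigned (the pod name here is made up): a pod whose requests equal its limits for both CPU and memory lands in the Guaranteed class, while a pod with no resources at all, like the busybox demo, is BestEffort.

```bash
# Create a pod whose requests equal its limits -> Guaranteed QoS class.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "500m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "128Mi"
EOF

# Ask the API server which class the pod was assigned:
kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'
# Omitting resources entirely (as in the busybox demo) yields BestEffort.
```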
And it's important that when you get your workloads onto Kubernetes, you can control the actual resources your workloads have access to on the hosts that run them. Within OpenShift, we have a feature called security context constraints, which lets you validate what host-level resources a pod can access prior to admitting it into your cluster. At Red Hat, we pride ourselves on being secure by default, so we actually enforce what we call the restricted security context constraint out of the box, which basically denies access to all host features, and will actually associate your pod to run with a random UID and an SELinux context associated with just your pod's namespace. If your workload requires greater privileges, you can be given greater access to run things like privileged, non-root, or host-network oriented workloads, but that's the exception rather than the norm.

So once your pod is accepted into the API server because it matched a security definition, some scheduling magic happens, right? We're going to defer that for today's discussion and just assume a node has now been found to run your workload. So now the kubelet sees your pod. It sees: I want to run a busybox container; what should I do?

Let's take a step back and look at some of this in action. You saw earlier we had this pod running busybox, and I'm inside my container and can do some actions. What I'll do on the right hand side now is open up a separate terminal and explore what actually occurred on the node. I can run kubectl get pods in this namespace, and you'll see my busybox container is running, the API server is telling me it's running, and it's running on a particular node. So what we're going to do is use a command in the OpenShift oc client to debug that node. What this actually does is create a pod with heightened privileges that lets me understand the host-level state of that system, kind of a backdoor into the cluster. Of course, this is an action I can only run with heightened privileges, and I'm running as a cluster admin here, so that's possible. So I will debug that node, a pod is going to be started on that node, and once it's up and running I can chroot into the host namespace. Now I'm basically a clone of SSHing into that node, but I'm running within a pod boundary, and there are some things we can do to explore the state of this host.

As I talked about earlier, there's a kubelet running on every machine, so I can see it's actually running, and the kubelet is outputting logs about reaching my desired state. Similarly, there's a container runtime running, and in an OpenShift distribution that container runtime is CRI-O. So CRI-O is being interacted with by the kubelet, ensuring my containers are running. Now, if you wanted to look at the state of the system, a lot of users in the early days of containerization and Kubernetes were very familiar with the Docker CLI. There's been an evolution within Kubernetes to support alternative container runtime choices through a plugin system called the Container Runtime Interface. Today, on every Kubernetes node, there's a debugging tool called crictl that lets you interact with your currently configured container runtime, and you can do some interactions on that host to see what's available.
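A rough sketch of the node-debugging steps just demonstrated (the node name is illustrative; take a real one from kubectl get pods -o wide):

```bash
# Create a privileged debug pod on the node and enter the host filesystem.
oc debug node/ip-10-0-1-23.ec2.internal
chroot /host

# Confirm the kubelet and CRI-O are running, and follow the kubelet's logs:
systemctl status kubelet crio
journalctl -u kubelet --since "5 minutes ago"
```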
So in this case, I'll run a command, crictl pods, and I can see there are some pods running on this host. Right here you can see, okay, there's a busybox container running. Now, that busybox container, as I said earlier, is just a normal process, right? There's nothing special about it, and if I use normal tools I can see where it's running. In this case I'll type systemd-cgls, and I can look at the actual cgroup taxonomy on this host to see where my busybox container is. I said earlier it's running as a best-effort pod, and you can see the kubelet has gone and created a pod-level cgroup to house that container.

So how did that actually happen? The kubelet saw on the API server: hey, you need to be running a pod, and that pod's name is A; in our demonstration it was the busybox pod. Now, on the host where the kubelet is running, pods can contain one or more containers, right? You might have things like init containers or normal containers all running concurrently, all bounded by a common pod definition, and that common pod definition gives you a common IP address and a common view of, say, volumes. To provide resource isolation across those containers, the kubelet creates a bounding cgroup for all containers in that pod that controls access to things like CPU and memory. So when you create your first pod, the kubelet sees it and says: oh, I've got to go create a pod cgroup for my workload, and you'll see a new slice created in systemd's taxonomy.

The next thing the kubelet needs to do is get the host resources that might be needed to run your pod projected into your container. There are some resources that appear in every pod by default. The /etc/hosts file presented to that pod, which is part of its name resolution configuration, is actually managed as a file on the root file system that the kubelet projects in. Similarly, any volumes your pod might use to access resources are projected into the container as well. So if we go back to our running host and look at the /var/lib/kubelet directory, you'll see a bunch of subfolders. We'll look under the pods folder, and for every running pod there's a directory that matches that pod's UID. We'll just pick one of these and see what's under it, and what you'll see is the etc-hosts file, which is really nothing more than the name resolution configuration the pod will see: how to go and contact particular services in the cluster. In addition, we can look at the volumes that pod has access to. Now, every pod in Kubernetes typically gets mounted with a secret that lets it phone home back to the Kubernetes API server, and that gets mounted as a volume as well; you can see the secret directory for that here. There are many different types of volumes that pods can consume. If you're using things like secrets, those volumes are actually stored on a tmpfs directory and never written to disk, whereas things like config maps are stored and persisted on that local disk, in the directory we identified earlier.

Once all the volumes have attached and mounted, your container still isn't running; the kubelet needs to go fetch the container image to run your container, and oftentimes enterprise clients like to protect their container registry with a secret.
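A sketch of exploring that per-pod state from the node debug shell (paths are the defaults; the UID selection is illustrative):

```bash
# One directory per pod UID under the kubelet's state directory.
ls /var/lib/kubelet/pods/

# Pick one pod UID and look at what the kubelet projected in.
POD_UID=$(ls /var/lib/kubelet/pods/ | head -1)
ls /var/lib/kubelet/pods/"$POD_UID"/
cat /var/lib/kubelet/pods/"$POD_UID"/etc-hosts   # name resolution the pod sees

# Secret-backed volumes are mounted on tmpfs, never persisted to disk:
mount | grep "$POD_UID" | grep tmpfs
```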
And so in this case, if your workload needs a secret to pull its image, the kubelet will fetch the pull secret associated with that pod and pass it to the container runtime when doing the container image pull. Earlier I wasn't using a protected container image, but if I was, this is basically the flow. The kubelet, by default, is very restricted in which secrets in the cluster it has access to. In an OpenShift distribution, the kubelet can only access secrets for pods that are bound to its node, or secrets that are referenced as those pods' image pull secrets. It can't fetch arbitrary secrets for any arbitrary node or any arbitrary pod in the cluster.

Once the image has been pulled, we tell the container runtime, over the container runtime interface, that we want to create a sandbox. A sandbox is basically telling the container runtime: I want you to provide a way to manage a common set of Linux namespaces and an IP address for all the containers that will be in this pod. If you look at a typical OpenShift host, you'll see that for every pod there's something running that's kind of like a pause container, and that pause container is what's holding these common namespaces and the IP address. One of the cool things that's recently been merged into CRI-O is the ability to eliminate this extra container, so we'll be using fewer resources per pod on every node. That's pretty cool.

Once the sandbox has been created, and you'll see a new item in the cgroup hierarchy, the container runtime will then be told to pull the image, using the secret, if any, from earlier. Oftentimes in OpenShift, to provide an improved security characteristic, we'll recommend that the pod spec be set to always pull the image. What happens is that even if the image is already locally cached on your node, when you pass an image pull policy of Always, we will re-authenticate that that pod has access to the image. So if you're a security-conscious individual and you want to ensure one workload isn't using the image of another workload that it didn't truly have access to, you want to check out that pull policy option and specify Always.

Okay, so now we have our container image down, so we have a file system that defines what we want to appear in our container, but we still don't really have a container yet. The kubelet will go, for every container in that pod spec, and tell the container runtime: please create a container for my workload. There are a number of configuration options that the kubelet passes down to the runtime that control how that container is ultimately containerized. A lot of it is things you might commonly see, like the command you want to run, the environment variables to pass, where logs should be stored, and the amount of resources that container should be given. But then there are some more advanced concepts that many users never have to look at, which control what capabilities get added to and dropped from that container, and what the privileged rights of that container are: basically, what access does it have to resources outside of its container bound? So if we go back and look at the pod we created earlier, we can see some of these things using crictl.
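A minimal sketch of that pull-secret flow in a pod spec (the secret name and registry are made up for illustration):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
  - name: my-registry-secret          # pull secret the kubelet hands to CRI-O
  containers:
  - name: app
    image: registry.example.com/team/app:1.0
    imagePullPolicy: Always           # re-authenticate even if cached locally
EOF
```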
So what I'll do is look at the busybox pod we created earlier, take the ID, and then there's an inspect pod command. Basically, everything that the kubelet is passing down to the container runtime about how to containerize that pod is visible here: which namespaces the busybox container is able to access. As an example, you can see the busybox container is only able to access PIDs isolated to its container, whereas this debugging pod I'm running is able to access PIDs node-wide, and that's how I'm able to do these debugging actions. If we scroll down here and you want to look at the capabilities given to each container, you can inspect these things pretty clearly using crictl: which capabilities have been enabled or disabled, things like the mounts that are projected into that container, where the storage is, what its OOM score is, et cetera. So a lot of times, if your container is not operating as you'd like and you're really getting into the weeds of debugging what's going on, crictl inspect for your container will prove to be an invaluable asset.

All right, so the container gets created, and you basically get that record in the container runtime that says how it should be executed, but it's not actually executing yet. There's a separate call from the kubelet that says to start your container. So now that the container manifest is on disk and says how you should run, the kubelet just says: please run me. Only at that point does your process get launched, but it's now properly confined at the right level of the cgroup hierarchy, with the right IPC and Linux namespaces and the right isolation primitives required by that pod definition. So that's pretty cool.

So what happens when you delete a pod? This sometimes trips people up, so I want to talk through it in a little detail. The kubelet is always watching the API server: what should I do? What should I do? In this case, for the pod that's running, the end user says: I want you to delete this pod. Please delete it now. The kubelet sees that desired state and starts executing what it needs to do to clean up that container's resources on the host. The kubelet sees that the API server's desired state is for that pod to be removed, and it starts what's called graceful termination. For every container associated with that pod, it will go and kill that container and give it a grace period. Typically that's 30 seconds, but you can extend it; it basically says how long the container has to shut down before it's force-killed. So for each container running in that pod, the kubelet asks the runtime to kill each of those containers. Once those containers are killed, the sandbox resource that was holding the IP address and the namespaces for all the containers in that pod can be cleaned up. The kubelet tells the container runtime: okay, I see no more containers are running; please go and clean up your pod sandbox by stopping it. That basically tells the container runtime to release all those resources. After that, we know there's no more pod, no more processes running that were associated with your pod, and the resources have been terminated.
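As a sketch, the inspection and deletion steps just described look roughly like this (run on the node via the debug shell, except for the final client-side delete):

```bash
# Inspect what the kubelet asked CRI-O to set up for the busybox pod.
POD_ID=$(crictl pods --name busybox -q)     # sandbox ID
crictl inspectp "$POD_ID"                   # namespaces, cgroup parent, labels
CTR_ID=$(crictl ps -q --pod "$POD_ID")      # container ID inside that pod
crictl inspect "$CTR_ID"                    # capabilities, mounts, OOM score

# From the client: graceful termination with an explicit grace period.
# The default is 30 seconds; this gives containers 60 seconds before a
# forced kill.
kubectl delete pod busybox --grace-period=60
```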
The kubelet will do a little bit of cleanup of the resources under the /var/lib/kubelet/pods directory I showed you earlier, and the next step is that it will purge that pod's cgroup, so the cgroup hierarchy goes back to a pristine state. Only after the kubelet has detected that all the host-level resources used by that pod have been purged will it do the final delete back to the API server to remove that pod. So sometimes end users, when they're debugging Kubernetes or OpenShift, will see pods stuck in a terminating state. This can mean that the node executing that pod is unhealthy, and the kubelet is unable to guarantee that all the resources associated with that pod have been cleaned up, so that final delete does not occur. This is important for certain workload types: if you're using things like StatefulSets, they want to guarantee sequential, ordered shutdown and startup. For other workload types, like replication controllers, where that doesn't matter, the control plane may just go and launch new pods.

Earlier in the demonstration, we asked: how did the exec work? How did I actually get that shell into my container? What's actually happening? There's a network flow between the control plane and the kubelet that allows clients to ultimately get into their container. The kubelet is actually running a little serving agent on port 10250, and when the API server connects to the kubelet, we validate against the serving cert for that kubelet. The network flow is basically this: the kubectl client I was running talks through a load balancer to the API server; the API server then proxies a connection to the kubelet; and the kubelet proxies the connection through the container runtime, which provides the exec'd shell in there. At the end of the day, you have a network flow like you see here that let me get a shell for the live demonstration we talked through today. Logs are pretty similar: when you want to look at the logs associated with your pod and you type kubectl logs, there's a similar flow where the client goes to the API server, the API server proxies to the kubelet, and the runtime flushes those log responses back.

And with that, I think I'm probably running on time, so I will pause here. Burr, can I get a time check?

We are definitely at the moment when we're switching. Derek, thank you so much for that. That was a great deep-dive presentation for people who are interested in how the kubelet works with pods. I love all that, because people are always asking how these pods really work, and I think at the end of the day people just don't understand that they're just well-managed processes.

Yep. All right. Thank you.

So Derek, thank you so much for that. At this point, unfortunately, we do have to bid you adieu as we switch gears, and we actually have a technical challenge here on Crowdcast, but thank you so much for that, Derek.
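For reference, a minimal sketch of the exec and logs flows described above, using the pod from the demo:

```bash
# Both follow: client -> load balancer -> kube-apiserver -> kubelet (10250)
# -> container runtime, with the API server validating the kubelet's
# serving certificate along the way.
kubectl exec -it busybox -- sh      # proxied exec into the container
kubectl logs busybox                # same path, streaming log output back
```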