Hey, Fabiano here, a Brazilian software engineer who works for Intel and happens to be one of the five elected members of the Kata Containers Architecture Committee.

Hello, I'm Francesco Giudici, I'm a software engineer at Red Hat. I've been working on Kata Containers since 2020.

Cool, let's start then. We're here today to tell you all a story that started somewhere around last winter. It was a cold and dark afternoon, and we were trying to map which kubectl commands would behave differently depending on the runtime you use. And oh, we found a few issues. The majority of them are fixed now, don't worry. But there was one specific question that haunted us for quite a long time: what is the output of kubectl top? Well, let me show you that. But first: kubectl top is a command that allows the admin to see the resource consumption of nodes or pods. In our specific case, we are interested in comparing whether vanilla containers and Kata Containers return exactly the same results to the admin.

Now, follow me. As you can see here, we have a simple pod definition that we are going to deploy using the default container runtime, which is runc. And we are also going to deploy a slightly modified version of that, but this time using Kata Containers as its runtime, as you can see by the runtime class name set there to kata. Once those are up and running, we can compare the output of kubectl top pod, which is the same, as expected. It's not often in life that we try something and it simply works. Let's just enjoy this moment and call it a short presentation, unless someone has a question to ask.

Does it work on OpenShift? Does it work on OpenShift? Come on. Okay, everyone, it's not time to go yet. You all have to stay for a little bit longer, but not that much, though, as there is absolutely no reason for the results to be different on OpenShift. Let me just grab a cluster and show that to you. Same drill as before, but now on OpenShift. Okay, we have exactly the same pod, again being deployed with runc as its container runtime, and we're going to do the same with the Kata Containers pod. As you can see here, exactly the same, but with the runtime class name set to kata. We just deployed it, and it's going to take some time, but by the moment the pods are up and running we can just call oc adm top pods, and as you can see, the results are exactly the... wait, wait, wait, wait.

Okay, we will have to stay for a little bit longer. I'm wondering here what could be causing such a difference in the output. A few things are crossing my mind, the first and most obvious one being the container engine used. On the plain Kubernetes cluster I'm using containerd, but I have done some tweaks and fixes on CRI-O, and I got CRI-O to show exactly the same output as containerd on a plain Kubernetes cluster. With the same tweaks and fixes applied on the OpenShift cluster, the results are still different, and they differ by a lot. The only thing I can think of in order to understand this better is following the data. Francesco, hey, hey, could you lend us your brain a little bit here? Do you happen to have some idea where the data is coming from?

Weird. Okay, give me a few minutes, let me check. Hey Fabiano. So: kubectl top gets the data from the metrics API, which is exposed by the metrics server.
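Backing up for a second: here is a minimal sketch of what the runc-versus-Kata comparison looks like on the command line, assuming illustrative pod names and image (not the exact manifests used on stage):

    # Deploy the same workload twice: once on the default runtime (runc),
    # once with the Kata runtime class. Names and image are illustrative.
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-runc
    spec:
      containers:
      - name: demo
        image: registry.fedoraproject.org/fedora:35
        command: ["sleep", "3600"]
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-kata
    spec:
      runtimeClassName: kata            # the only difference
      containers:
      - name: demo
        image: registry.fedoraproject.org/fedora:35
        command: ["sleep", "3600"]
    EOF

    # Once both pods are Running, compare what the admin sees:
    kubectl top pod
    # ...and the OpenShift equivalent:
    # oc adm top pods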
The metrics server collects pod and container data from all the nodes in the cluster, querying the kubelet summary API that is exposed by each node. Let's see how pod and container stats are retrieved on each node. The kubelet summary API is fed by cAdvisor. cAdvisor is a piece of software that collects a lot of data from the node it runs on; what we are interested in here is that it retrieves the pod and container data. It is able to collect this data by directly inspecting the cgroupfs on the host. cAdvisor is not a standalone process: it is embedded in the kubelet code that runs on each node. To be completely correct, we have to add that cAdvisor talks to the container engine; a container-engine-specific plug-in is needed in cAdvisor to allow proper retrieval of the pod and container statistics. The container engine tells cAdvisor the container IDs that are needed to build the correct cgroupfs path, from which it will extract the container's resource usage.

Then we have a problem, Francesco. Let's take a quick look at Kata Containers, and let's do this without going through all the architecture details of the project, because this has already been nicely covered in previous editions of this very same conference. What I'm inviting everyone here to do is a simple brainstorm about the problem, taking a quick look at the anatomy of a container creation request and then putting it together with the information that Francesco just gave us. When a request to create a container hits the kubelet, the container engine runs an instance of the Kata Containers runtime, which is also referred to as the Kata shim, or simply the shim. The shim then launches the configured hypervisor, like Cloud Hypervisor or QEMU, for instance, and the hypervisor creates and starts a virtual machine. Inside this virtual machine, the Kata Containers agent is started as part of the boot process, and it will be alive and responsible for the container for its whole lifetime. Let me be crystal clear here: the container is running inside the virtual machine. This means that if cAdvisor is looking at the Kata Containers cgroups on the host, it will only be able to get information about the shim and the processes started by the shim, not about the container itself, and this explains the huge values we are seeing. However, Francesco, on plain Kubernetes we were getting the pod stats. What was going on there?

Yeah, there is something more, Fabiano. The Kubernetes CRI, the Container Runtime Interface, which sits between the kubelet and the container engine, allows passing container-level data directly from the container engine. In this case, cAdvisor will still be in charge of collecting the pod-level data. This is exactly what is happening with containerd, but let's elaborate a bit more on how the container stats are collected via the CRI. As we have just said, the kubelet retrieves the container statistics from the container engine through the CRI. The container engine gets the data from the Kata shim, which in turn retrieves the data from the guest OS. Let's recap where the data comes from. kubectl top gets the data from the metrics server. The metrics server collects pod and container statistics from each node, querying the summary API. On each node, the kubelet is responsible for retrieving the pod and container stats from cAdvisor.
Container stats may also be retrieved directly from the container engine through the CRI. In this case, the container engine is able to get the container statistics directly from the Kata shim.

Okay, Francesco, I think I get it now. So, on the plain Kubernetes cluster the container stats are provided via the CRI, while on the OpenShift cluster the stats are provided by cAdvisor. Okay, but that still leads me to another question: is there any specific reason why OpenShift decided to use the stats from cAdvisor?

Yes, it is because of the performance cost of the double collection of similar data, from both the container engine and cAdvisor. Both pod-level statistics and container-level statistics require inspecting the host to perform the data collection. Doing this twice has a performance impact that we were able to measure, and that we don't want to pay. What would be great is to be able to collect all the metrics in an efficient way. For this, we basically need one single collector of container and pod metrics that is also able to collect data from inside the VMs. So the idea is to have all of that passed by the container engine through the CRI, extending what we already have there: not only passing the container statistics, but adding the pod ones as well. This is exactly what Peter Hunt did in his Kubernetes Enhancement Proposal. You know what, Francesco? It would be really, really nice if you could find someone from the Kubernetes side who could actually give us an overview of Peter's KEP.

Actually, I can help with that. Hi, everyone. My name is Peter Hunt and I maintain CRI-O. I've been part of the upstream effort to make pod and container stats make more sense with the CRI. In short, we saw a couple of problems with the current stats structure. The CRI satisfied some pieces, but those pieces were not complete. cAdvisor filled in the gaps, but we didn't want to just use cAdvisor, as it's monolithic and hard to maintain; plus, the point of the CRI was to abstract pod and container operations out into clear methods. Further, it was not clear which values were coming from the CRI and which from cAdvisor, and because both the CRI and cAdvisor returned some stats, some expensive operations were duplicated, causing performance issues. To fix this, we're updating the CRI now with more stats. The KEP, or Kubernetes Enhancement Proposal, is called "pod and container stats from CRI", and the intention is just that: all stats in Kubernetes related to pods and containers should come from the CRI instead of a mixture of both. This includes the kubelet stats summary endpoint, which the scheduler uses to make informed decisions about which node to put a pod on, as well as the cAdvisor stats endpoint that the metrics server pulls its information from, and which populates Prometheus. The KEP is currently at alpha stage as of Kubernetes 1.22, and we hope to move it to beta in 1.23 and to graduate it in 1.24. That means that in less than a year, hopefully, all clusters will get their stats for pods and containers exclusively from the CRI. No more hacky workarounds, only direct information from the runtime through the CRI implementation.

Peter, what a nice surprise to see you around in this talk. Thanks for the work done upstream; we're looking forward to the graduation of this KEP. It's going to help Kata Containers immensely. And while we still have your attention: Peter is sharing a talk later today with Sasha at 2:30 PM UTC. Make sure to pay them a visit.
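If you want to poke at the different stats paths described above, here is a rough sketch; the node placeholder is to be filled in, and the feature gate name is taken from the KEP, so double-check it against your Kubernetes version:

    # 1) The kubelet summary API, queried per node by the metrics server
    #    (replace <node> with a name from `kubectl get nodes`):
    kubectl get --raw "/api/v1/nodes/<node>/proxy/stats/summary" | head

    # 2) The CRI path: on a node, ask the container engine directly.
    crictl stats     # container-level stats over the CRI
    crictl statsp    # pod-sandbox-level stats (needs a recent crictl)

    # 3) To opt a kubelet into the CRI-only stats path while it is alpha,
    #    merge this fragment into its KubeletConfiguration and restart
    #    the kubelet (gate name from the KEP; may vary by version):
    #      featureGates:
    #        PodAndContainerStatsFromCRI: true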
Francesco, now back to you. Thanks, indeed. With everything said, are pod and container stats enough for Kata? I think so. We have now reached feature parity. It was not easy, since the containers are hidden inside the Kata guest VM, but we are able to provide the same metrics that are provided for vanilla containers. What do you think, Fabiano? I think you are right, Francesco, and I'd say this is a good achievement, right? We have just shown the journey that Fabiano and I started with Kata Containers and metrics: a journey that brought us to the point where we reached feature parity with vanilla containers. We have both pod and container metrics available. The reaction we got from the team, though, was not exactly what we expected: we want more data. Hm. But what else could be provided, then? Shim- and hypervisor-specific stats? Maybe guest and agent stats? I don't know. It turns out that, luckily, someone from the Kata Containers community had already proposed a solution to this upstream: a solution built around Prometheus that introduces a brand-new Kata daemon called kata-monitor.

Francesco, sorry for the interruption, but before you go ahead and present kata-monitor, would you mind saying a few words about Prometheus? I'm not sure I am that familiar with the Prometheus project. Yep, sure, Fabiano. So, okay, we will see both Prometheus and kata-monitor. Let's start with Prometheus first. Prometheus is the de facto software for metrics collection in Kubernetes. It pulls data over HTTP from endpoints, and it does this at regular intervals. When it collects the metrics from the endpoints, it attaches a timestamp to them, so it ends up with a set of time-series data. It also allows attaching labels, which are key-value pairs, to the metrics. Moreover, it allows querying all the data it has using a special query language called PromQL, the Prometheus query language.

Let's talk now about the new daemon, kata-monitor. kata-monitor will provide metrics about the Kata shim, the Kata agent process, the hypervisor, and the guest OS. It will expose all this information in an endpoint to be scraped by Prometheus. There should be an instance of kata-monitor running on each node, and the binary itself takes just a few arguments, making it pretty simple to use. In order to better understand the Kata metrics, let's review their architecture. As we said, Kata metrics are built around Prometheus and kata-monitor. kata-monitor collects all the Kata metrics and exposes them in an endpoint that Prometheus scrapes at regular intervals. kata-monitor is able to retrieve all the Kata metrics by, first of all, contacting the container engine and getting the list of the running Kata workloads. For each Kata workload, it contacts the associated Kata shim on a dedicated socket, the metrics socket. Through that socket, it can get information about the Kata shim itself, plus all the data the Kata shim is able to access: the shim, in fact, is connected to the Kata agent running in the guest, and so it is also able to get Kata agent, guest OS, hypervisor, and container information. One thing worth mentioning is that kata-monitor starts collecting all this data only when Prometheus scrapes its endpoint, so there is basically no overhead if Prometheus is not scraping kata-monitor at all.

So, we have shown a lot of stuff. It's time to put our hands on it. We have installed an OpenShift cluster with six nodes.
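Before the hands-on part, one concrete example of the scraping model: a minimal, hand-written Prometheus job pointed at a single kata-monitor endpoint. This is a sketch for a standalone Prometheus; the target IP is illustrative and 8090 is kata-monitor's default port:

    # Minimal Prometheus config scraping one kata-monitor instance.
    # In a real cluster you would rely on service discovery (or on a
    # ServiceMonitor, sketched near the end) rather than a static target.
    cat <<EOF > prometheus.yml
    global:
      scrape_interval: 15s               # pull at regular intervals
    scrape_configs:
      - job_name: kata-monitor
        static_configs:
          - targets: ["10.0.0.1:8090"]   # illustrative node IP, default port
    EOF
    prometheus --config.file=prometheus.yml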
Of those six nodes, three are masters while the other three are workers. I have already installed Kata using the OpenShift sandboxed containers operator. I haven't deployed any user workload yet, but I have prepared a YAML file for that. Here we have a very simple deployment: two replicas of a very simple container based on a Fedora 35 image. It just starts a script saying hello and sleeping for one hour. Of course, the runtime class specified here is kata, so we will have Kata workloads. The idea is just to have simple workloads that will allow us to retrieve the Kata metrics. So let's apply this. Okay, all done. Let's see if it is already running. One is still missing... and now we are done. All right.

I have already installed the pieces that are required for the Kata metrics: that is, I have deployed the kata-monitor DaemonSet and the configuration to expose its endpoint to Prometheus. I have deployed them in the openshift-sandboxed-containers-operator namespace. So let me show the most interesting part, which is the DaemonSet, in the openshift-sandboxed-containers-operator namespace. Okay, as you can see, there are three instances, and this is exactly what we expected, because we have three worker nodes, which is where the Kata workloads can run. Okay, let's list the specific instances then, getting the pods in that same namespace. Great: we can see the three instances of kata-monitor on the three worker nodes. Let's check now our deployment instances and where they run. Okay, we have two instances: one is running on worker 2 and one on worker 0. So what I want to do is inspect what's happening inside worker 0 through the kata-monitor instance running there, which will be this one. So let's open a simple shell there, exec-ing into that pod with a bash shell. Okay, here we are.

The first thing I would like to check is whether there is a kata-monitor running: let's check process number one's command line. Okay, we can see it's exactly kata-monitor, with the log level set to debug; I set this because I want the logs to be more verbose. And this one is important: the runtime endpoint, which is basically the socket that allows communication with the container engine. In OpenShift we have CRI-O as the container engine, so the runtime path should be this one. kata-monitor also has a help option where we can see the available flags: we have used a log level, we have used a runtime endpoint, and we have not specified the listening address, so by default it will listen on all addresses on port 8090. So what we can do here is just scrape the local endpoint manually, which is something similar to what Prometheus will do, hitting the metrics endpoint of kata-monitor. Let's see. Yeah, really, a lot of data coming, and yeah, it should be pretty big: many lines. Well, 12,000 lines. Cool, we are getting the data then. This is basically what Prometheus will do regularly: all the data will be collected and brought to Prometheus.

It is now time to connect to the OpenShift console to check the Prometheus metrics there. Let's begin. Here we are. So let's check the workloads: there should be our deployment, yeah, with the two pods. The two pods are running. Let's see. Everything's fine. Okay, you can find the interface to Prometheus under the Observe > Metrics section. Here we have the chance to query the Prometheus data by using the Prometheus query language.
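Before we move on to the console: for reference, the node-side inspection just walked through looks roughly like this; the pod name is illustrative and the socket path is the usual CRI-O one:

    # Find the kata-monitor pod running on the worker we care about.
    oc get pods -n openshift-sandboxed-containers-operator -o wide

    # Open a shell inside it (pod name is illustrative).
    oc exec -n openshift-sandboxed-containers-operator -it kata-monitor-xxxxx -- bash

    # Inside the pod: check what PID 1 was started with.
    tr '\0' ' ' </proc/1/cmdline; echo
    # expected, roughly:
    #   kata-monitor --log-level debug --runtime-endpoint /run/crio/crio.sock

    # Scrape the metrics endpoint by hand, just as Prometheus would
    # (8090 is kata-monitor's default listen port).
    curl -s http://127.0.0.1:8090/metrics | wc -l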
What is handy here in the console is that there is also a nice auto-completion feature if you start typing a metric name. Okay, so let's see if there are Kata metrics. Okay: the fact that auto-completion is working and already showing us some metrics means that Prometheus has scraped the kata-monitor data, so it has the Kata metrics. As you see, there are really a lot. No, okay, these ones are not Kata related anymore. As we've seen, there are shim data, agent metrics, hypervisor ones. Let's look at the guest OS metrics; maybe let's take the memory-related ones. There are a lot. This is a set of metrics: you can discriminate between them by looking at the "item" label. See here, this is the "active" memory info. Let's look for something more manageable, something easy. I don't know... "memfree". So we can filter, using the Prometheus query language, by specifying the label "item" and the value "memfree". Okay, cool. At this point we get just two metrics, which is what we expected, because we have two pods in our deployment. So, two metrics, one for each pod. You see a different instance, which is the kata-monitor instance the metric has been retrieved from, and a different sandbox ID. The value is pretty much the same, as expected: we have the same container running on similar VMs, so the free memory should be the same. Here you have a nice graph that is time-based, because, as we said, Prometheus scrapes kata-monitor at regular intervals and attaches a timestamp to the data, so it is able to show you how the metrics change over time. In this case there are not that many changes, because the pods are doing basically nothing, just sleeping, and they also report basically the same value: this is why we don't see the blue line, it's exactly under the yellow one. So I think this gives you an overview of how many metrics are there and how easily you can retrieve them from the Prometheus interface.

In order to deploy all the required pieces to get Kata metrics on your cluster, you will need a kata-monitor instance deployed on each Kata node, which is easily achieved with a DaemonSet. You will need a Service object that exposes the kata-monitor HTTP endpoint, and also a ServiceMonitor that allows Prometheus to scrape that endpoint. This is what we had to deploy on the OpenShift cluster of our demo (see the sketch below). Well, good news for you: on OpenShift you will not have to do all this stuff yourself. It will be made available in the forthcoming releases of the OpenShift sandboxed containers operator, so you will get it all installed automatically. If you want instructions to install the Kata metrics pieces on your own cluster, you can check the upstream documentation: it will guide you step by step. We'll put the link there.

And with what Francesco has shown you, we are closing our story. When we started, we were not even able to get data from kubectl top at all. On the way, we have learned a lot about metrics, about Kubernetes itself, about the CRI, about Prometheus and, finally, about kata-monitor. And we hope, as developers, that we will be able to deliver you all an easy way to consume the whole amount of data you may need for better observability of your workloads. There is a bunch of other things happening in the Kata Containers ecosystem right now: improvements all over the place on performance, usability, observability and, of course, on the confidential containers aspect, as shown yesterday by Kristof and Jakob.
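For the record, the filter used in the console corresponds to a PromQL query along the lines of kata_guest_meminfo{item="memfree"}; the metric name is our best reading of the demo, so treat it as illustrative. And here is the promised sketch of the three pieces, hand-rolled: names, namespace, image and labels are illustrative rather than the operator's actual manifests, and the ServiceMonitor kind assumes the Prometheus Operator is installed:

    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: DaemonSet                     # one kata-monitor per node
    metadata:
      name: kata-monitor
      namespace: kata-monitoring
    spec:
      selector:
        matchLabels: {app: kata-monitor}
      template:
        metadata:
          labels: {app: kata-monitor}
        spec:
          containers:
          - name: kata-monitor
            image: quay.io/kata-containers/kata-monitor:latest  # illustrative
            args: ["--runtime-endpoint", "/run/crio/crio.sock"]
            ports:
            - {name: metrics, containerPort: 8090}
            volumeMounts:
            - {name: run, mountPath: /run}  # broad mount for simplicity:
          volumes:                          # kata-monitor needs the CRI
          - name: run                       # and shim sockets under /run
            hostPath: {path: /run}
    ---
    apiVersion: v1
    kind: Service                       # exposes the metrics endpoint
    metadata:
      name: kata-monitor
      namespace: kata-monitoring
      labels: {app: kata-monitor}
    spec:
      selector: {app: kata-monitor}
      ports:
      - {name: metrics, port: 8090}
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor                # tells Prometheus to scrape it
    metadata:
      name: kata-monitor
      namespace: kata-monitoring
    spec:
      selector:
        matchLabels: {app: kata-monitor}
      endpoints:
      - port: metrics
    EOF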
We would like to say a big thanks to everyone who attended our talk, and a very special thanks to Peter Hunt and Adele Zaluk for contributing to this talk. You rock. Last but not least: Kata Containers, the upstream project, is available on GitHub, and if you want to reach out to us, you can do it preferably via the Kata Containers Slack, but also on IRC. Join us! We have an unlimited amount of fun and work to do together. Thanks a lot. Take care. Stay safe.