Hello, everyone, good evening and good morning. I'm Liubin from Ant Group, and it's my pleasure to give a talk about the new Kata Containers 2.0 and the observability in Kata Containers 2.0. Observability has become a key characteristic of cloud-native infrastructure. Kata Containers 2.0 also takes observability into account, to aid running Kata Containers in enterprise production environments. In this topic, we will introduce how observability is integrated into Kata Containers, especially metrics, the basic and important part of observability.

And here is the major agenda. First, I will give a brief introduction of Kata Containers and what's new in Kata Containers 2.0. Next, we will see what observability is, why we need it now, and the parts of observability. Last, we will check out the detailed metrics implementation in Kata Containers 2.0.

What's Kata Containers? Kata Containers is an open infrastructure project of the OpenStack Foundation. To a first understanding, Kata Containers is a container runtime like runc or Docker. But the new idea of Kata Containers, which makes it different from runc, is that Kata Containers is a VM-based container runtime. Kata Containers provides a way of isolating containers or pods with security comparable to virtual machines, but without the performance overhead of traditional full virtual machines. Kata Containers can run on mainstream CPU architectures including x86, ARM, and PowerPC. Also, Kata Containers supports many VMMs like QEMU, Cloud Hypervisor, and Firecracker. Most important is that Kata Containers supports all the container specs like OCI, CRI, CNI, and CSI, and works perfectly with CRI-O, containerd, and Kubernetes.

This page shows the big difference between VM-based container runtimes and shared-kernel container runtimes like runc. In traditional containers, at the core, containers are just processes running on the same host, and all containers share the same host kernel. Isolation is achieved by kernel features like namespaces, and cgroups are used to do resource control. That's the biggest security threat: if an attacker breaks out of the container, he will be on the host and, in fact, be the root user. To reduce the attack surface, we must use Linux capabilities to restrict privileges and avoid running with full root privileges. Another choice is using seccomp profiles to do fine-grained syscall filtering, blocking, and accounting. User namespaces can be used to map the root user inside the container to an unprivileged user. But Kata Containers can rescue us from these threats: it provides a virtualization isolation layer to help run multi-tenant container deployments in a more secure manner than running containers natively on bare metal.

Here are the features of Kata Containers. Kata Containers runs a dedicated Linux kernel, providing isolation of network, I/O, and memory, and can use hardware-enforced isolation with virtualization (VT) extensions. Kata Containers is compatible like runc, supporting industry standards including the OCI container format and the Kubernetes CRI interface, as well as legacy virtualization technologies. Though Kata Containers uses VMs, it can deliver performance consistent with standard Linux containers, increasing isolation without the performance tax of standard virtual machines. The speed of containers, the security of VMs: this means users should not have to care about performance overhead, and the runtime itself should use minimal resources.
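To make that compatibility concrete, here is a minimal, hedged sketch of creating a container under Kata through containerd's Go client. It assumes containerd and Kata Containers are installed and that the Kata runtime is registered under the shim v2 name io.containerd.kata.v2; the name may differ in your deployment.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	// Connect to the containerd daemon over its default socket.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Pull a small test image.
	image, err := client.Pull(ctx, "docker.io/library/busybox:latest", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	// Create the container, selecting the Kata shim v2 runtime
	// instead of the default runc shim.
	container, err := client.NewContainer(ctx, "kata-demo",
		containerd.WithImage(image),
		containerd.WithNewSnapshot("kata-demo-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
		containerd.WithRuntime("io.containerd.kata.v2", nil),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	// Start the init task; with Kata this boots a lightweight VM.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("container started under the Kata runtime")
}
```

Everything except the WithRuntime option is ordinary containerd usage, which is the point: to the high-level runtime, Kata is just another low-level runtime.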
Why is isolation important? Basically, we can say that containers are processes running in an isolated sandbox, but isolation is not limited to process isolation nowadays. Besides security isolation, performance and failure isolation are becoming more and more important for large and multi-tenant clusters which run diverse workloads. Security isolation can reduce the attack surface against the host system. The host kernel doesn't need to schedule the guest's threads, so there is less performance impact between containers running different workloads or from different users. Failure isolation is also important, because different containers are using different kernels: even if one application triggers a kernel panic, it won't affect other containers running on the same host. This can keep one user's workload safe from another user's failure.

And the most exciting thing is that Kata Containers works perfectly with Kubernetes. Kubernetes supports multiple container runtimes in a single cluster or on one node. For example, you can use runc and Kata Containers on the same node; a sketch of selecting the runtime per pod follows below.

The word runtime sometimes makes people confused. In fact, there are two types of runtimes: high-level runtimes and low-level runtimes. High-level runtimes maintain the entire lifecycle of a container, for example downloading images from an image repo and managing running containers; the real work of starting or stopping a container is done by low-level container runtimes. containerd is the most used high-level container runtime, and another choice is CRI-O. Both support CRI, so both can work with Kubernetes. A low-level container runtime is responsible for creating and deleting containers and implements the container runtime spec defined by OCI. The most used low-level runtime is runc; of course, others include gVisor and Kata Containers.

The Kata agent is the component running inside the pod sandbox, in the guest OS, and it communicates with the Kata shim through the ttRPC protocol over a vsock connection. The Kata agent works like a low-level runtime in the guest OS, responsible for managing the containers' lifecycle, and it is also responsible for VM-based container-specific work, for example device hotplug and updating resources dynamically.

So next, we will go on to the main topic, talking about the new Kata Containers 2.0. One of the biggest changes is a rewrite of the Kata Containers agent, to help reduce the attack surface and reduce overhead. The agent was rewritten in Rust and uses a Rust version of ttRPC to reduce memory usage. The main benefit users will see is that memory usage is reduced from about 11 megabytes to about 300 kilobytes. This release also added support for the Cloud Hypervisor VMM, up to the same level of support as QEMU. Also, in Kata Containers 2.0, vsock has become the default communication protocol, and virtio-fs becomes the default file sharing mechanism, to improve performance.

Also, Kata Containers 2.0 offers significant improvements around observability and manageability. Kata Containers now provides metrics about the runtime itself, the VMM, as well as the guest kernel and the agent in the guest. All the metrics are in Prometheus format. This will help ops or SRE teams understand the infrastructure impact of running Kata Containers, and will help users and developers better understand workload performance. Kata Containers also introduced a new tool called kata-agent-ctl. This tool can make developers happy when debugging the agent API, calling the agent inside the guest OS.
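And here is the per-pod runtime selection mentioned earlier. This is a small, hypothetical sketch using the Kubernetes Go API types: it builds a pod that opts into the Kata runtime through a RuntimeClass. The RuntimeClass name "kata" is an assumption; the real name is whatever the cluster admin registered.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "kata" is a hypothetical RuntimeClass name.
	kata := "kata"

	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "kata-demo"},
		Spec: corev1.PodSpec{
			// RuntimeClassName is how a pod opts into a non-default
			// runtime; pods without it keep using runc on the same node.
			RuntimeClassName: &kata,
			Containers: []corev1.Container{{
				Name:    "app",
				Image:   "docker.io/library/busybox:latest",
				Command: []string{"sleep", "3600"},
			}},
		},
	}

	// Print the manifest; in a real cluster you would submit this
	// object with client-go instead of printing it.
	out, err := json.MarshalIndent(pod, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```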
And it is exciting that Kata Containers 2.0 will be released. Now let's see what observability is. First, we will review the official definition of cloud native before we talk about observability. From this definition, we can see that cloud native suggests some technical approaches like containers, service meshes, immutable infrastructure, and microservices. The second paragraph describes the ideal system attributes for cloud-native technologies: loosely coupled, manageable with automation, and also observable. So we can see that observability is a key attribute of cloud-native technologies.

And here, this page shows the key attributes of cloud-native infrastructure. The first and most important is that cloud-native applications should be packaged as lightweight containers that can scale out rapidly. An elastic infrastructure can achieve availability and scalability, but that is beyond the scope of this topic. Needless to say, one of the best technologies for automating deployment, scaling, and operations of containers is Kubernetes. And now observability is gaining attention in the software world. A system is observable if developers can understand its current state from the outside, which helps engineers deliver excellent customer experiences rapidly and frequently in modern, complex IT environments.

There are a lot of definitions out there of what the word observability means. It might mean different things to different people; for someone it's just about logs, metrics, or traces. The word observability originally comes from engineering and control theory of linear dynamic systems. A generally accepted definition of observability is a measure of how well the internal state of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. In the software world, control is equal to operations or management, and observability means getting visualization and analysis of metrics, events, logs, and traces, giving you the ability to examine and understand the system state, letting you understand how systems are behaving and performing, and when, where, and why something goes wrong. But observability is about much more than the collected data. The goal of observability is not to collect logs, metrics, or traces. Like DevOps, it's not only tools or technology; it also includes processes in an organisation, and it's about building a culture of engineering based on observability and controllability.

There are four pillars of observability. First, metrics represent time-series measurements. Metrics are low-overhead to collect, cheap to store, can be aggregated, and have a dimensional data structure for quick analysis and easy processing. Events are critical but often overlooked data in observability. Events are similar to logs, but events have a higher level of abstraction. There are many types of events: alerts are events, deployments are events, and failed user requests or system errors are all events. Events are valuable because we can use them to confirm that a particular action occurred at a particular time. Logs record the execution in your application, and almost all software systems can emit log data. The most common use case for logs is getting a detailed, play-by-play record of what happened at a particular time. When things go wrong, logs show the errors or exceptions. Structured log data is formatted especially to be parsed by a program, which makes it easier and faster to search and query the data, and to generate events or metrics from the logs (see the sketch below).
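As a tiny illustration of that point, here is a hedged Go sketch using the standard library's log/slog package (Go 1.21+; any structured logging library would do): each record is emitted as one JSON object, so a log pipeline can parse fields directly instead of grepping free-form text.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler makes every log record machine-parsable.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Structured fields instead of free-form text: a log pipeline can
	// turn "latency_ms" into a metric or count "error=true" as events.
	logger.Info("request handled",
		slog.String("method", "GET"),
		slog.String("path", "/metrics"),
		slog.Int("latency_ms", 12),
		slog.Bool("error", false),
	)
}
```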
The last part is tracing. Traces represent a single request through the system. Traces are valuable in a distributed architecture to show the end-to-end latency of RPC calls and the relationships between services. Traces enable engineers to find bottlenecks in specific parts of the execution path, or to identify errors.

Before Kata Containers 2.0, there were a number of limitations related to observability that could be obstacles to running Kata Containers at scale. Kata Containers implements the CRI API and supports two interfaces to expose container metrics, the ContainerStats and the ListContainerStats APIs. Using these interfaces, we can only get basic metrics about containers. The metrics are container-centric and only include CPU time, memory usage, and filesystem usage. It's hard to say that through these metrics we can assess and infer whether a container is healthy or running into trouble.

In Kata Containers 2.0, metrics are first-class citizens, and there are also some strict constraints on the metrics implementation. First, the metrics should not become the bottleneck of the system and downgrade performance; they must run with minimal overhead. And the metrics should not make Kata Containers complex to deploy; we should avoid using a traditional metrics collector or aggregation agent. In Kata Containers 2.0, metrics are collected mainly from the filesystem and then consumed by a Prometheus server, based on a pull model. That means if there is no Prometheus collector running, there will be zero overhead if nobody cares about the metrics.

We chose Prometheus, the industry standard solution for metrics exposure, collection, and aggregation. Prometheus is a graduated project of the CNCF and widely used. Prometheus has a multi-dimensional data model, with time-series data identified by a metric name and optional key-value pairs, which we call labels. Prometheus also supports service discovery to find the targets to pull metrics from. This is flexible when targets change dynamically and frequently, especially in a Kubernetes cluster.

Metrics in Kata Containers 2.0 cover all components of Kata Containers, including the containerd shim v2 process, the hypervisor and VMM statistics, the agent running inside the guest OS, and the guest OS statistics. The first step of observability is to cover all the aspects of a complex system.

This page shows the metrics pipeline in Kata Containers 2.0. kata-monitor is used to collect metrics from the containerd shim v2 processes on the same node and does basic aggregation work; finally, a Prometheus server will pull metrics from kata-monitor. The containerd-shim-kata-v2 process provides metrics about itself and the hypervisor running on the host. The agent is responsible for gathering metrics inside the VM, including agent process metrics and guest OS metrics.

In the coming pages, I want to show some screenshots of the Kata Containers 2.0 metrics. First are metrics for the kata-monitor process itself. We can see some operational metrics, for example the scrape count: how many times the pull-metrics request was issued and how many of them failed. The running shim count represents the number of running pods. Also, metrics like the goroutine count, the used memory, and the open file descriptor count are collected. A hedged sketch of exposing this kind of data follows below.
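The talk doesn't show kata-monitor's source, so the following is only an illustrative sketch, with made-up metric names, of how a Go process exposes this kind of data in Prometheus format using the official client_golang library. Prometheus then pulls from the /metrics endpoint.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metrics, loosely modeled on the ones shown in the talk;
// they are NOT Kata's real metric names.
var (
	scrapeCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "demo_monitor_scrape_count",
			Help: "How many pull-metrics requests were issued.",
		},
		[]string{"result"}, // e.g. "ok" or "failed"
	)
	runningShimCount = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "demo_monitor_running_shim_count",
			Help: "Number of running shim processes (i.e. pods).",
		},
	)
)

func main() {
	// Register the custom metrics; Go runtime and process metrics
	// (goroutines, memory, open fds) come with the default registry.
	prometheus.MustRegister(scrapeCount, runningShimCount)

	runningShimCount.Set(3)
	scrapeCount.WithLabelValues("ok").Inc()

	// Expose everything on /metrics in the Prometheus text format.
	// Pull-based: no scrape traffic means no collection work here.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8090", nil))
}
```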
Here are the metrics of the containerd-shim-kata-v2 process, the main process of the Kata Containers runtime. Besides the basic metrics, we also collect RPC latency metrics for the CRI interface and the agent interface. Using this data, we can easily calculate the P99, P50, or average of the RPC cost. This page shows the metrics about the agent process that runs inside the guest OS.

And last, we will check the performance and overhead, a concern for some users. We collect many metrics: one sandbox will add more than 70 time series, and the gzipped size is about 7 kilobytes per request. The end-to-end time for a Prometheus server scrape operation is under 20 milliseconds, and the RPC call to get metrics from the agent is about 3 milliseconds. We can see this will not introduce noticeable overhead, even in production environments.

Thanks a lot for listening to this topic. I have only given a very basic introduction to Kata Containers 2.0 and the observability aspects of it. And here is more information about Kata Containers; issues and contributions to the Kata Containers repos are always welcome. At last, thanks again for watching. Thanks.