Hello and welcome everyone. My name is Svetomir Sprijanov and I'm going to talk about containers, Kubernetes, Linux kernel tracing, and how we can mix all of these together. I'm an open source engineer currently working for VMware. Over the years I have contributed to many different projects in many domains, including Linux kernel tracing, both in the kernel itself and in the user space ecosystem around it, as well as some IoT-related projects, and recently I made my first contributions to projects in the machine learning domain.

A short agenda of what I'm going to talk about today. First I'll do a brief overview of the currently existing approaches used to trace containers in a Kubernetes cluster. After that we'll go briefly through the tracing capabilities of the Linux kernel, how they work, and how we can use them to trace containers in user space. Then I will introduce the container tracer, a project that I started at the beginning of this year. At the end there is a short pre-recorded demo to show you how all this works.

So let's start. A lot of tools and frameworks exist that are used to trace containers in a Kubernetes cluster. There are two typical tracing scenarios: tracing our application during development, and tracing in production. Usually during development we want to verify that the application works the way it is designed. We do some functional testing, usually on a single instance of the application, and in that case the delays introduced by the tracing are not so important. On the contrary, in production the application is already deployed, we may have to chase some weird behavior caused by a bug, and we usually have to trace many instances of our application. Here the delays introduced by the tracing are not desirable and should be as small as possible. Even a single if statement in the code, checking whether to emit the traces or not, can cause significant delays if it sits in a place that is executed many times per second.

Currently there are two approaches in use. The first one is using a sidecar container. How it works: a helper container is injected into the pod under trace, the needed tracing and debugging tools live in that container, and it has direct access to all tasks that are running in all containers in the context of this pod. We can directly attach to those tasks, debug them, extract local logs, and even redirect the incoming and outgoing traffic so that it passes through this helper container, in order to analyze it. This approach is very flexible, because we usually have full control over this sidecar container and can set up in it only the tools that are needed for our specific use case, and no modifications are needed to the containers being traced. This makes it very suitable for tracing legacy code. But there are some disadvantages. Usually the tools in that helper container require root privileges in order to be able to attach to all these external tasks, so we have to run the entire pod under trace as root. While that is acceptable for the sidecar container itself, it is not always the case for the containers under trace: in some cases it is not desirable to run them with root privileges, and it may even be dangerous. Also, if we intercept the traffic to pass it through the sidecar container, it may introduce significant delays. And when we are going to trace a lot of containers, we have to inject a lot of sidecars, which can lead to a big overhead.
The other approach, used when new applications are developed, the OpenTelemetry approach, is to instrument our code to emit traces. Traces can be put directly into the code, or common libraries that are already instrumented to emit traces can be used. The approach is powerful because we can put the logs right inside the specific and important logic in our code and collect very detailed information about it. Here no root privileges are required, because the tracing is actually part of our application. But there are also some disadvantages. The obvious one: the containers must be modified in advance, so this is not suitable for tracing legacy code; it is applicable only to newly developed code. And this instrumentation code, once implemented, is always part of our application; when it is deployed in production we cannot remove it. It also adds complexity during development, because developers have to additionally design and implement all this tracing logic. And once deployed, it is very hard to add additional logs; we have to make another release of the application.

Is there any other approach that can be used? The majority of Kubernetes clusters run on the Linux operating system. All these containers that are running inside a Kubernetes node are actually an abstraction on top of the namespaces of the kernel. Namespaces are used to implement isolation between the containers, but from the kernel's point of view all the tasks that run inside them are regular user space processes that can be traced using the existing tracing framework of the kernel. The tracing framework of the Linux kernel has evolved over the past 20 or 30 years, from a simple set of printk calls in the beginning, just dumping some information on the console, to a highly optimized framework with very low overhead. That's why it is usually enabled by default in the kernels of the majority of Linux distributions, and it can also be used to trace systems in production. So this framework is already there in the majority of Kubernetes clusters, sitting in the kernel, enabled and waiting to be used.

But why is tracing in the kernel more efficient and optimized than tracing in user space? Let's look at a typical tracing implementation in a typical user space application and compare it with the existing implementation in the kernel. Usually in the application code there is some conditional check, in most cases an if statement, which checks whether emitting traces is enabled. This doesn't look heavy, just a couple of instructions, but if those instructions are in a part of the code that is executed thousands of times per second, they can introduce some delay even when the tracing is disabled. And when the tracing is enabled, some log function is usually called with a given string, and the log function has to write that string into a local file, print it on the console, or send it over the network. Depending on its implementation, that log function can introduce really significant delays. I have seen implementations of the log function that, for example, open the file for every write and close it afterwards, so for each log there is an open and a close of the file, or that initiate a new network connection, send the string, and then close the connection. But even if the function is optimized, writing the string to a file or sending it over the network is in any case a very slow operation.
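To make that cost concrete, here is a small sketch (hypothetical code, not taken from any particular project) of the naive user space logging pattern just described: the per-call check plus an open/write/close of the log file for every emitted line.

```python
# A minimal sketch of the naive user-space logging pattern described above.
# The flag, path and message format are illustrative assumptions.

import time

TRACING_ENABLED = True           # checked on every call, even when tracing is off
LOG_PATH = "/tmp/app-trace.log"  # hypothetical log file

def trace(msg: str) -> None:
    if not TRACING_ENABLED:      # this branch alone costs a few instructions per call
        return
    # Worst case seen in practice: open and close the file for every single line.
    with open(LOG_PATH, "a") as f:
        f.write(f"{time.time():.6f} {msg}\n")

def hot_loop(n: int) -> None:
    for i in range(n):
        trace(f"iteration {i}")  # called thousands of times per second
```

Every call pays for the branch, and when tracing is on it also pays for a file open, a write and a close; this is exactly the kind of overhead the kernel avoids with its in-memory ring buffer.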
Let's compare that with how it is currently implemented in the kernel. First, there are no conditional checks in the kernel. Instead, there are a few nop instructions at the places in the code where the traces should be emitted. These are no-operation instructions that just tell the CPU to do nothing, and on most recent CPUs they are optimized and usually take zero CPU cycles. So when the tracing is disabled in the kernel, this is the only overhead we have: just a couple of nop instructions that take zero CPU cycles. And when the tracing is enabled, those nop instructions are replaced with a call to a trace trampoline; that is, the kernel code is modified at run time, one instruction is replaced with another. This trace trampoline receives the same string that should be dumped somewhere, and it writes the string into an in-memory ring buffer that is optimized for lockless writing. There can be multiple writers and readers of that ring buffer without the need to lock it. So this is the whole overhead in the application context when the traces are logged; compare that to writing to a file or sending the string over the network. Of course there are limitations, because the ring buffer lives in memory and its size is limited, but from the application context the overhead is small, and usually there is some external logic outside of the application which reads the ring buffer during the trace and exports the traces.

So far so good, but what can be traced in the kernel, and how can we use it to trace containers in user space? First, almost every function in the kernel can be traced. This is by design: when a new function is implemented, the kernel developer doesn't need to add any tracing-specific logic around it; at compile time each kernel function is automatically prepended with those nop instructions. In addition to that, there are static trace events that kernel developers can put into the code, which extract very detailed information about important logic in a kernel subsystem. Currently there are more than 2,000 such events, covering almost every subsystem of the kernel. In addition to those static trace points there are dynamic trace points: at runtime, a trace point can be set at any offset of any function in the kernel. The most interesting part for user space tracing are uprobes, user probes: a trace point can be set on any user space function. How does it work? Before a user space application is executed, it is loaded in the kernel context, and at that point it becomes accessible to these tracing frameworks. A trace point can be attached to a user space function, and when the function is called, the trace is emitted. There are also uretprobes: we can attach a trace point to the return of the function, and that way we can track how much time the function takes to execute and, more interestingly, we can track all nested calls inside that function. There are also synthetic events: we can combine any existing static or dynamic trace points to implement more complex logic. There are also the CPU performance counters, hardware registers that exist in all modern CPUs, which can be used to track the performance of your application: to find which part of your application uses the most CPU, memory, network and other resources. And there is eBPF: using eBPF you can design your own trace logic and attach it to any of the existing trace points in the kernel. All this is in the kernel, but how can we use it from user space? There are tools and frameworks for that. Ftrace is the main framework that can be used from user space.
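As an illustration of those uprobes, here is a minimal sketch of placing one by hand through the Ftrace tracefs interface. Root is required; the /bin/bash binary and the 0x4245c0 offset are placeholders borrowed from the kernel's uprobetracer documentation and have to be resolved for the real target function (for example with nm or objdump).

```python
# A rough sketch of defining, enabling, reading and removing a uprobe via tracefs.
# Assumes tracefs is mounted at /sys/kernel/tracing and the script runs as root.

TRACEFS = "/sys/kernel/tracing"

def tracefs_write(path: str, value: str, mode: str = "w") -> None:
    with open(f"{TRACEFS}/{path}", mode) as f:
        f.write(value)

# Define a uprobe called "p_readline" on bash's readline function.
# The offset is only a placeholder and must be resolved for your binary.
tracefs_write("uprobe_events", "p:p_readline /bin/bash:0x4245c0\n", mode="a")
tracefs_write("events/uprobes/p_readline/enable", "1")

# Read a few events from the lockless ring buffer as bash calls readline.
with open(f"{TRACEFS}/trace_pipe") as pipe:
    for _ in range(5):
        print(pipe.readline().rstrip())

# Clean up: disable the event and remove the probe.
tracefs_write("events/uprobes/p_readline/enable", "0")
tracefs_write("uprobe_events", "-:p_readline\n", mode="a")
```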
Ftrace can be used to enable and disable those trace points in the kernel, to set new dynamic trace points, to configure different tracing sessions, and to run those tracing sessions in parallel; each tracing session has its own dedicated ring buffer, so there can be multiple of them. The other interesting tool is perf. It uses the CPU performance counters to analyze the performance of your application; using those hardware registers it can point you to the exact backtrace in your code that causes the most CPU usage. And if perf and Ftrace are not enough, bpftrace can be used to implement your own tracing logic and attach it to any of the trace points.

So containers, from the kernel's point of view, are just groups of tasks. The idea of tracing them is simply to use any of these tools and attach them to the tasks that are running in those containers, but we have to identify those tasks the same way the kernel sees them. The container tracer project was introduced exactly for that: to make the bridge between the tracing in the kernel and the containers running in a Kubernetes cluster. Its goal is to be a simple and efficient tool for low-level system tracing per container. It is an open source project originated by VMware at the beginning of this year. It leverages the existing Linux kernel tracing frameworks for tracing containers. It is designed natively for Kubernetes, but can trace any containers running on a Linux system. It does not use sidecar containers and does not modify the traced containers in any way.

How does it work? The idea is very simple. There is a Tracer Node which runs inside its own container and can detect all containers that are running on the system. In the Tracer Node there is a set of trace hooks, implemented using one of those tracing frameworks, and any of those trace hooks can be attached to any of the containers. There is a REST API which is used by the clients, and the traces are exported outside of the Tracer Node. As you can see in this picture, there is still no Kubernetes; this is the minimum installation. You can just run that Tracer Node on the system and it can detect all the other containers. In that case it uses the information from the proc file system to extract the list of containers, the list of tasks for each container, and the files associated with each container, all the information that is needed for tracing.

This is a brief overview of the architecture of the Tracer Node. It has some in-memory databases which have meaning only for this instance of the Tracer Node. There is logic which periodically looks for containers, extracts for each container the list of tasks and files associated with it, and stores that information in the containers database. There is a trace session database which holds the current and future tracing sessions and their current state; they can be running or stopped. Another important part is the set of trace hooks that can be attached to any of the containers, and the logic that reads the traces and sends them outside of the container. And there is a REST API exposed to the clients, which can be used to get the list of available pods in the system that can be traced, the list of trace hooks that can be attached to any of those pods, and the list of trace sessions, plus POST requests that can be used to create a new trace session, to start and stop it, and to delete it.
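Going back to the standalone case for a moment, the discovery of containers from the proc file system could be sketched roughly like this. This is not the project's actual implementation, just one plausible way to group tasks by the container ID that container runtimes typically embed in /proc/&lt;pid&gt;/cgroup.

```python
# A rough sketch of container discovery using only the proc file system:
# every task's /proc/<pid>/cgroup names the cgroup it belongs to, and for
# containers that path usually embeds a container ID (the exact layout
# depends on the container runtime and cgroup driver).

import os
import re
from collections import defaultdict

# Matches the 64-hex-digit container IDs commonly found in cgroup paths.
CONTAINER_ID_RE = re.compile(r"([0-9a-f]{64})")

def discover_containers() -> dict[str, list[int]]:
    containers: dict[str, list[int]] = defaultdict(list)  # container ID -> PIDs
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cgroup") as f:
                cgroup = f.read()
        except OSError:
            continue  # the task exited while we were scanning
        match = CONTAINER_ID_RE.search(cgroup)
        if match:
            containers[match.group(1)].append(int(pid))
    return containers

if __name__ == "__main__":
    for cid, pids in discover_containers().items():
        print(f"{cid[:12]}: {len(pids)} task(s) {pids}")
```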
When running in Kubernetes the picture is a bit different. There is a Tracer Node running on each Kubernetes node, and a DaemonSet which is responsible for managing all these Tracer Node instances. There is one instance of the Tracer API, which exposes the same REST API to the clients and acts as a proxy: each request received from a client is forwarded to all active Tracer Nodes in the system. A trace session can run on multiple nodes, depending on where the containers are running, so each Tracer Node is responsible for its local part of the trace session, and the Tracer API has the global picture. In that case the Tracer Node uses the Kubernetes API, instead of the proc file system, to extract the list of containers, their tasks and the related information.

This is the brief architecture of the Tracer API. There is logic which periodically discovers all active Tracer Node instances and maintains them in a local database. The same REST API is exposed, along with the API proxy logic: each request is forwarded to all active Tracer Nodes, and their replies are received in the Tracer API, aggregated, and sent back to the client as one reply.

The container tracer does not store the traces internally; it can only read them from those ring buffers, so they have to be exported to some external database. Trace exporters are used for that. Currently there is only one trace exporter implemented, which uses the OpenTelemetry SDK to export the traces to an external Jaeger database, so at the moment having Jaeger installed in the cluster is mandatory in order to be able to collect traces.

This is a short picture of the flow of the traces and the life cycle of a trace session. When a request is received on the REST API from the user, a trace session is created. In that request the user can specify the names of the pods that have to be traced and the names of the containers inside those pods. Wildcards are supported, so we can specify only part of the name of the pods or containers. The session is created internally in the database at that time, but its initial state is stopped. A second request has to be sent to start the session. When the session is started, the trace hook specified by the user is attached to all tasks that are currently running in those containers, the trace in the kernel is enabled, and the tracing starts. At the same time a reader job is started, which reads the ring buffer dedicated to that tracing session and exports all traces from there to the external Jaeger database. The session can be started and stopped multiple times, and each time the trace hook is attached to or detached from the containers. Finally, the session can be deleted.
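As a client-side illustration of this life cycle, a rough sketch might look like the following. The endpoint paths, field names, host and port here are assumptions made for illustration only; the real API is documented in the project itself.

```python
# A hedged sketch of driving a trace session life cycle over a REST API
# like the one described above, using hypothetical endpoints.

import requests

API = "http://tracer-api.example:8080"  # hypothetical Tracer API address

# Create a session: which pods/containers to trace and with which trace hook.
session = requests.post(f"{API}/trace-session", json={
    "name": "demo-oss",
    "pod": "postgres*",         # wildcard on the pod name
    "container": "*",           # all containers in the matched pods
    "trace-hook": "trace_syscalls",
}).json()
session_id = session["id"]

# Start the session: the hook is attached to all tasks of the matched containers.
requests.put(f"{API}/trace-session/{session_id}", json={"run": True})

# ... meanwhile the reader job exports the ring buffer contents to Jaeger ...

# Stop and finally delete the session.
requests.put(f"{API}/trace-session/{session_id}", json={"run": False})
requests.delete(f"{API}/trace-session/{session_id}")
```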
There is a short demo that I recorded to show how all this works. This is the list of the pods running in my development Kubernetes cluster; it has a lot of pods. The container tracer itself runs in its own namespace. There is only one Tracer Node in my setup, because I have only one node in the cluster, and one Tracer API instance. This GET request can be used to get the list of available pods that can be traced; this is the list of all these pods. For each pod there is the name of the pod, the list of containers running inside it, and other information that is used for tracing. This GET request returns the available trace hooks. Currently there is only one trace hook implemented: it traces the system calls that are used by a given container, and it is implemented using the Ftrace framework of the kernel. Here you have the name of the trace hook and a short description of it. This GET request lists all the currently configured tracing sessions; currently there is no tracing session configured in our system.

We can create a tracing session using this session JSON helper file. Here we specify what we want to trace: all postgres pods will be traced in this session, all pods that have "postgres" at the beginning of their name, and all containers running inside those pods. The system call trace hook will be used for this tracing session; it is going to be attached to all these containers. And the name of our session will be demo OSS. OK, our session is created with this POST request, passing the session JSON file. The session has a unique number that can be used to control it. The container that is going to be traced is the PostgreSQL container running inside the postgresql-0 pod. The initial state of the session is stopped, not running. We can use the run JSON helper file to start the session: it just specifies run true, and we post that file to the REST API with the ID of our newly configured tracing session. Now the session is started: the system call trace hook is attached to the PostgreSQL container, the state of the session is running, and it is currently collecting the system calls used by the PostgreSQL container. We can stop the session using the stop.json file, which just specifies run false, using the same API call and passing stop.json instead of run.json. The session is stopped, the hook is detached from the postgres container, and the trace is disabled in the kernel.

Let's see what traces were collected in this session. This is the user interface of Jaeger in my cluster; the name of our session is demo OSS. There is a timeline at the top where you can see when those system calls were caught, and a list of all the system calls; there are around 400 system calls collected over 40 seconds. For each system call we have its name, the timestamp in the kernel when the event happened (this is kernel time), and the input parameters of the system call. This system call, for example, got 2000 bytes.

This was a short overview of how it works. How can we use that information? Currently the project is in its proof of concept stage and it is very limited: it has only one tracing hook available and only one trace exporter to an external database. But even with this limited information a lot of interesting use cases can be implemented. The system calls are actually the interface between the application and the kernel, and all interactions of a given container with the external world pass through those system calls.
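Regarding that single exporter, here is a rough sketch, not the project's actual code, of how trace events read from the kernel ring buffer could be pushed into Jaeger through the OpenTelemetry Python SDK. The service name, agent address and event format are assumptions made for illustration, and the opentelemetry-sdk and Jaeger exporter packages are assumed to be installed.

```python
# A hedged sketch of an exporter that turns kernel trace events into Jaeger spans
# via the OpenTelemetry SDK.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

provider = TracerProvider(resource=Resource.create({"service.name": "demo-oss"}))
provider.add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="jaeger-agent", agent_port=6831))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("container-tracer-exporter-sketch")

def export_syscall_event(name: str, timestamp_ns: int, args: str) -> None:
    """Turn one system call event from the ring buffer into a Jaeger span."""
    span = tracer.start_span(name, start_time=timestamp_ns,
                             attributes={"syscall.args": args})
    span.end(end_time=timestamp_ns)  # point-in-time event, zero duration

# Example: one made-up event, as if it had just been read from the ring buffer.
export_syscall_event("sys_enter_recvfrom", 1_700_000_000_000_000_000, "fd=5, len=2000")
provider.shutdown()  # flush any spans still queued in the batch processor
```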
So by observing the system calls we get some idea of what a given container is doing. This is suitable for black-box tracing: when we have no idea what is inside the container but want to get some idea of what it is doing. Another interesting idea is that, using the sequence of the system calls, their frequency and their input parameters, you can train some machine learning algorithm to detect anomalies in our container, for example for intrusion detection, or for detecting abnormal execution due to a bug. The other interesting use is workload characterization: based on the system calls we can detect whether the workload running in the container is CPU intensive, memory intensive, network intensive or I/O intensive, and based on that information make some scheduling decisions in the cluster (a small sketch of this idea follows after the questions). This was my final slide; this is the link to the project if you are interested. Thank you, and there is time for questions.

It depends on what a busier system means. It is actually highly optimized; I tested it with millions of system calls per second on my laptop and it works, but it depends on the scale.

Yes, all these frameworks can filter per process ID, for example; a lot of filters can be applied. All the filtering is by process ID, and those tracing sessions are implemented in that way: a list of tasks is passed to them, and only the system calls of those tasks are recorded.

But that is only for the container tracer itself; there is no requirement for the application being traced to run as root. Yes, access to the kernel tracing is root only.

Yes, I tested it without Kubernetes. In the proc file system there is enough information for that: which containers are running, the list of tasks in each container, the files. Yes, this is in the proc file system. Yes, currently yes, because it is a proof of concept.

OK, there are no more questions. If you are interested, feel free to check out the project and contribute.
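As promised above, here is a toy sketch of the workload characterization idea: count the collected system calls per category and guess whether the container is CPU, memory, network or I/O bound. The category lists and the threshold logic are illustrative assumptions, not part of the project.

```python
# A toy workload classifier based on system call names collected for a container.

from collections import Counter

CATEGORIES = {
    "network": {"recvfrom", "sendto", "recvmsg", "sendmsg", "connect", "accept4"},
    "io":      {"read", "write", "pread64", "pwrite64", "fsync", "openat"},
    "memory":  {"mmap", "munmap", "brk", "madvise"},
}

def characterize(syscalls: list[str]) -> str:
    counts: Counter[str] = Counter()
    for name in syscalls:
        for category, names in CATEGORIES.items():
            if name in names:
                counts[category] += 1
    # Few categorized system calls overall usually means the task is busy in user space.
    if not counts or counts.most_common(1)[0][1] < len(syscalls) * 0.2:
        return "cpu"
    return counts.most_common(1)[0][0]

print(characterize(["recvfrom", "sendto", "recvfrom", "brk", "write"]))  # -> network
```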