Hi there, thanks for joining us today. This is Wish from the TiKV community. I'm an infrastructure engineer at PingCAP and also a core contributor to the TiKV project. I'm very glad to share, together with my colleague Zhenchi, how we improved TiKV observability. Recently, we added tracing events to TiKV with very little overhead, only a few nanoseconds per event, and we would like to share how we did it. Hopefully this is helpful.

First, let me introduce the TiKV project we are working on. TiKV is a key-value database. It is distributed, transactional, and open source, and it has recently become a graduated CNCF project. So far, it has more than 8,000 stars and more than 200 contributors on GitHub.

As a key-value database, TiKV accepts key-value read and write requests, for example get and put. Sometimes there are jitters: a write request may suddenly take a long time while most others are normal. This can happen for different reasons, and we would like to know why.

Today we have logging and metrics. Logging is not very useful in this case, because it is usually hard to link the logs related to a single request together. Metrics are sometimes not useful either: they only reveal aggregated information like average latency, and when multiple workloads are mixed together, a jitter from a single request is hidden. As a result, we want to use tracing to understand how a jitter happens.

TiKV is written in Rust, and there are multiple tracing libraries available that are compatible with OpenTracing or OpenTelemetry. This seemed nice, but we immediately met some challenges, and the toughest one is performance. Jitter happens very rarely, for example only once a week. This means we need to trace all requests in order not to miss it. And as a key-value database, each request takes a very short time, only a few microseconds. Thus the tracing facility must be super efficient, negligible compared to those few microseconds.

The second challenge is that there are multiple batch systems in TiKV. These systems receive multiple incoming requests and process them together. As in this picture, multiple write requests are accumulated and then a single disk write is performed. Some tracing libraries are not designed for this case, but we would still like to keep the full details of each request.

To resolve these issues, we had to develop our own tracing library. I would like to invite Zhenchi to share this part with you.

I am going to introduce our tracing library, minitrace. The name means that it is lightweight, concise, and focused on performance. It is still a POC prototype so far; we are working on making it stable and production ready.

We designed and developed minitrace with the primary goal of high performance. Here are the results of our micro-benchmarks and integration benchmarks. On the generation and collection of a span, with 0.02 microseconds of latency, our tracing library was 17.5 times faster than rustracing and 100 times faster than tokio tracing. In the integration benchmark, we traced 100 events per request with different tracing libraries and then recorded the QPS of point-get requests. As you can see, while rustracing halved the original QPS, minitrace only reduced it by six percent. I will explain what optimizations we have done to achieve such performance.

The first key to performance is to reduce contention. Contention happens when a shared resource is accessed concurrently by multiple threads or coroutines. In most implementations, spans from multiple threads are simply pushed to the same span collector, which is globally shared. Threads access and modify the same resource, causing contention: they have to pay the overhead of locks or atomic variables for every span. minitrace does not push one span at a time to the global span collector. Instead, it collects spans into thread-local buffers first. After the work in a thread is done, the spans in the thread-local buffer are collected into the global collector in a batch. In this way, the global collector is accessed much less often, contention is reduced, and performance improves.
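To make the idea concrete, here is a minimal sketch of thread-local batching in Rust. It illustrates the technique only and is not minitrace's actual API; the `Span` type and the `record_span` and `flush_local_spans` helpers are hypothetical.

```rust
use std::cell::RefCell;
use std::sync::Mutex;

struct Span {
    name: &'static str,
    begin_ns: u64,
    end_ns: u64,
}

// The globally shared collector: the contended resource.
static GLOBAL_COLLECTOR: Mutex<Vec<Span>> = Mutex::new(Vec::new());

thread_local! {
    // Per-thread buffer: the hot path touches no locks or atomics.
    static LOCAL_SPANS: RefCell<Vec<Span>> = RefCell::new(Vec::new());
}

fn record_span(span: Span) {
    // Hot path: push to the thread-local buffer only.
    LOCAL_SPANS.with(|spans| spans.borrow_mut().push(span));
}

fn flush_local_spans() {
    // Cold path: after the thread's work is done, move the whole buffer
    // into the global collector in one batch, taking the lock only once.
    LOCAL_SPANS.with(|spans| {
        let batch = spans.borrow_mut().split_off(0);
        GLOBAL_COLLECTOR.lock().unwrap().extend(batch);
    });
}

fn main() {
    let handles: Vec<_> = (0..4u64)
        .map(|i| {
            std::thread::spawn(move || {
                record_span(Span { name: "get", begin_ns: i, end_ns: i + 100 });
                flush_local_spans();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("collected {} spans", GLOBAL_COLLECTOR.lock().unwrap().len());
}
```

With this layout, the mutex is taken once per thread batch instead of once per span, which is exactly where the contention savings come from.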
The second key to performance is to make timing faster. Let's look at how a span is built. For each span, the tracing library records when the span starts and when it ends, so timing performance is important. Common tracing libraries retrieve either the system time or a monotonic time by accessing the monotonic clock. In our environment, each monotonic clock access takes 25 nanoseconds. A span needs two clock reads, one at the start and one at the end, so with 10 spans in a key-value get request, the total latency caused by tracing becomes 10 × 2 × 25 = 500 nanoseconds. Remember that a TiKV get request may take only about 1,000 to 3,000 nanoseconds, so this is at least a 16% latency overhead. Another choice is CLOCK_MONOTONIC_COARSE. It is fast and results in only a 3% latency overhead. However, its precision is only about 4 milliseconds according to our benchmarks, which limits its usage.

Instead of using these clocks, we use the timestamp counter (TSC) register available in modern Intel and AMD CPUs. Its value can be read via the RDTSCP instruction. The TSC register is very efficient and has high precision: each read takes only 8 nanoseconds in our environment. However, TSC is not perfect. On some CPUs, TSC is not synchronized across different cores. A hardware-synchronized TSC can be discovered by checking certain CPU flags, but even with these flags, we found that TSC may still be unsynchronized due to an unstable environment or CPU flaws. minitrace carefully handles these situations to ensure that the TSC value is reliable and can be used to measure time. When minitrace detects that TSC is not available or not reliable, it falls back to CLOCK_MONOTONIC_COARSE. (The first sketch at the end of this section shows the idea.)

The final key to performance is to reduce serialization. Serialization happens when the spans collected in memory need to be sent to a tracing storage such as Jaeger. Since key-value requests are very frequent in TiKV, the tracing results are also reported very frequently, and serialization may take a long time. To reduce the serialization cost, in TiKV we collect the spans of all requests but only selectively send them to the tracing storage. The selection is based on request latency: only requests that take a long time are reported. This differs from sampled collection in that we will not miss any jitter. (See the second sketch below.)

To trace batch systems, minitrace supports merging the trace contexts of different requests into a single context. This merged context is then shared by the following child spans. Finally, the shared spans are collected separately into the collectors of those requests. (The third sketch below illustrates this.)
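To make the timing approach concrete, here is a minimal, x86_64-only sketch of TSC-based timing in Rust, using the plain RDTSC instruction for simplicity. The calibration against `std::time::Instant` is an illustrative assumption; minitrace's real implementation additionally verifies that the TSC is stable and synchronized across cores before trusting it.

```rust
// Read the timestamp counter; each read takes only a few nanoseconds.
fn rdtsc() -> u64 {
    // SAFETY: RDTSC is available on all x86_64 CPUs.
    unsafe { core::arch::x86_64::_rdtsc() }
}

fn main() {
    use std::time::{Duration, Instant};

    // Calibrate once: how many TSC cycles elapse per nanosecond?
    let (tsc0, t0) = (rdtsc(), Instant::now());
    std::thread::sleep(Duration::from_millis(100));
    let (tsc1, t1) = (rdtsc(), Instant::now());
    let cycles_per_ns = (tsc1 - tsc0) as f64 / (t1 - t0).as_nanos() as f64;

    // Time a span by reading the TSC at its start and its end.
    let begin = rdtsc();
    let value = (0..1000u64).sum::<u64>();
    let end = rdtsc();
    println!(
        "sum = {value}, span took ~{:.0} ns",
        (end - begin) as f64 / cycles_per_ns
    );
}
```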
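Here is a rough sketch of the latency-based selective reporting described above. The 1 millisecond threshold and the `report` sink are assumptions for illustration, not TiKV's actual configuration.

```rust
use std::time::Duration;

struct Span {
    name: &'static str,
    begin_ns: u64,
    end_ns: u64,
}

// Illustrative threshold: only requests slower than this are reported.
const SLOW_THRESHOLD: Duration = Duration::from_millis(1);

fn report(spans: &[Span]) {
    // Placeholder for the expensive part: serializing the spans and
    // sending them to a tracing storage such as Jaeger.
    for s in spans {
        println!("{}: {} ns", s.name, s.end_ns - s.begin_ns);
    }
}

fn finish_request(spans: Vec<Span>) {
    // Request latency, assuming spans are ordered by time.
    let latency_ns =
        spans.last().map_or(0, |s| s.end_ns) - spans.first().map_or(0, |s| s.begin_ns);
    // Every request is traced and inspected, so no jitter is missed,
    // but only slow requests pay the serialization cost.
    if Duration::from_nanos(latency_ns) >= SLOW_THRESHOLD {
        report(&spans);
    }
    // Fast requests fall through: their spans are simply dropped.
}

fn main() {
    // A fast request: collected, then dropped without serialization.
    finish_request(vec![Span { name: "get", begin_ns: 0, end_ns: 2_000 }]);
    // A slow request: collected and reported.
    finish_request(vec![Span { name: "get", begin_ns: 0, end_ns: 5_000_000 }]);
}
```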
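Finally, here is a sketch of how the shared spans of a batch system can be fanned out to the collector of every merged request. The `MergedContext` type is hypothetical and only illustrates the idea, not minitrace's actual mechanism.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Clone)]
struct Span {
    name: &'static str,
    begin_ns: u64,
    end_ns: u64,
}

// Each request owns its own collector; here it is just a channel sender.
struct MergedContext {
    collectors: Vec<Sender<Span>>,
}

impl MergedContext {
    fn merge(collectors: Vec<Sender<Span>>) -> Self {
        MergedContext { collectors }
    }

    // A span produced by the shared batch work is cloned into the
    // collector of every merged request.
    fn record(&self, span: Span) {
        for c in &self.collectors {
            let _ = c.send(span.clone());
        }
    }
}

fn main() {
    // Two write requests arrive and are batched together.
    let (tx1, rx1): (Sender<Span>, Receiver<Span>) = channel();
    let (tx2, rx2) = channel();
    let ctx = MergedContext::merge(vec![tx1, tx2]);

    // One disk write serves both requests; its span is shared.
    ctx.record(Span { name: "disk_write", begin_ns: 0, end_ns: 800 });

    // Each request still sees the shared span in its own trace.
    assert_eq!(rx1.recv().unwrap().name, "disk_write");
    assert_eq!(rx2.recv().unwrap().name, "disk_write");
}
```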
We are also glad to build on related work in the community: a subset of OpenTracing is implemented for performance, and minitrace supports reporting to Jaeger. The amazing Jaeger UI greatly eases our verification work. Here is the GitHub repository link of minitrace, and you can use minitrace in your own projects now. Some of the optimizations will be contributed to opentelemetry-rust, and we hope that one day the official Rust client can adopt all of them. The upcoming TiKV 5.0 will support the tracing feature provided by minitrace. We hope you enjoyed this talk. Welcome to contact us through the following channels.