Well, hello, everyone. I'm so excited to be here at KubeCon Observability Day and to see so many attendees from this stage. Today I will be sharing our story of correlating metrics, logs, and traces. I will walk you through our journey from starting with SDKs to eventually bringing in eBPF as an enhancement.

First off, let me introduce myself. My name is Zhu Jiekun. I'm a software engineer and a member of the OTel community. I joined Chuang last year, where I've been mainly focused on building an observability platform for our business.

Today's talk is divided into several parts. First, I will share our experience with observability and the infrastructure it's based on. Then I will cover how we construct metrics and logs from trace spans, why we need standard signals, and how we achieve that. After that, I will talk about why we brought eBPF into the mix and the challenges we ran into while gathering data with it. And I'll tie everything together with a final wrap-up.

First, our journey. I promise this is the only propaganda slide. Our business is primarily focused on interest-based social networking. TDChat is our voice app for gamers. It's quite similar to TeamSpeak on the desktop, but it's specifically designed for voice chat in mobile esports gaming.

Going back to our observability journey: we use Prometheus to collect metrics. Because our applications are widely distributed across different cloud providers and various Kubernetes clusters, our Prometheus is deployed alongside Thanos, which gives us global querying across multiple Prometheus instances and a long-term storage solution. For tracing and logging, we adopted Tencent Cloud as our vendor. Data is collected by the application or a log agent and sent to the vendor. We also run an OpenTelemetry Collector for modifying and filtering spans, providing certain monitoring capabilities, and keeping the flexibility to switch between different vendors. On top of these infrastructure components, we built an observability platform that integrates with our configuration management database, allowing us to categorize applications hierarchically and combine observability data into one centralized platform.

Last year, my colleagues told me that while the data integration platform is good, it's not much different from searching for data in various different places. So we started to think about doing more on top of this data foundation. When a user sees a metric that is on fire, if he has enough experience, he may go and check the corresponding logs. But if the user is not familiar with it, or if the person who noticed the alarm is an SRE on duty and not the application developer, how can he find the root cause? If we can provide logs and traces corresponding to the metrics, or in other words, link all of them together, things become much simpler. So we started looking for potential solutions in the community to correlate different signals.

Before we dive into the solution we used, which is the span metrics connector, it's essential to first explore the universal idea of correlating signals. When we think about how to correlate different signals, we need to find what they have in common. It becomes apparent that traces have trace IDs and logs can record those trace IDs. While metrics may not retain every trace ID, they can attach a handful of notable trace IDs as exemplars when samples are recorded. So in theory, the trace ID enables us to correlate all pieces of data.
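To make that idea concrete, here is a minimal Go sketch, purely illustrative and not our production code: the handler writes the current trace ID into its log line and attaches it as an exemplar on a Prometheus histogram sample. The metric and label names are placeholders, and it assumes an OTel middleware has already put a span into the request context.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/trace"
)

// Request latency histogram; its samples can carry exemplars (exposed via OpenMetrics).
var reqDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency by path.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path"},
)

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		// Trace ID of the current request, assuming an OTel SDK or middleware
		// has already stored a span in the request context.
		traceID := trace.SpanContextFromContext(r.Context()).TraceID().String()

		// Logs record the trace ID, so a log line can be joined to its trace.
		log.Printf("handled path=%s trace_id=%s", r.URL.Path, traceID)

		// Metrics attach a notable trace ID as an exemplar on the sample.
		if obs, ok := reqDuration.WithLabelValues(r.URL.Path).(prometheus.ExemplarObserver); ok {
			obs.ObserveWithExemplar(time.Since(start).Seconds(),
				prometheus.Labels{"trace_id": traceID})
		}
	}()
	w.WriteHeader(http.StatusOK)
}

func main() {
	prometheus.MustRegister(reqDuration)
	http.HandleFunc("/api", handle)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```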
But practically speaking, in our system it's more common to see logs without trace IDs and metrics without exemplars, a consequence of legacy issues. Ideally, we could make changes on a large scale, pushing our developers to include trace IDs in their logs and metrics. However, implementing large-scale changes requires significant human effort and time. Therefore, we were looking for a solution where most of the work of correlating signals could be done by our observability team.

If you are familiar with distributed tracing, you know that after a trace span is reported, it is received, processed, and exported to the vendor by the OTel Collector. A trace span contains a lot of information, including the application's name, the span type (for example, whether it's an RPC request or a database query), the duration, and the response status. Using this information, it's entirely possible to construct metrics like queries per second, latency, and error rate, also known as the RED metrics. The OpenTelemetry Collector offers a component called the span metrics connector that makes this possible. It converts spans into metrics and carries the attributes of the spans as labels; we modified it so that the exported metric names and label names follow our own standard. Based on the same idea, and since spans can also carry information such as query statements and HTTP arguments, we also developed a span log connector which exports logs to the vendor. More importantly, the data transformed from spans carries a trace ID or an exemplar, and we have full control over it. This means we don't need to push users to make any changes in order to get correlated signals.
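To give a feel for what the span metrics connector does, here is a rough Go sketch of the transformation, not the connector's actual code: each finished span becomes a counter increment and a latency sample, with selected span attributes as labels and the trace ID attached as an exemplar. The struct, metric, and label names are placeholders standing in for our internal standard.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Span is a simplified view of a finished trace span; field names are illustrative.
type Span struct {
	Service  string
	Name     string
	Kind     string // e.g. "rpc" or "db"
	Duration time.Duration
	IsError  bool
	TraceID  string
}

// Standardized metric names and label sets, roughly following the RED idea.
var (
	callsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "calls_total",
		Help: "Number of spans observed per service/operation/status.",
	}, []string{"service", "span_name", "span_kind", "status"})

	callDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "call_duration_seconds",
		Help:    "Span duration per service/operation.",
		Buckets: prometheus.DefBuckets,
	}, []string{"service", "span_name", "span_kind"})
)

// consumeSpan derives a counter increment and a latency sample from one span,
// carrying the span's trace ID as an exemplar so the metrics stay correlated.
func consumeSpan(s Span) {
	status := "ok"
	if s.IsError {
		status = "error"
	}
	exemplar := prometheus.Labels{"trace_id": s.TraceID}

	if c, ok := callsTotal.WithLabelValues(s.Service, s.Name, s.Kind, status).(prometheus.ExemplarAdder); ok {
		c.AddWithExemplar(1, exemplar)
	}
	if h, ok := callDuration.WithLabelValues(s.Service, s.Name, s.Kind).(prometheus.ExemplarObserver); ok {
		h.ObserveWithExemplar(s.Duration.Seconds(), exemplar)
	}
}

func main() {
	prometheus.MustRegister(callsTotal, callDuration)
	// In the real pipeline, spans arrive from the collector's trace stream.
	consumeSpan(Span{Service: "demo", Name: "GET /users", Kind: "rpc",
		Duration: 42 * time.Millisecond, TraceID: "4bf92f3577b34da6a3ce929d0e0e4736"})
}
```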
You may be curious about how much extra resource it takes to add this logic outside the original trace pipeline. We conducted benchmark tests on the span metrics connector with workloads ranging from 2,500 to 20,000 spans per second. The purple column and line represent the trace pipeline, while the red columns and line represent the trace-and-metrics pipeline, which includes both the span metrics connector and the Prometheus exporter. As you can see, there is almost no difference in CPU usage between the two, with the additional pipeline causing only a slight increase in CPU utilization. When examining the memory data, the trace-and-metrics pipeline uses 38% more memory on average. This increase is tied to the number of different combinations of label values in the metrics, also known as the cardinality of the label values.

Those are the benchmark results. As for actual usage, I'm sorry that I cannot share data from our production environment due to auditing reasons. Instead, I took the data from one of our largest test clusters, which reports 14,000 spans per second. After transforming them into metrics and logs, the CPU usage of the OTel Collector increased by 4% and memory usage went up by 24%. I believe that, cost-wise, this is a relatively inexpensive solution for obtaining correlatable signals without having to modify any user code.

However, as more and more teams expressed interest in using this solution, we faced a challenge due to the diversity of the spans they reported. For instance, spans describing RPC requests might have different span names and attribute formats. To generate consistent metrics, we had to standardize those spans using various span or trace processors before the transformation, which led to a growing number of configurations to be managed. This is quite impractical, and the more pipelines we have, the more resources they consume, along with the human effort needed to maintain them. Our ultimate goal is to make sure that the metrics and logs we get from traces are all in standard formats. We realized that maintaining all these different configurations was only necessary because of the diversity of spans. If there were a way to standardize all the spans, or at least the important ones we care about, such as those for HTTP or Redis calls, that would be quite beneficial, and the configuration of the OpenTelemetry Collector could also be simplified. So our task boiled down to unifying and standardizing the important spans.

The first thing we did was to design standardized attributes for spans, specifying the enumeration of keys and values. This is our attributes protocol. Well, actually this is not a good practice, because OTel has already defined a common set of semantic attributes. I would recommend using them if there are no concerns regarding legacy issues. But anyway, now we have a standard within the organization. Establishing the protocol was straightforward, but pushing everyone to adopt the changes has been challenging. To simplify the changes, we also created middlewares and interceptors for various frameworks and client SDKs. These extension points are provided by the frameworks themselves, and most frameworks and client SDKs can be integrated with just one or two lines of code changes. After integration, the spans they report conform to the protocol we have established. Actually, it doesn't matter whether you use a customized SDK middleware or an open source project, as long as they report according to a single protocol across the whole organization.

So far everything seemed to be going pretty well, except that we suddenly found our Prometheus instances kept going down. If you think about it, we are generating so-called standard metrics for those HTTP and MySQL calls and others, which obviously include labels with high cardinality. If anyone runs a query without the correct label filters, such as service name, cluster, or namespace, Prometheus may crash. To safely use those standard metrics, we added a PromQL proxy in front of our panels. It parses every request, separating out the metrics and labels in the PromQL for a specific metric. If the necessary labels are missing or the query time range is too broad, it rejects the request. We also set a TTL for those standard metrics, so if a time series hasn't been updated for a while, it is dropped from the scrape results. These are fairly common measures to protect Prometheus.
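As a rough illustration of the proxy's core check, here is a sketch in Go using the Prometheus PromQL parser. It is not our actual proxy code: the `std_` metric prefix and the required label names are placeholders for our own conventions.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/promql/parser"
)

// Labels we require on every query that touches a high-cardinality "standard" metric.
var requiredLabels = []string{"service", "cluster", "namespace"}

// checkQuery parses the PromQL expression and rejects it if any selector for a
// standard metric is missing one of the required label filters.
func checkQuery(query string) error {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return fmt.Errorf("invalid PromQL: %w", err)
	}

	var vErr error
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		vs, ok := node.(*parser.VectorSelector)
		if !ok || !strings.HasPrefix(vs.Name, "std_") { // only guard standard metrics
			return nil
		}
		present := map[string]bool{}
		for _, m := range vs.LabelMatchers {
			if m.Type == labels.MatchEqual && m.Value != "" {
				present[m.Name] = true
			}
		}
		for _, required := range requiredLabels {
			if !present[required] {
				vErr = fmt.Errorf("query on %q must filter on label %q", vs.Name, required)
				return vErr
			}
		}
		return nil
	})
	return vErr
}

func main() {
	// Rejected: no service/cluster/namespace filter on a standard metric.
	fmt.Println(checkQuery(`sum(rate(std_http_calls_total[5m]))`))
	// Accepted.
	fmt.Println(checkQuery(`sum(rate(std_http_calls_total{service="demo",cluster="c1",namespace="prod"}[5m]))`))
}
```

In the real proxy, a similar check also bounds the query time range before the request is forwarded upstream.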
In addition to all this, we also tried out an experimental feature of Prometheus: the native histogram. Native histograms have been introduced many times at past KubeCons, so I won't go over the implementation again. After enabling native histograms, we can observe a change in the precision of the collected data. The resolution of the data on the right is visually higher. This is also from our test cluster, and it may not look impressive due to the lack of sufficient samples, so I'd like to reuse a graph from a previous speaker to show that native histograms provide significantly higher precision, and they come with the bonus of better performance and resource usage. So check this out. With native histograms on, we are seeing 30% faster scrape times for those high-cardinality targets, and for queries, native histograms with different bucket factors speed things up by 60% to 80% compared to conventional histograms when dealing with 13,000 time series. The bucket factor affects the resolution of a native histogram: the lower the bucket factor, the finer the detail. With a bucket factor of 1.1 you get really precise data, and it still performs better than the conventional histogram.

So by now we've managed to provide those standard signals along with the capability to process and display them, but in the observability domain there's always another challenge to face. No matter how easy we make the instrumentation, there are always users who don't want to make any changes. So we needed to add the last piece to our observability platform: the ability to collect data without any code changes. Thankfully, the evolving eBPF technology offers us many options. We went with DeepFlow, which is also a CNCF project.

In our use case, eBPF can essentially be seen as a network packet collector. We deploy the eBPF agent on our Kubernetes clusters to collect the network packets sent and received by applications, which are then reported to the eBPF server. The server converts them into logs, traces, and metrics according to the standard signal protocol for storage. This workflow looks pretty simple, but we still faced some problems when it came to using the data in the real world.

The biggest problem is how to link up the eBPF spans. In our practice, the eBPF agent doesn't modify any network calls, so there's no application instrumentation involved, meaning that the collected spans won't have trace IDs or span IDs. Instead, they come with information like TCP sequence numbers or syscall trace IDs. To display them like a normal trace, we typically need to start with a particular span, then search for related spans using attributes like the TCP sequence, syscall trace ID, and thread ID. We keep doing this recursively, pulling in additional related spans until no new spans can be found. In the end, we sort all the spans by time and by their TCP sequence relationships, among other criteria. The whole process is shown in the animation here. Overall, I think this idea is workable and serves as an implementation method when there are no trace IDs available. In our real-world usage, though, this method requires significantly more time for each query compared to traditional SDK instrumentation, especially when dealing with a high number of spans.
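To make that recursive search a little more concrete, here is a highly simplified Go sketch of the idea. It is not DeepFlow's actual implementation: the span fields, the matching rules, and the in-memory store are placeholder assumptions (in production the lookup is a query against ClickHouse).

```go
package main

import "sort"

// EBPFSpan stands in for a span captured without instrumentation; field names are illustrative.
type EBPFSpan struct {
	ID             string
	StartUnixNano  int64
	TCPSeq         uint32
	SyscallTraceID uint64
	ThreadID       uint32
}

// Store abstracts "find spans sharing one of these correlation keys".
type Store interface {
	Related(s EBPFSpan) []EBPFSpan
}

// assembleTrace starts from one span and keeps pulling in related spans until
// nothing new is found, then sorts the result by start time.
func assembleTrace(start EBPFSpan, store Store) []EBPFSpan {
	seen := map[string]bool{start.ID: true}
	trace := []EBPFSpan{start}
	queue := []EBPFSpan{start}

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range store.Related(cur) {
			if !seen[next.ID] {
				seen[next.ID] = true
				trace = append(trace, next)
				queue = append(queue, next)
			}
		}
	}

	sort.Slice(trace, func(i, j int) bool { return trace[i].StartUnixNano < trace[j].StartUnixNano })
	return trace
}

// memStore is a toy in-memory implementation for illustration.
type memStore struct{ spans []EBPFSpan }

func (m memStore) Related(s EBPFSpan) []EBPFSpan {
	var out []EBPFSpan
	for _, o := range m.spans {
		if o.ID == s.ID {
			continue
		}
		if o.TCPSeq == s.TCPSeq || o.SyscallTraceID == s.SyscallTraceID || o.ThreadID == s.ThreadID {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	store := memStore{spans: []EBPFSpan{
		{ID: "a", StartUnixNano: 1, TCPSeq: 100, SyscallTraceID: 7, ThreadID: 11},
		{ID: "b", StartUnixNano: 2, TCPSeq: 100, SyscallTraceID: 8, ThreadID: 12},
		{ID: "c", StartUnixNano: 3, TCPSeq: 200, SyscallTraceID: 8, ThreadID: 13},
	}}
	for _, s := range assembleTrace(store.spans[0], store) {
		println(s.ID) // a, b, c: linked via TCP sequence and syscall trace ID
	}
}
```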
As for the data collected by the eBPF agent, we need to store it in ClickHouse first, and it takes up a substantial amount of space. To reduce the disk usage, we need to perform data sampling. Typically, we locate spans with errors or long durations in ClickHouse and, using the recursive search mentioned earlier, we find the traces they belong to along with all the associated spans. Those spans are then stored in another ClickHouse instance, creating the final sampled data set for users to query. If you are familiar with tail-based sampling in distributed tracing, the concept here is quite similar. The only difference is that the eBPF method uses disk instead of memory, which incurs higher costs and offers lower performance. This operation sounds quite heavy, right? We think so too. We had believed that eBPF programs that do not modify any data are generally safer. That was our initial impression, and it's the reason why we chose DeepFlow instead of other eBPF projects that do instrument the application. In our case, we wanted to collect as much observability data as possible with the least amount of resources. After running this in production for a while, we felt it was necessary to reconsider whether it was worth spending so many storage and computing resources, or in other words, whether it was worth using eBPF for distributed tracing at all. Finally, after some evaluation, we removed the storage of trace data collected by eBPF. Once the spans are transformed into metrics and logs in memory, they are immediately discarded. This saves us the storage space, about 30% of the CPU resources, and about 70% of the memory, which is a significant cost reduction for us.

And now let's wrap things up. Our approach to correlating signals is based on the span metrics connector. We modified the OTel Collector to enable it to transform spans into both logs and metrics. In order to handle spans reported by different applications, we defined a protocol for span attributes and built numerous middlewares and interceptors for frameworks and client SDKs, which allows applications to report standardized spans. For spans that cannot be integrated with those middlewares, we also deployed the eBPF agent on their clusters to collect data, serving as a supplementary method to user-side integration. Currently, our eBPF pipeline handles over 600,000 spans per second in a single cluster. Although we have abandoned the ability to display traces from it, it still offers very meaningful support for metrics and logs. We believe that to fully explore the potential of eBPF, it's best to use eBPF programs that actually modify and instrument the application, such as Grafana Beyla. I remember that their latest release already supports many of the frameworks and clients that we need, and we hope to try it out in the future. But of course, we also have to consider the actual performance overhead.

So that's the story of our observability platform. As I said, we've made many mistakes, for example not using OTel's semantic attributes as our standard, and our work on eBPF is far from perfect. However, there were limitations at the time, and I hope you can look at what we went through and say, hey, let's avoid that when planning similar things. I hope the idea of the span metrics connector can also inspire you to build things beyond metrics and logs, because the trace span really carries a lot of information.

And that's it. If you are interested in our experience and would like to continue the conversation, you can find me on GitHub or Twitter. As English is not my first language, in case I misunderstand your question, I recommend scanning the QR code or visiting the link below, which will direct you to a Google form where you can write down your questions. Thanks again for joining my session. Much appreciated. Thank you.