Good morning, and sorry for the technical issue. Welcome to my talk about LinkedIn's observability data pipeline. Today I'm going to walk through the evolution of LinkedIn's observability data pipeline and how we leverage open source and open standards in this effort. First, a quick agenda: we'll start with the current state of the observability data pipeline, then the issues associated with it, then the renovation work we've been doing with open standards and open source tools, and finally the status and a summary.

A brief introduction of myself: my name is Chao. I've been working on the observability team at LinkedIn for the past few years, with an emphasis on the logging and tracing data pipelines. And yes, you can always find me on LinkedIn.

You may or may not have noticed that in the past few years we have entered an era of observability. For any company running a distributed system, observability has become the critical support for diagnosing, debugging, tracking, and detecting bottlenecks. So what is observability? There are many definitions. One of them is that it provides insight into a service's runtime state or properties in a distributed system. What does that really mean? It means that high observability requirements make it necessary to provide sufficient context about the system by combining a few different key types of information, so that you can gain insight into the system. The typical data types used for this are often called the pillars of observability: metrics, traces, and logs are the most commonly used.

So what does observability look like at LinkedIn? LinkedIn is no exception. For years we have built observability data pipelines to collect different kinds of insight into our distributed systems. We have logs. We have graphs, which we call inGraphs. We have trace information that we can explore through an application called Call Tree. We have a lot of other tools built around this data. However, since these systems were designed, evolved, and maintained by different teams over many years, most of the tools built around this data are isolated and cannot interoperate across databases to get more insight out of the data. To meet the ever-increasing data traffic requirements, LinkedIn has started several projects in the past few years to improve this situation.

Let's first look at what the previous, or existing, data pipelines look like, starting with tracing. This pipeline has been around for almost ten years. An application emits a message called a service call event into Kafka, and that data is eventually loaded into Hadoop for ingestion. There are applications built around Hadoop for offline exploration and analytics. The pipeline was developed many years ago, when the scale of LinkedIn's data was still very small, so it quickly reached its limits. We had to cope by cutting down the traffic and sampling the data, which limited the capability of this pipeline. Another issue is that we use HDFS for data storage, and HDFS is primarily designed to support offline processing. Therefore, some runtime issues cannot be analyzed immediately, which is a big limitation.
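Before moving on to logging, here is a minimal sketch of what the emission side of such a pipeline might look like: an application publishing a service call event to Kafka, from which the data is later loaded into Hadoop. The topic name, event fields, and JSON encoding here are illustrative assumptions, not LinkedIn's actual schema.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ServiceCallEventEmitter {
    public static void main(String[] args) {
        // Standard Kafka producer configuration; broker address is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical service call event: caller, callee, latency, status.
            String event = "{\"caller\":\"frontend\",\"callee\":\"profile-service\","
                    + "\"latencyMs\":42,\"status\":200}";
            // Publish to a hypothetical topic; downstream jobs would ETL
            // these records from Kafka into Hadoop for offline analysis.
            producer.send(new ProducerRecord<>("service-call-events", event));
        }
    }
}
```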
Logging is quite a different story. Many years ago we had an initiative to collect log data from different boxes and pull it into a central place for querying and analysis. That project was built on an external tool, and it ended up failing, because the scale that tool supported could not meet LinkedIn's requirements; our data pipeline requirements were much, much larger. So for years, our engineers and SREs had to log into individual boxes or virtual machines just to grep for log messages. This situation was not resolved until recently, when we started the InLog data pipeline project. Basically, we built our own pipeline to ingest log messages: an InLog event is emitted from the application, sent to Kafka, and ultimately lands in Kusto, also known as Azure Data Explorer. Kusto provides much better ingestion and query capability and makes it very convenient for engineers to get at log messages, which is a great improvement. There are some limitations in this phase-one project, though. One is that we did not provision the full capacity to accommodate all log messages, which means only warning and error messages are ingested; this somewhat limits the capability of the whole pipeline. The other limitation is that only Java applications can emit the data, which is another thing we would like to improve.

The metrics data pipeline is not the focus of my talk today, so I'll just quickly touch on it and tell a very brief story. We have a similar pipeline: a service metrics event is emitted from the application, either directly to Kafka or through a service called AMF to Kafka. The data is then written to TSDS, a time-series data store. We have a series of tools built around it to visualize the data, generate alerts, and run various analytics.

So let me first summarize the issues with our existing data pipelines, and then how we want to improve them. The first issue, obviously, is that we have duplicated pipelines with similar components but different implementations, which causes a lot of maintenance burden; in addition, the tools built around the data are isolated, so we cannot use the same tool to access different types of data. The second issue is that at LinkedIn, Java is the only language that is fully supported across these pipelines; other languages, like Python, Go, and C++, generally have no instrumentation capability, so they cannot leverage the pipelines. The third issue, as I mentioned earlier, is capacity: for logging we only allow warning and error messages to be emitted, and for tracing we have to sample because of the same limitation.

To address these issues, LinkedIn decided to renovate the data pipeline and consolidate some of the components. Here is the proposed solution. We started by building a new data pipeline that can serve both tracing and logging. Kusto is again used as the data storage, which, as I mentioned earlier, provides efficient querying as well as ingestion.
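To give a feel for the query side of this storage choice, here is a minimal sketch of querying Kusto from Java, assuming the Azure Kusto Java client (kusto-data). The cluster URL, credentials, database, table, and column names are placeholders, and the exact builder methods vary across SDK versions.

```java
import com.microsoft.azure.kusto.data.Client;
import com.microsoft.azure.kusto.data.ClientFactory;
import com.microsoft.azure.kusto.data.KustoOperationResult;
import com.microsoft.azure.kusto.data.KustoResultSetTable;
import com.microsoft.azure.kusto.data.auth.ConnectionStringBuilder;

public class LogQueryExample {
    public static void main(String[] args) throws Exception {
        // Authenticate with an AAD application; all values are placeholders.
        ConnectionStringBuilder csb = ConnectionStringBuilder.createWithAadApplicationCredentials(
                "https://mycluster.kusto.windows.net", "appId", "appKey", "tenantId");
        Client client = ClientFactory.createClient(csb);

        // Pull the ten most recent error logs from a hypothetical Logs table.
        KustoOperationResult result = client.execute("observability",
                "Logs | where Level == 'ERROR' | top 10 by Timestamp desc");

        KustoResultSetTable rows = result.getPrimaryResults();
        while (rows.next()) {
            System.out.println(rows.getString("Message"));
        }
    }
}
```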
Another part is that we introduced an open standard, OpenTelemetry, into our system to help with instrumenting applications. OpenTelemetry makes it easy to instrument applications in different languages, which addresses one of the big issues we've been seeing. The third part is that we introduced a new component called the observability agent. The observability agent handles some of the common functionality, like data collection, conversion, and encoding; for this we actually use Fluent Bit. I'll go into these changes in the next few slides.

Starting with OpenTelemetry. I'm sure everyone here is already familiar with the OpenTelemetry standard. We leverage OpenTelemetry for both the tracing schema and some of the SDKs. With OpenTelemetry we can easily instrument applications in multiple languages, which is a big advantage. OpenTelemetry also offers a lot of choices of external tools and utilities that we can leverage directly. I think Jaeger was mentioned earlier; we actually evaluated that tool, and there are other tools we looked into in the past as well. If you look at the OpenTelemetry data pipeline, there is an important component called the OTel Collector sitting in the middle. The OTel Collector receives, processes, and exports telemetry data to its backend. At LinkedIn we also need this type of agent; we call it the observability agent. This agent not only supports the conventional OTel functionality, we also want it to support other functionality like data conversion, filtering, and monitoring. The observability agent can support clients in different languages because it is a standalone process that sits alongside the application, so the application's language really doesn't matter anymore. Another advantage, probably not obvious to most people, is that a separate process gives us the resource and operational isolation we were looking for: if something goes wrong with the network, or with the collector, the main application is largely unaffected.

As I mentioned earlier, Fluent Bit was chosen as the observability agent. This decision surprised a lot of people, because the OTel project already provides several implementations of the collector, and we didn't choose those; we made the decision even before we became aware that Fluent Bit would support OpenTelemetry. So why did we make this decision? There are several reasons. First, as you saw in the past few slides, Kafka is our chosen data backbone, so all output has to go through Kafka, and Fluent Bit provides a lot of plugins to support different customizations, including Kafka; that gave us a big advantage, since with Fluent Bit we get this capability directly. Second, log streaming is a first-class citizen in Fluent Bit; we were doing the InLog improvement and needed logging support immediately, and when we started this project, OpenTelemetry logging was still in an early alpha phase, which is another reason we didn't choose the OTel Collector. Third, Fluent Bit is written in C, which is resource efficient, with high performance and built-in resilience guarantees. This is very important when deploying to thousands, tens of thousands, or hundreds of thousands of boxes, because any efficiency gain is multiplied by the number of nodes we deploy to. Last but not least, Fluent Bit has been used by a lot of different companies, including LinkedIn already, which makes people comfortable adopting a proven solution. Based on all these reasons, we chose Fluent Bit as our observability agent.
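To make concrete what the client side of this setup looks like, here is a minimal sketch of a Java application instrumented with the OpenTelemetry SDK, exporting spans over OTLP/HTTP to an agent running next to it. The localhost:4318 endpoint is the conventional OTLP/HTTP port, an assumption for illustration rather than LinkedIn's actual agent address.

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class TracedApp {
    public static void main(String[] args) {
        // Export spans over OTLP/HTTP (protobuf) to a local agent process.
        OtlpHttpSpanExporter exporter = OtlpHttpSpanExporter.builder()
                .setEndpoint("http://localhost:4318/v1/traces")
                .build();

        // Batch spans in the SDK before handing them to the exporter.
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .build();

        Tracer tracer = sdk.getTracer("example-instrumentation");
        Span span = tracer.spanBuilder("handle-request").startSpan();
        try {
            // ... application work happens here ...
        } finally {
            span.end();
        }
        tracerProvider.shutdown(); // flush pending spans before exit
    }
}
```

Because the agent is a separate process, this same exporter setup works regardless of what the agent is implemented in, which is exactly the isolation property described above.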
Obviously, we could not use Fluent Bit out of the box, because it was not an OTel agent yet, so we had to make a bunch of enhancements. The first enhancement we added was OpenTelemetry protocol support: the OTel exporter needs to talk to an agent that understands its language, and the OpenTelemetry protocol (OTLP) is the standard. So we added OTLP-over-HTTP support with Protobuf encoding; that was the first enhancement.

Next, we enhanced the metrics emission mechanism inside Fluent Bit. Fluent Bit already has a built-in monitoring mechanism, supported through external tools like Prometheus, which I think was mentioned earlier. But that is a pull-based method, as used by Prometheus, and since we are deploying to a very large number of nodes, pulling does not work well for our use case. We wanted to adapt this metrics monitoring mechanism to LinkedIn's own metrics collection flow, so we added an adapter. The adapter converts the metrics information into LinkedIn's own format and emits it to a central place, so that our own metrics pipeline can collect the data and visualize the results, which also supports alerting and other monitoring mechanisms. A rough sketch of this adapter idea appears after the deployment discussion below.

Another thing we enhanced for LinkedIn is tag management. As most people probably know, tags are used by Fluent Bit for stream management. Because we now have a more complicated data pipeline, we need stronger tag functionality so that we can pass some information directly from the client, that is, the exporter, to Fluent Bit without decoding the data. That was done through the tag management work.

One issue that hasn't been noticed by many people, but caused a lot of discussion inside the company, is what deployment model we should use for Fluent Bit. There are two choices we considered. One is per-box deployment: each box, or compute node, runs one copy of Fluent Bit that collects information. The other is per-instance deployment: each application instance has its own copy of Fluent Bit alongside it. There are pros and cons to each approach, which is why this caused so much discussion. Per-instance has the advantage of better resource and error isolation; it avoids the noisy-neighbor problem we have been seeing in our on-prem data centers, and it makes configuration and monitoring easier. Obviously, when multiple application instances run on the same host, the per-box deployment model also brings a port management issue. On the other hand, per-box has the advantage of lower resource usage and fewer agent instances to manage. After a lot of discussion, the per-instance deployment model was preferred, and we would like to go in that direction for better isolation and simplified configuration and monitoring. But not every tool or environment supports this, so we ended up with a hybrid solution: whenever possible, we deploy Fluent Bit per instance; for example, on Kubernetes we have sidecar support, so we use a per-pod deployment model. For some other environments at LinkedIn, we also support per-box deployment.
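Going back to the metrics enhancement for a moment, here is a rough sketch of the push-style adapter idea: instead of being scraped, the agent's internal metrics are read locally and forwarded to a central pipeline. It assumes Fluent Bit's built-in HTTP monitoring endpoint (port 2020 by default when the HTTP server is enabled); the conversion step and destination are hypothetical stand-ins for LinkedIn's internal format and pipeline, not the actual implementation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FluentBitMetricsAdapter {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Fluent Bit exposes internal metrics over its monitoring HTTP server.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:2020/api/v1/metrics/prometheus"))
                .GET()
                .build();

        // Scrape the local agent once; a real adapter would loop on a timer.
        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());

        for (String line : response.body().split("\n")) {
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments
            // Each remaining line is "metric_name{labels} value". A real
            // adapter would convert it to the internal metrics format and
            // publish it to the central pipeline instead of printing it.
            System.out.println(line);
        }
    }
}
```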
So with all these enhancements, where are we now with this project? The enhanced Fluent Bit agent has already been rolled out to a lot of boxes in the company, and we are in the process of rolling it out to all of them. Currently it is not fully enabled in terms of configuration, but it is sitting there waiting for traffic from the different applications. For the tracing project, we have chosen some pilot applications that have already started to emit their tracing information, and we are collecting the data. For logging, as I mentioned, the previous phase already enabled Java; now we have some selected Python applications starting to emit log messages.

Okay, a quick summary of today's talk. I introduced the consolidated data pipeline at LinkedIn that supports both logging and tracing. There is other work my team has been doing as well that I haven't had a chance to discuss; it is equally important, and hopefully in the future I can introduce more of it. This effort includes data partitioning for better correlation, queries, and visualization; we are going to introduce a new visualization tool to handle the distributed data. We are also in the design phase of data sampling, and query-time sampling as well, mostly for query efficiency and to lower cost. And we are always trying to optimize the different components to make them more efficient and run more smoothly. With that, that's all the talk I have. I'm open to any questions. Thank you. Let's see if we have time for questions. Yes, one question. Anyone would like to ask?

Audience: Hi, you said you've made enhancements to Fluent Bit. Will you open a PR with those enhancements and send them back to the project, or keep them in-house?

I'm sorry, could you repeat the question?

Audience: You said you've made enhancements, if I understood correctly, to Fluent Bit to work with your own data, I suppose. Will you open a PR and send the enhancements back to the project, or will you keep them in-house?

Well, actually, there are a couple of different enhancements. Some of them we are going to contribute back to the open source community for everyone to use, like some of the general enhancements, bug fixes, and some of the tag management. But some other parts are specific to LinkedIn only, like the internal data conversion; nobody outside would need that, so we are not going to send those back to open source.

Great. Yeah, and I think it's time for a break. So thank you very much again. Thank you.