Hello, I'm JBD. I'm currently working on the instrumentation team at Google. I used to work on the Go team, where I was specifically focusing on diagnostic tools. Today I will talk about distributed systems observability at Google.

First, how many of you have heard about observability? Okay, that's a lot. A lot of fans. Since there are many conflicting definitions of this term, I just want to clarify my definition first. What we call observability is a holistic approach to observing systems for reliability, performance, deployability, and other such properties. We look at multiple different signals in order to achieve that: metric collection, distributed tracing, profiling, and logging are a few of them.

This talk is going to be more about the motivation and the concepts we came up with over the last almost ten years to make Google's production systems more observable. I will also briefly cover what we do for Go services. I said signals: I'm not going to favor one signal type over another, but rather focus on how we collect things and why we collect them the way we do. This talk will mention a lot of metrics, traces, and profiles, but don't assume that these are the only signal types we care about.

To give you a little bit of history, Google is predominantly a distributed systems company. One of the most common architectural patterns we use is microservices. We have thousands of different microservices built and maintained by hundreds of different teams, and being able to observe our systems is a fundamental reason why Google is reliable, fast, and user friendly. In order to observe our systems, we obviously need to care about instrumenting them. We invented collection mechanisms and export formats, as well as some philosophies in this area, to achieve that at our scale. Our instrumentation stack cares about the efficiency and overhead of collection. Observability is a core part of our engineering culture, and we enable it by making it easy and low overhead.

Before digging more into distributed systems observability, I will explain why it's a little more complicated to observe a distributed system. This is a typical architectural diagram for pretty much every product at Google. We usually have a user-facing, business-logic-heavy front-end server that depends on various other services. In this case, you see that auth, billing, and reporting are the immediate services the front-end server depends on. In this example, all of these relatively low-level services depend on Spanner and eventually hit the Blob Store for persistence.

In any microservices architecture, it is expected that some services become a common dependency, like Blob Storage or Spanner. When the rest of the company depends on Blob Storage, it is harder for that team to gather meaningful metrics, profiles, and so on. It's hard for them to tell the root cause of problems triggered by their users. The Blob Storage team will see some fluctuations in their dashboards but may have a hard time actually breaking down the data and figuring out where the problem originated. It's also worth mentioning that this matters not only when things are obviously going wrong: infrastructure teams often want answers just to be able to say that things are going right.
Some of the example questions they want to ask about their systems are: as the Blob Storage team, are we meeting our SLO for the Spanner team? Are we providing them a good enough service, the service we promised to serve? What is the impact of this high-level service on the Blob Storage service? What happens if this product grows 10% overnight, is the Blob Storage deployment going to be able to scale, and what are the next steps for us?

This is why we want to be able to break down our signals in various different ways. We call these different ways dimensions, and with dimensions you can query the collected data in ways that help you answer some of these questions. For example, you can say: give me the Blob Storage request latency distribution for RPCs that originated at the Google Analytics front-end server. Or: give me all the traces and reports that contain this specific RPC. Or: give me the CPU profile for this specific library, but only for the cost we observed for RPCs that started at Google Analytics.

So it's great that we can query the data this way, but how do we actually collect the signals so that we can break down the diagnostics data by multiple dimensions? The answer is that we record the data with various key-value pairs. We call these key-value pairs tags at Google, and then the backend, for example a monitoring backend such as Prometheus, can filter the data by tags.

The entire promise of microservices is that there should be no tight coupling between different services. But then how can a low-level service such as Blob Storage tag correctly if it doesn't know anything about its dependents and their business cases? This is where we get some help from the world of context propagation. Tags are produced at high-level services, such as the analytics front-end server, and then passed all along to the low-level services as part of the RPC. From the top of the stack all the way down to the bottom, the RPCs are tagged. Blob Storage doesn't know anything about its callers; it just records with the incoming tags, so the data we collect at the Blob Store has all these dimensions. We have a culture of producing tags at the high-level services, depending on the specific requirements of the teams, and we propagate these tags all the way down with RPCs. Each component in the system can then record metrics, profiles, and so on with the incoming tags.

As I mentioned in the beginning, we have a holistic approach, because each signal type is useful for answering a different question. For example, distributed traces cannot tell you about CPU hotspots, and CPU samples cannot tell us about end-to-end latency problems. So we collect various signals and examine the problem from various different perspectives.

We quickly realized that it's very hard for our developers to think about all these dimensions and signal types, build highly efficient instrumentation libraries, and instrument each layer they depend on. That's why we built a common framework, decided to open source it, and made it vendor agnostic, so that everybody can use it against any provider. Recently we announced that project: OpenCensus. This is a new holistic instrumentation framework.
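To make the tag-propagation idea concrete before I get into the framework itself, here is a minimal sketch using the OpenCensus Go tag package. The tag key and the service names are made up for this example; in a real deployment the RPC framework carries the tag map across process boundaries for you, so the low-level service simply reads whatever arrived in its context.

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/tag"
)

// originatorKey is a hypothetical tag key identifying the high-level
// service where a request entered the system.
var originatorKey = tag.MustNewKey("originator")

// atFrontend stands in for a high-level service (e.g. the analytics
// front end): it produces the tag and attaches it to the context.
func atFrontend(ctx context.Context) context.Context {
	ctx, err := tag.New(ctx, tag.Upsert(originatorKey, "analytics-frontend"))
	if err != nil {
		log.Fatal(err)
	}
	return ctx
}

// atBlobStorage stands in for a low-level service: it knows nothing about
// its callers and just reads whatever tags arrived with the request.
func atBlobStorage(ctx context.Context) {
	if v, ok := tag.FromContext(ctx).Value(originatorKey); ok {
		log.Printf("recording Blob Storage metrics with originator=%q", v)
	}
}

func main() {
	// In a real system the tag map crosses process boundaries inside the
	// RPC metadata; here we just pass the context along in one process.
	atBlobStorage(atFrontend(context.Background()))
}
```

The important part is that the low-level service never needs to know who its callers are; the dimensions come in with the request.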
OpenCensus is inspired by Google's internal Census project. The main reason we are open sourcing it is that we want to fill that missing building block in the open source world. We want libraries, frameworks, and all sorts of infrastructure projects to be able to instrument without having to depend on a vendor and without having to reinvent these concepts. We also want to help other organizations adopt these solutions, because we have already built them, or you can use OpenCensus as a reference implementation.

OpenCensus provides a single set of libraries. Currently we have a tag library, metrics, and traces, and we will have more in the future. We have language support for Go, Java, and C++ right now, and more languages are coming: Python, PHP, JavaScript, and Erlang are next. Our libraries are vendor agnostic and can upload data to any backend. We have support for Prometheus, Zipkin, Jaeger, and some APM vendors, and some of the APM vendors actually want to use these libraries rather than inventing their own. We provide out-of-the-box instrumentation for some frameworks, such as gRPC, and for common HTTP libraries like the net/http package. Our libraries also provide some introspection and can render a tiny dashboard to report usage from a single process. Without having to rely on an actual external service, you can see what is going on in a single process. It is extremely useful if you know the problem is coming from one specific process, or you can use it during development.

Speaking of framework integrations, I'm going to show a few snippets from our gRPC integration. At Google, we are also responsible for our internal gRPC stack's observability, so these integrations will be used internally all across Google too. You need to import our plugin and pass it as a stats handler to the gRPC clients and servers. In this case we are looking at a server, but it's pretty much the same for clients. Then, in the handler, you can extend the tags in the incoming context. In this case, I'm inserting the originator service and the user ID, so it will be possible to break down all the collected data, even at the very low-level services, by originator and user ID.

This is how you record values. I have a measure here, total hello, that represents the number of times we said hello. stats.Record will record a one with the tags in the current incoming context, so you will be able to tell the number of hellos by originator service and by specific user. Then in your dashboard it looks like this: you break down the data by dimension. The baby blue here represents the total number of hellos from RPCs originated at the auth service, the purple one is coming from billing, and the two other colors represent the other originators.

The gRPC plugin also automatically creates traces for incoming and outgoing RPCs, but you can also add custom spans using the trace package. Here we are creating a custom child span and finishing it. You can create as many spans as you want and annotate them; just propagate the context, and whoever starts new spans from that context will attach to the existing trace.
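Putting those snippets together, here is a rough, self-contained sketch of what the server side might look like in Go with the ocgrpc plugin: the plugin goes in as the stats handler, the handler body extends the incoming tags, records a measure, and opens a custom child span. The measure name, tag keys, and service wiring are invented for the example, and you would still need to register a view for the measure before anything shows up on a dashboard (more on that in a moment).

```go
package main

import (
	"context"
	"log"
	"net"

	"go.opencensus.io/plugin/ocgrpc"
	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
	"go.opencensus.io/trace"
	"google.golang.org/grpc"
)

// Hypothetical tag keys and measure, mirroring the talk's example.
var (
	originatorKey = tag.MustNewKey("originator")
	userIDKey     = tag.MustNewKey("user_id")
	totalHello    = stats.Int64("total_hello", "Number of times we said hello", stats.UnitDimensionless)
)

// sayHello stands in for a generated gRPC method body.
func sayHello(ctx context.Context, userID string) error {
	// Extend the tags that arrived with the incoming RPC context.
	ctx, err := tag.New(ctx,
		tag.Upsert(originatorKey, "hello-service"),
		tag.Upsert(userIDKey, userID),
	)
	if err != nil {
		return err
	}

	// Record the measure; the tags in ctx become the dimensions.
	stats.Record(ctx, totalHello.M(1))

	// Add a custom child span to the trace the plugin already started.
	_, span := trace.StartSpan(ctx, "sayHello/greeting")
	defer span.End()
	span.Annotate(nil, "said hello")
	return nil
}

func main() {
	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	// Pass the OpenCensus plugin as the stats handler; it records RPC
	// metrics and creates server spans for every incoming request.
	// The client side is similar: grpc.WithStatsHandler(&ocgrpc.ClientHandler{}).
	srv := grpc.NewServer(grpc.StatsHandler(&ocgrpc.ServerHandler{}))
	// ... register your generated service implementation on srv here ...
	log.Fatal(srv.Serve(lis))
}
```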
So here's an example of the output: all the traces collected for an RPC. You can see the internal RPCs made in order to satisfy the original incoming request.

OpenCensus also provides support for pprof. If you use our tag package, you can collect CPU samples with the tags inside the incoming context, and then you can see the hotspots for specific requests, RPCs, and whatever else you have put in your tags. This is a gRPC server I profiled with OpenCensus, and we are looking at a typical visualization of pprof data. You can see that runtime.concatstrings spent 9.43 seconds for RPCs coming from the authentication service and 3.20 seconds for the RPCs coming from the analytics service.

Let me focus on some of the core principles we have. One of our goals is to enable as much instrumentation as possible without our engineers thinking too much about the cost. That's why we have this separation between instrumentation and collection. Instrumentation is actually very cheap if you don't have to collect, so rather than collecting all the metrics, we defer to the user to turn on collection. The instrumentation bits are always there and have almost zero impact on the critical path, and then the end user can decide what to collect. This allows libraries to instrument without thinking too much about it and just provide their measures, and the end user can enable collection. Collection requires explicit enabling, and the same is true for disabling, which allows us to dynamically enable and disable collection in production. For example, imagine a compression library that is instrumented to measure the compressed chunks. Until you're suspicious about this library, you may never need metrics coming from this measure, but when you do, you can dynamically enable it in production and start receiving metrics (there's a small sketch of what this looks like below). Observing becomes very easy when you have a static list of things to observe, but systems are always surprising you; you cannot really predict what you should observe. That's why we encourage a model where you can dynamically expand the collection.

We sample expensive and large data; anything that is cheap to collect and aggregatable doesn't have to be sampled. Examples of sampled signals are traces, because they are large, and profiles, because they are expensive. On the other hand, we may aggregate in efficient ways and produce cheap, small data to avoid sampling. This is what we do for metric collection, for example, so we don't have to sample metrics at all. That is great, because then you can see your 90th and 99th percentiles. I said that data size can be a reason why we aggregate or sample; another consideration is that we want to limit the outbound bandwidth spent on data collection. So for signals that are aggregatable, like metrics, we try to aggregate them inside the process, or in an agent living near the process, to reduce bandwidth.

At Google, we nowadays try to use the same instrumentation libraries to provide black-box instrumentation. For example, a trace can be started at a load balancer, and then our engineers can use the same libraries to add more spans to that trace. Similarly, our microservices framework, gRPC, provides tags and a core set of instrumentation out of the box.
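Going back to the separation between instrumentation and collection, here is a small sketch of what the compression-library example might look like with OpenCensus views in Go. The measure name, view name, and bucket boundaries are invented for illustration; the point is that the library records unconditionally, and nothing is actually collected until someone registers a view, which can also be unregistered again.

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// A hypothetical measure a compression library might export. Recording it
// costs almost nothing while no view subscribes to it.
var compressedBytes = stats.Int64(
	"example.com/compress/compressed_bytes",
	"Size of compressed chunks",
	stats.UnitBytes,
)

// compress stands in for the library code: it always instruments,
// without knowing or caring whether anyone is collecting.
func compress(ctx context.Context, chunk []byte) []byte {
	out := chunk // real compression elided
	stats.Record(ctx, compressedBytes.M(int64(len(out))))
	return out
}

func main() {
	ctx := context.Background()
	compress(ctx, []byte("not collected yet")) // no view registered: dropped

	// The end user (not the library) decides to start collecting, for
	// example when they become suspicious of the library in production.
	v := &view.View{
		Name:        "example.com/compress/compressed_bytes_distribution",
		Description: "Distribution of compressed chunk sizes",
		Measure:     compressedBytes,
		Aggregation: view.Distribution(256, 1024, 4096, 16384),
	}
	if err := view.Register(v); err != nil {
		log.Fatal(err)
	}
	compress(ctx, []byte("this one is aggregated into the view"))

	// Collection can be turned off again just as dynamically.
	view.Unregister(v)
}
```

That is the dynamic enable/disable model: the library author only defines measures, and the decision about what to aggregate and export lives with whoever operates the service.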
So it's easy for our engineers to leverage what is already there and build on top of it, rather than thinking about instrumentation from scratch.

One of the useful tools we have at Google is the introspection pages served from the servers themselves. You can think of these introspection pages as a small backend that collects and visualizes data. We don't have great CSS right now, because distributed systems engineers don't know much CSS, but you can contribute. It's a really useful tool for understanding what is happening in a process without depending on a backend, and, as I mentioned, it's great for development time. You see the traces page here: there's a small dashboard that displays the spans with different names and shows the latency distribution. You can see the details for ten sample traces for each bucket, plus the errored ones. It keeps just a small sample in memory, and you can dig into the details to understand what is wrong.

To summarize: we have a holistic approach and use multiple signal types. Tags allow us to break down our data by dimension, so each team can produce them and pass them to the low-level services as part of their RPCs. We instrument our core frameworks, load balancers, and service meshes out of the box, so users automatically get a lot of out-of-the-box instrumentation, and then they can use the same libraries to add fine-grained details, just like in the gRPC server handler where I was creating custom spans. Our instrumentation libraries are optimized to be low overhead and low cost, which makes it easier for libraries and frameworks to instrument without thinking too much about the cost. I can say that once you adopt these concepts and put them in place, it gives you a good foundation layer.

OpenCensus is already available today and is vendor agnostic. We already support Prometheus, Jaeger, Zipkin, and more, and more exporters are coming soon. So I really highly encourage you to take a look, give us feedback, and contribute. And that's all. Thank you.