Hello, everybody. I'm JBD. I work at Google on our instrumentation team. My main focus is actually instrumenting Go servers, and I was previously working on the Go team, mainly on Go's diagnostic tools. Today I will talk about distributed systems observability at Google.

So how many of you have heard about observability? I was expecting more in this audience. Since there are so many conflicting definitions of this term, I want to clarify my definition first. What we call observability is a holistic approach to being able to observe a system for properties such as reliability, performance, deployability, and so on. We look at multiple different signals in order to achieve that: metric collection, distributed traces, profiles, and logs are a few of those. This talk is mainly about the motivation and the core concepts we came up with in recent years to make Go production systems more observable. So, I said signals. I'm not going to favor one signal type over another, but rather focus on how we collect signals and why we collect them the way we do. This talk is going to mention metrics, traces, and profiles a lot, but don't assume that these are the only signals we care about.

To give you a little bit of history, Google is dominantly a distributed systems company. One of the most common architectural patterns we use is the microservices architecture. We have 10,000 different microservices contributed and maintained by hundreds of different teams, and being able to observe our systems is a fundamental reason why Google is reliable, fast, and user friendly. In order to be able to observe our systems, we obviously care about instrumenting them. We invented some collection methodologies and export formats, as well as entirely new philosophies in this area. Our instrumentation stack cares about efficiency and the overall overhead of collection. I would say that observability is a part of our engineering culture, and we enable it by making it easy and low overhead.

Before digging more into distributed systems observability, I want to briefly explain why it is different from observing monolithic systems. This is a typical architectural diagram for pretty much every product at Google. We usually have a user-facing, business-logic-heavy front-end server that depends on various other services; authentication, billing, and reporting are some of the examples here. In this example, all of these relatively low-level services depend on Spanner, our database, eventually hitting the blob storage service for persistence. In a microservices architecture, it's very much expected that some services become such a common dependency. When the rest of the company depends on blob storage, it's harder for that team to gather meaningful metrics, traces, profiles, et cetera. It's hard for them to tell the root cause of problems triggered by their users. The blob storage team will see some fluctuations in their dashboards, but will have a very hard time breaking down the data and figuring out where the problem actually originated. And it's not only when things are obviously going wrong: infrastructure teams always want to have answers just to be able to say that things are going right. Some examples of the questions they need answers for: hey, are we meeting the SLO for the Spanner team? Are we providing them the service we promised to serve?
As the blob storage team, you need to be able to tell: what is the impact of this high-level service on this low-level blob storage service? And what happens if this particular product scales up 10 percent overnight? Is the blob storage deployment going to be able to handle the new scale?

This is why we want to be able to break down our signals in various different ways. We call these different ways dimensions. With dimensions you can query the collected data in ways that help you answer some of the earlier questions I mentioned. Give me the blob storage request latency distribution for RPCs originated at, say, the Google Analytics front-end server. Or give me the traces and reports that contain a specific RPC method. Or give me the CPU profile for this library for the RPCs originated at Google Analytics.

So it's great that we can query this data, but how do we really collect signals in order to be able to query and break them down this way? The answer is that we record the data with various key-value pairs. We call these key-value pairs tags, and then the backend, for example a metric collection backend such as Prometheus, can filter data by tags. But there's something confusing here: the entire promise of microservices is that you have no tight coupling between services. How can a low-level service such as blob storage tag data with the right things if it knows nothing about its dependents and their business cases? This is where we get help from context propagation. The tags are actually produced at the high-level services and passed down to the lower-level stack as part of the RPC. You can see in the diagram that the RPCs are tagged all the way down the stack, so blob storage doesn't have to know anything; it can simply record the signals with the incoming tags. We have a culture of producing these tags at the high-level services, depending on the specific requirements of the teams, and we propagate these tags all across the stack as part of the RPCs. Each component in the system can record metrics, profiles, and so on with the incoming tags.

As I mentioned in the beginning, we see observability as a holistic approach because each signal type is useful for answering different questions. For example, distributed traces are not going to tell you about CPU hotspots, and CPU samples cannot tell us about the overall end-to-end latency. So we collect various signals, examine them from very different perspectives, and break them down with the tags. It's impossible for developers to think about all these dimensions and signal types, build highly efficient instrumentation libraries, and instrument each layer they depend on. That's why we built a common framework and decided to open source it and make it vendor agnostic, so anybody can use it with any provider. Recently we announced our project OpenCensus, which is a holistic instrumentation framework. It is inspired by Google's internal project called Census. The main reason we are open sourcing this is that we want to fill that missing building block in the open source world. We want libraries, frameworks, and all sorts of infrastructure projects to be able to adopt this instrumentation without having to reinvent these concepts. We also want to help other organizations adopt these solutions, and even if they don't adopt them directly, they can use OpenCensus as a reference.
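To make the tagging and context-propagation idea concrete, here is a minimal sketch of how a high-level service might attach tags that lower-level services later record against. It assumes the OpenCensus Go tag package (go.opencensus.io/tag); the key names, the "analytics-frontend" originator, and callBlobStorage are purely illustrative, and exact import paths and signatures may differ across versions.

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/tag"
)

// Tag keys are declared once and shared by every layer that records data.
var (
	keyOriginator, _ = tag.NewKey("originator_service")
	keyUserID, _     = tag.NewKey("user_id")
)

// handleFrontendRequest runs in the high-level service: it produces the tags
// and passes them down via the context.
func handleFrontendRequest(ctx context.Context, userID string) {
	ctx, err := tag.New(ctx,
		tag.Insert(keyOriginator, "analytics-frontend"),
		tag.Insert(keyUserID, userID),
	)
	if err != nil {
		log.Println("tagging failed:", err)
	}
	// The RPC plugins propagate the tag map on the wire, so a low-level
	// service like blob storage can record metrics broken down by these
	// tags without knowing anything about analytics.
	callBlobStorage(ctx)
}

func callBlobStorage(ctx context.Context) { /* outgoing RPC carries the tags */ }

func main() {
	handleFrontendRequest(context.Background(), "user-42")
}
```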
So OpenCensus provides a single set of libraries. We have tags, metrics, and traces, and more is coming in the future. Language support is available today for Go, Java, and C++; Python, PHP, JavaScript, C#, and Erlang are coming next. Our instrumentation libraries are vendor agnostic, so you can upload data to any backend. We currently have support for Prometheus, Zipkin, Jaeger, and some APM vendors. Some other APM vendors also think this is a more useful approach to instrumentation than inventing their own instrumentation libraries, so they're working to provide OpenCensus support. We provide out-of-the-box integration with some frameworks, such as gRPC and the HTTP packages. The libraries also provide introspection and can render a tiny dashboard from the process that contains a summary of what is going on in that process. Without having to rely on an external service, you can see what is going on in the scope of a server. It's very useful when you know that the problem is at a specific process, or during development time, to see what is going on.

Speaking of framework integrations, I just want to show briefly what this looks like for gRPC. At Google, we're also responsible for gRPC's observability, and these integrations are what we are planning to use internally at Google. For now, you need to import the plugin and pass it as a stats handler to the gRPC clients and servers. In this case, we're looking at the gRPC server; you can see the new server stats handler. In the handler, you can extend the incoming tags from the current context. In this case, I'm inserting "hello" as the originator service and inserting the user ID as well. Then it will be possible for the backends to break down the collected data by originator service and user ID.

This is how to record values. I have a measure, total hello, that represents the number of times we said hello. stats.Record will record one with the tags in the current incoming context, so you will be able to tell the number of hellos per originator service or per specific user. Then this is how it looks in your dashboard: you can break down the data by dimensions. In this case, the baby blue one is the total number of hellos from the RPCs originated at the auth service, the purple one is the ones coming from billing, and the other two colors represent other services.

The gRPC plugin also automatically creates traces for the incoming and outgoing RPCs, but you can also add custom spans by using our trace package. Here, we are creating a custom child span and finishing it. You can create as many as you want and annotate them. Just propagate the context, and any new spans started will be direct children of the current span in the current context. Here is an example of the traces collected from an RPC; you can see the internal RPC that is made in order to satisfy the original incoming request.

OpenCensus also provides pprof support. If you use tag.Do, we actually collect CPU samples with the tags inside the incoming context. Then you can see the hotspots in your code for specific requests, RPC names, and so on, with the dimensions you have defined as tags. This is the gRPC server profile with OpenCensus; we are looking at the typical visualization of pprof data. You can see that runtime.concatstrings spent 9.43 seconds for RPCs coming from the authentication service and 3.20 seconds for RPCs coming from another service.
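Here is a rough sketch of what the gRPC server integration, the hello measure, a custom span, and the tag.Do profiling hook might look like together in Go. It assumes the OpenCensus Go packages (ocgrpc, stats, view, tag, trace); the measure name, view, port, and sayHello handler are illustrative rather than the exact code from the slides.

```go
package main

import (
	"context"
	"log"
	"net"

	"go.opencensus.io/plugin/ocgrpc"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
	"go.opencensus.io/trace"
	"google.golang.org/grpc"
)

// Measure: the number of times we said hello.
var totalHello = stats.Int64("hello/total", "Number of hellos", stats.UnitDimensionless)

var keyOriginator, _ = tag.NewKey("originator_service")

func main() {
	// Collection only happens once a view over the measure is registered.
	if err := view.Register(&view.View{
		Name:        "hello/total",
		Measure:     totalHello,
		Description: "Hellos broken down by originator",
		TagKeys:     []tag.Key{keyOriginator},
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatal(err)
	}

	// The OpenCensus stats handler extracts the incoming tags and starts
	// server spans automatically for every RPC.
	srv := grpc.NewServer(grpc.StatsHandler(&ocgrpc.ServerHandler{}))
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.Serve(ln))
}

// sayHello is what a handler body might look like.
func sayHello(ctx context.Context) {
	// Record against whatever tags arrived with the RPC (originator, user ID, ...).
	stats.Record(ctx, totalHello.M(1))

	// A custom child span, parented on the span the gRPC plugin started.
	ctx, span := trace.StartSpan(ctx, "format-greeting")
	defer span.End()

	// tag.Do attaches the tags in ctx as profiler labels, so CPU samples
	// taken while the function runs can be broken down by the same dimensions.
	tag.Do(ctx, func(ctx context.Context) {
		// ... CPU-heavy work whose samples carry the tags as pprof labels ...
	})
}
```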
Let me focus on some of the core principles we have. One of our goals is to let engineers instrument as much as possible without thinking too much about the cost. This is why we have a separation between instrumentation and collection: instrumentation is cheap if you don't collect, because if you just drop whatever is recorded, it costs almost nothing. Rather than collecting all the metrics, we defer to the end user to enable collection. The instrumentation bits are left here and there with almost zero impact on the critical path, and the end developer usually decides what to collect. This allows libraries and frameworks to instrument without worrying too much about the cost: they provide some measures, and then users enable metric collection, for example, on the provided measures.

As I just mentioned, collection requires explicit enabling, but the same is true for disabling. This allows us to dynamically enable or disable collection in production. For example, imagine a gzip library that has been instrumented to measure the compressed chunks. Until you're suspicious about this library, you don't have to collect any metrics; but if you are, then in production you can enable collection and start receiving metrics. Observing becomes very easy when you have a static list of things to observe, but systems usually surprise you. That's why we encourage a model where you can dynamically expand whatever you are collecting.

We sample expensive and large data. Everything that is cheap to collect and aggregatable usually doesn't have to be sampled. Examples of sampled signals are traces, because they are very big, and profiles, because they are very expensive. On the other hand, we aggregate data in efficient ways to produce cheap and small data and avoid sampling. This is what we do for metric collection, so we never have to sample our metrics. It's great not to have to sample the metrics, because that's how you can see the 99.99th percentile stats. I said that data size can be a reason why we aggregate or sample data; one of the other obvious reasons is that we want to limit the outbound bandwidth spent on data collection. For signals that are aggregatable, like metrics, we try to aggregate them in-process or near-process to reduce the bandwidth: either in the process itself, or in an agent that lives very close to the process and does the aggregation there.

At Google, we try to use the same instrumentation libraries everywhere. When we are providing black-box monitoring, we try to use the same libraries for compatibility. For example, a trace can be started at our load balancer, and then our engineers can use the same libraries to add more spans to the existing trace. Similarly, our microservices frameworks like gRPC provide tags and core instrumentation out of the box. It is easy for our engineers to build on what is already there and put stuff on top of it, rather than building everything from scratch.
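A minimal sketch of that "instrument always, collect on demand" split and of sampling expensive signals, again assuming the OpenCensus Go stats/view and trace packages; the gzip-style measure, the bucket bounds, and the sampling rate are made up for illustration.

```go
package main

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

// Library side: recording is always on; it is nearly free while nobody subscribes.
var compressedBytes = stats.Int64("compress/bytes", "Compressed chunk size", stats.UnitBytes)

func compressChunk(ctx context.Context, chunk []byte) {
	// ... do the actual compression ...
	stats.Record(ctx, compressedBytes.M(int64(len(chunk))))
}

// Application side: collection starts only when a view is registered, and it
// can be turned on or off while the process is running.
var compressedBytesView = &view.View{
	Name:        "compress/bytes",
	Measure:     compressedBytes,
	Description: "Distribution of compressed chunk sizes",
	Aggregation: view.Distribution(1024, 65536, 1<<20),
}

func enableCollection() error { return view.Register(compressedBytesView) }
func disableCollection()      { view.Unregister(compressedBytesView) }

func main() {
	// Expensive, non-aggregatable signals such as traces get sampled instead:
	// here we keep roughly 1 in 10,000 traces.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(1.0 / 10000)})

	_ = enableCollection()
	compressChunk(context.Background(), make([]byte, 4096))
	disableCollection()
}
```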
One of the other useful tools we have at Google is these introspection pages. You can think of introspection as a small backend that collects and visualizes what is collected in the process. It is a very useful tool for understanding what is happening without having to rely on a backend, and it's also useful during development time, as I mentioned. Here you see the traces page. There is a small dashboard that displays spans by name and gives us the overall latency distribution. You can see the details for ten samples from each distribution bucket and for the errored ones, so you have a clue about what the main problem is.

To summarize: we have a holistic approach, and we use multiple signals: metrics, traces, and more. Tags allow us to break down data by dimensions; each team can produce them and pass them to the low-level services. We instrument our core frameworks, service meshes, and load balancers out of the box, so users automatically get a lot of instrumentation through them. Then they can use the same libraries to add fine-grained details, just like in the case of creating custom spans. Our instrumentation layer is optimized to be low overhead and low cost, which makes it easier for libraries and frameworks to instrument without thinking too much about the cost. We have shared these concepts and put them in place, and they give you a good foundation layer. OpenCensus is very similar to the approaches we have used internally, and it is currently available and vendor agnostic. We already have support for Prometheus, Zipkin, Jaeger, and more. I highly encourage you to take a look, give us feedback, and contribute. Thank you so much.

Okay, do we have any questions? Yes, one there.

Thank you. How does this relate to OpenTracing?

It's a common question we get. One of the easy answers is that we have a holistic approach; OpenTracing is focused on tracing. The way OpenTracing sees itself is more as an API, not a real implementation. The data models are quite compatible: we can currently say that if there is OpenTracing instrumentation in place, we can convert it to our instrumentation and still use our backend exporters to export the data. We are talking to OpenTracing right now to see if we can get on a similar page in terms of naming, but it's a little bit difficult because we have tags, our dimensions, decoupled, whereas OpenTracing puts them in as part of the tracing API. So we're discussing and seeing what we can do. Next question.

One of the problems of monitoring and collecting data from services is that when a service has a problem or crashes, the amount of data it generates is much higher; depending on the design of the application, it can be five times, twenty times more. You talked about aggregators on the node. What is the rule of thumb for how much data it makes sense to produce?

It really depends. But at Google, we have this other concept of having aggregators outside of the process, so we usually have agents located nearby; we sometimes collect raw metrics and use that additional agent to export the final thing. And it depends on the latency expectations you have and the cost between your process and the agent.

The question, because we face this problem, is what would your advice be on how much data, in size, it makes sense to produce; specifically, how much data we should buffer.

Yes. It's again really hard to say that there is one typical rule, so you just need to experiment for your case. And again, you need, I think, to be able to depend on different models to export.
You can export from the agents rather than from the process, or from the process itself. I cannot really give one rule; there's no single perfect rule that works. Next question.

Okay, so I'm going to ask a question. Sure. You say that it's very low overhead. What would be the typical number of nanoseconds taken for an event you're throwing away?

Nanoseconds of what?

You said the instrumentation is low overhead. So if you have an event for which there's no use, one that just gets thrown away, what sort of cost is there?

The ideal thing is that we drop the events. Metric collection, for example, happens this way: we record events and drop them if no one is subscribed. We still need to iterate over those events, but it's very minimal; usually we just omit it. I don't have real benchmarks for the Go implementation right now, which I am responsible for, so I cannot tell, and this project is still in the early stages. But for the C++ implementation internally, we really don't have to worry about it too much.

Any other questions? Going three, two, one. Thank you very much.