How's it going? I suppose if you're interested in this talk, you've probably had an application you didn't quite understand. If you've understood every app you've ever seen, you can just go take a coffee break. This talk is observability three ways. I'm using the term observability because that's what we used at Twitter back when I worked there, but some people call it telemetry or monitoring. One thing I've noticed, especially working in distributed tracing, is that people have existing tools: they have log analysis and rollups, they have metrics, they may or may not have tracing, and they may have an APM system with dashboards and things. And it's sometimes hard to tell where one thing starts and another stops. That's one of the things I'm going to try to take on, so let me know how it goes. The good news is that there's a unifying theory between these things. Coda Hale mentioned this to me: everything ends up being based on events. If we think of the atoms of observability, those are the events; they don't break down much below that. In the simplest case, logging we can think of as just putting events on a timeline. They may or may not have structure, but that's the simplest recording we can do. Metrics, on the other hand, are themselves events derived from those events, usually by summarizing. You might take ten request-latency events and bucket them, and that bucket is now a new type of event. And that can continue: for example, you change the granularity of time to get requests per minute instead of requests per second. Tracing is also events, but it has a really interesting nuance, which is that you can tell a happens-before relationship. You can tell that this request in fact caused this next request, independent of timing. So even if your clocks are all a mess, you would still know that this customer request ended up causing this back-end request. At the end of the day, they're all giving us insight based on events, or on new events derived from them, and where they focus is a good place to tease one apart from the other. With logging, many times when we're developing, ad hoc, we find that some crappy error happened, and it just gets dumped straight to a log, either because we logged it or because we didn't trap it and the system did. At any rate, exceptions tend to end up in logs. And if you roll around the circle, exceptions are usually also kept in traces, or at least the fact that there was an exception. But a trace can tell you more, for example the impact. If you have a happens-before relationship, meaning this led to this, and this error happened, what was before it? Was it a customer request? Was it a redundant request? You can tell a lot more about the impact of that specific thing in the system. And keeping around the circle, tracing often has a focus on latency, because we know latency really matters to us: if something takes too long, we give up, we're very impatient. And we want to know about latency not just for one request but for how our whole apps are doing. We'd like to know our latency in aggregate terms, for a cluster or for a node, and metrics tend to be really good at focusing on the aggregate nature of these events.
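As a brief aside, here's a toy sketch of that happens-before nuance, with invented class and field names that aren't from the talk: the causal order comes from parent and child IDs carried in the data, so it holds up even when the backend's clock disagrees with its caller's.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy sketch (names invented): causality comes from parent/child links carried in the
// spans, not from timestamps, so it survives skewed clocks between hosts.
class SpanRecord {
  final String id, parentId, name;
  final long startMillis; // deliberately untrusted here
  SpanRecord(String id, String parentId, String name, long startMillis) {
    this.id = id; this.parentId = parentId; this.name = name; this.startMillis = startMillis;
  }
}

public class HappensBefore {
  public static void main(String[] args) {
    List<SpanRecord> trace = List.of(
        // the backend's clock is behind, so its start time looks "earlier" than its caller's
        new SpanRecord("a", null, "customer request", 10_000),
        new SpanRecord("b", "a", "backend call", 9_200));
    Map<String, SpanRecord> byId = trace.stream().collect(Collectors.toMap(s -> s.id, s -> s));
    for (SpanRecord s : trace) {
      if (s.parentId != null) {
        System.out.println("'" + byId.get(s.parentId).name + "' happened before '" + s.name
            + "' even though its timestamp is later");
      }
    }
  }
}
```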
But if you keep spinning around the circle, you'll find that your metrics include things like error counts over time. So it's neat that they have these focus areas and places where they overlap. One thing they all intersect at is request-based information. If you have an HTTP request coming into your system, it might have a side effect in a log file, it might have incremented a counter in metrics, and it might also be in a trace. So right in the middle, you have a place where all three tools give you information about the same thing, in this case maybe an endpoint. On the other hand, there are things where they don't really overlap. In logging, you'll have things that are not request-scoped in nature: if you're running a garbage-collected system, you might have garbage-collection events, or you might have audit events, which aren't necessarily going to end up in a trace. In metrics, we may have things that aren't system in nature, like counts or other value metrics about our application. If you're at Netflix, you might want to know your video streams started per second, which is maybe related to some of these other things, but it's something you'd use metrics for that isn't really about the system. And within traces, there's definitely a good overlap with logging and metrics, but not all of the information a trace calls out is usually also kept in the other two buckets. So they have focus, they have overlaps, but they also have some unique things to add. And that's one of the reasons I'm hoping this talk helps folks: we often have one or two of these, or maybe we're considering adding something else, and what do you get with that? What does it buy you? All of these decisions cost something, so let's understand what our values are. To drill down a bit and make it practical, let's use latency, the thing that's in the center of all of those, and talk about these three tools, which aren't the universe of all tools, but the three I have time for today, in common terms. So take response time. In a log, response time ends up in a log line, or as the difference between two of them. Metrics are obviously very good at storing numbers, and response time can be represented as one. As for traces, we'll get into that, but essentially all of these have a central fact here, which is latency. A lot of the places where you see things dumped out on these slides is just me trying to make it a bit more practical. So for example, this is an HTTP logging format some people have seen before. The neat thing about this format is that it crams a lot of information into one line, which is good, but it also adds latency as a trailing field. That's nice because usually, if you're trying to do napkin math on latency, you'd have to compare timestamps between two different lines; this particular format happens to place the request latency as the last field. So you can get an accurate number for how long something took, one that isn't going to be messed up by clock adjustments from NTP or something. In this case, we're looking at something that dumped out microseconds, because microservices, and that turns out to be 95 milliseconds. Which, who knows? We know that's the time.
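The slide's exact format isn't reproduced in this transcript, so as a hedged illustration, here's a made-up access-log line in that spirit, with the request duration in microseconds as the trailing field, plus the trivial parsing that trailing field makes possible.

```java
// Hypothetical access-log line; the trailing field is the request duration in microseconds.
// (The real slide's format isn't in the transcript, so this line is illustrative only.)
public class TrailingLatency {
  public static void main(String[] args) {
    String line =
        "10.1.2.3 - - [14/Mar/2018:14:19:02 +0000] \"GET /api/profile HTTP/1.1\" 200 5213 95000";

    // Because latency is the last field, we don't need to diff timestamps across two lines.
    String[] fields = line.split(" ");
    long micros = Long.parseLong(fields[fields.length - 1]);
    System.out.printf("request took %.1f ms%n", micros / 1000.0); // 95.0 ms
  }
}
```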
But if you've worked with developers before, you'll notice that when somebody says something is slow, they mean all sorts of things. They could mean it's 30 seconds; they could mean it's 30 milliseconds. It's very ambiguous until you keep asking questions again and again. Metrics are one of the things that can help us take a fact like a latency figure and answer: is it actually slow? What is that number, and what's its relevance within the population of my system? Because our systems have slow times when we're under a crunch. If we're running Twitter itself, then during the Super Bowl there's going to be a lot of activity hitting things, and the system is going to behave differently during that time, even if it behaves well. So with metrics, if we take a fact, say this 95 milliseconds at 2:19 PM: was that a slow request? Maybe if it was very slow, an alarm would have gone off, but we can still see how requests were doing within the population at that time. This is obviously not a real app, though I would love to see an ASCII-art monitoring app. In this case, the system was particularly troubled prior to this point, but at the time this request happened it was a bit slower than other requests, and not because the system in general was slow. That's useful information when we're trying to understand requests in our system. If we look at traces, they really drill down on a single request. They can help us triage and troubleshoot customer issues directly. Saying "generally speaking, the system was OK" doesn't answer the customer who may be hitting some tail latency that isn't represented in our 95th percentile or whatever. In this case, we can look at an individual request, if we're lucky enough to have a trace, and understand why it was slow. And that's good because when we're troubleshooting, we need to triage and rule out areas we don't need to explore, because time matters; we want to resolve failures quickly. So if we happen to have a trace like this, where yellow means a failed request, we would know that the overall customer request was successful. It didn't actually break the upstream, because the top bar isn't yellow, but it was delayed; if it hadn't had a failed network request, it would have been faster than normal and perfectly fine. And so we have a choice to make: is that a good enough answer? Is whoever is on the other side of the phone going to be OK with that? If so, you're done. You don't have to do anything else; you know it was the network that caused the delay, and you don't even have to blame the database team. So it's really nice to have a trace, because you get a good idea of what was actually responsible for a delay. You'll have your own takeaways; if you don't, here are some you can take. I feel like logs are the easiest. We all learn logging: what is Hello World, except our first log statement? These things are easy to grab, easy to understand, a lingua franca in any programming language. Metrics are neat because they give us the ability to understand trends within our system; they add relevance, which helps because everything is very subjective when we're working with systems. And traces can help us understand a specific request, or even a combination of events with more sophisticated tooling.
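To put a number on that "relevance within the population" point about metrics, here's a minimal sketch with made-up latency samples: it just asks where one observation, say 95 milliseconds, sits among recent requests.

```java
import java.util.Arrays;

// Toy sketch of what "relevance within the population" means: given recent latency
// samples (invented here), where does one observation, e.g. 95 ms, sit?
public class PercentileRank {
  static double rankOf(double valueMs, double[] samplesMs) {
    long below = Arrays.stream(samplesMs).filter(s -> s < valueMs).count();
    return 100.0 * below / samplesMs.length;
  }

  public static void main(String[] args) {
    double[] lastMinute = {12, 18, 22, 25, 31, 40, 44, 52, 61, 88}; // illustrative samples
    System.out.printf("95 ms is at the %.0fth percentile of recent requests%n",
        rankOf(95, lastMinute)); // slower than everything sampled, so worth a look
  }
}
```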
As an example of that more sophisticated tooling, folks at Dynatrace showed me a neat tool where they could reconstruct a postmortem from trace data, actually showing the system dying and coming back together. That sort of information is neat and can be built with trace data. So how do you write this code? We don't usually spend a lot of time writing timing code, but we could, and some of us do. At the end of the day, each of these tools has a different approach to writing timing code because of how it stores data and how it reports it. Logs, generally speaking, are either delimited or formatted in some way, so you cram latency data in there, or you rely on an externally provided format, such as there being a timestamp field to the left. Metrics are all about storing numbers, and we'll see examples of that. Tracing is where we start propagating the idea of an overarching request through the system. The jargon here is the span, which represents a single operation within your overall request. By overall request I mean that, especially with microservices, a front-end request comes into the system, and that system is usually not a monolith, so it breaks down into multiple requests: maybe it hits memcache, maybe it hits the database, and other things. Each of those is a span, and the overall request is the trace. For logging, I took some code from OkHttp, which is my favorite Java library for HTTP communication. It has a logging interceptor, and it usually looks something like this: you have some stopwatch, and then you cram the latency someplace. It's about formatting. The metrics example, on the other hand, is Scala, if you haven't seen that language before. Usually you take something from a stats or metrics registry that's scoped to an endpoint, or at least to a type of data like request latency, and then you just add numbers to it. How you collect the number differs between languages and libraries, but it's usually a stopwatch type of activity, and you sink the result into this statistics-gatherer thing. Tracing is a bit more complex because it has more state to it. Of the ones I've mentioned so far, this one is actually stateful, because when I say there is an overall request flowing through the system, guess what that means: there is state, and it's moving. You have to make sure that when something goes in one side, it doesn't get lost before it comes out the other side. This code is from Lyft's Envoy, a C++ proxy, aka a service mesh, and I've dumbed it down a bit. Essentially, you create a span representing the operation, and you make sure it isn't lost by the time the finish callback happens. Finish often implies stopping the timer, and at that point the span can be sent out. When a trace goes between systems over HTTP, you'll usually see a trace ID header and a span ID header, or something like that; that's how it manifests and how it gets from point A to point B.
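The Envoy example on the slide is C++, so here's the same span-lifecycle idea as a hedged Java sketch. The B3-style header names are the kind Zipkin uses; the class shape and the missing reporter are simplifications, not any particular library's API.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Not the Envoy code from the slide; just a minimal Java sketch of the same idea:
// a span carries identity and timing, and its IDs ride along as HTTP headers
// (here using Zipkin-style B3 header names) so the next hop can continue the trace.
class Span {
  final String traceId = Long.toHexString(ThreadLocalRandom.current().nextLong());
  final String spanId = Long.toHexString(ThreadLocalRandom.current().nextLong());
  final long startNanos = System.nanoTime();
  long durationMicros;

  void finish() {
    // "finish" is where the timer stops and the span becomes reportable
    durationMicros = (System.nanoTime() - startNanos) / 1000;
    // a real tracer would hand this off to an out-of-process reporter here
  }

  void inject(Map<String, String> httpHeaders) {
    httpHeaders.put("X-B3-TraceId", traceId);
    httpHeaders.put("X-B3-SpanId", spanId);
  }
}
```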
So what's the impact of all this? Logs are ubiquitous, like Hello World, but they do require coordination, because if we're going to write a format, we're assuming somebody can read it, whether that's a human or a system. That's a type of coordination: you're coordinating based on human intuition, or on a format, or something like that. Metrics, in my opinion, are the easiest APIs to work with, because you're just putting numbers into them, and it's hard to get that wrong. You could put the wrong number in, but it's hard to put the number in incorrectly. Tracing is the hardest, because it's doing the most work: it actually has to carry state through the system and make sure that state is coherent. If it didn't, you could be looking at a trace that isn't what actually happened, and that could confuse you and cause damage instead of helping you. The next point is: OK, we can definitely do this type of stuff, but should we? "Should we" is, I think, one of the less discussed questions in engineering. The reason I say that is that frameworks usually do a lot of this stuff for us. We may write things from scratch without frameworks, but often enough, if you look, they have the capabilities I've mentioned in the box. And even though I've dumbed down the examples, there are lots of edge cases in timing code. I hinted that clock skew is a problem: the clock on one system may be different from the clock on another system, and if you pass timing data between them, how do you reconcile that? Even within the same box, the clocks might correct their time, and if the clock moves backwards, it can look like your request took negative time to complete, which would be awesome, but not very realistic. There are all sorts of edge cases, which are fun, but not necessarily something we all need to know or work on, unless we're hobbyists. So how do you avoid seeing tracing code? There are a number of ways, and I'll break down a few. One is a buddy, meaning you have some other process doing it for you, intercepting your code, your process, or your container and taking on those responsibilities; we're seeing this more and more in service-mesh type deployments. Another way, probably the most popular for performance management tools, is agents, which monkey-patch or otherwise change the code as your application boots up and put in instrumentation points that capture latency or other information on your behalf. And another way to not see the code is to use frameworks, which can be configured to intercept your code. Buddy tracing: this is actually an image from Linkerd, one of the few options out there in the service mesh space. It's an example of having a sidecar that's responsible for your inter-service communication: you send your request to localhost, and it actually makes the outbound request on your behalf. The neat thing about service meshes is that they can do other neat things. For example, Linkerd has the ability to give you a special route: if you wanted to send a percentage of your customers to a new feature, it could propagate a special route to send them to your test cluster based on some policy data. So propagation isn't just for trace IDs, making sure trace IDs go from point A to point B; it can be used for deadline guards, to make sure that a request that's already taken too long doesn't keep proliferating load throughout the system, and for metering information, if you're doing any metering. So buddies usually do more than tracing, actually.
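As one concrete example of "more than tracing", here's a hedged sketch of a deadline guard. The X-Deadline-Millis header is hypothetical; real systems such as gRPC have their own conventions for carrying a time budget.

```java
import java.util.Map;

// Hedged sketch of the "deadline guard" idea a sidecar can enforce: the remaining time
// budget rides along with the request, and each hop refuses work that can't finish in
// time instead of piling more load on an already-late request. The header name
// "X-Deadline-Millis" is hypothetical; for example, gRPC uses its own grpc-timeout convention.
public class DeadlineGuard {
  static boolean shouldProceed(Map<String, String> headers) {
    String deadline = headers.get("X-Deadline-Millis"); // absolute epoch millis, set at the edge
    if (deadline == null) return true;                  // no budget to enforce
    long remaining = Long.parseLong(deadline) - System.currentTimeMillis();
    return remaining > 0;                               // too late: fail fast, don't fan out
  }

  static void forward(Map<String, String> incoming, Map<String, String> outgoing) {
    if (incoming.containsKey("X-Deadline-Millis")) {
      outgoing.put("X-Deadline-Millis", incoming.get("X-Deadline-Millis")); // keep propagating
    }
  }
}
```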
But one thing buddies do tend to do is tracing. Agents are super powerful things. This example uses Byte Buddy, which is a Java agent library. The way these things work is that as your application boots up, the agent literally changes the code, for example to trace it or do other things. If you look underneath a lot of performance management tools, that's how the secret sauce of AppDynamics or similar works: they're doing automatic tracing for you, and you never need to see the code that's doing it, although with open source we can see some of it. The interesting thing about agents is that there are things you can do in an agent that you just can't do with code normally. In the Java world, for example, you may have thread pools and such that you can't actually touch; they're implicit objects. Agents can touch anything, so they're pretty powerful, and that's why a lot of performance tools use them. Frameworks: this is an example of a bunch of annotations representing how to trace something in Spring Boot. You don't have to use this; you add a file to your classpath and then, magically, it does this on your behalf. Frameworks often know best how to trace a given thing, which is neat because it also invites the authors of the libraries to do that work, so you get a high chance of it working out well. Frameworks usually have a configuration approach, whether that's placing a jar or something into a plugins directory, or a big YAML or HOCON file; you have something that configures the ability to trace stuff. They all have their pros and cons, but if you don't have choices, you use what you have: if I don't have an agent, I can't use an agent, so I might use a framework; and if I have none of these options, maybe I end up having to write the code myself. All of them, at the end of the day, have to ship that data out of process, and this is a pretty interesting thing about observability and worth mentioning, because a lot of the cost of a system is the ancillary things going on. I remember asking Netflix once why they don't do a lot of tracing, and it was because it would end up being a more expensive system than actually running Netflix. So we always have to be careful about things like how much data we collect and how long we retain it, and each of these tools adds different things to that discussion. Logging usually has a pipeline, which could be ELK or whatever your favorite way is to get logs out of your apps and into where they can be analyzed, and it's often parsing in nature. Metrics, on the other hand, are summarized. Often they're summarized in the app itself, maybe converted to requests per second, and then your system might summarize them again to requests per minute; as data goes through the system, it may keep getting summarized. That has a lot of gravity on not just the shipment but the retention of data, and it's a place where metrics end up very different from logs.
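Here's a minimal sketch of that "summarized again as it moves through the system" idea, with invented input data: a pipeline stage folds per-second request counts into per-minute ones, which is part of why older metric data can be kept so cheaply.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of cascading summarization: a pipeline stage folds per-second request
// counts into per-minute ones, so older data costs 60x less to keep. Input is invented.
public class Rollup {
  static Map<Long, Long> toPerMinute(Map<Long, Long> perSecondCounts) {
    Map<Long, Long> perMinute = new TreeMap<>();
    perSecondCounts.forEach((epochSecond, count) ->
        perMinute.merge(epochSecond / 60, count, Long::sum));
    return perMinute;
  }

  public static void main(String[] args) {
    Map<Long, Long> perSecond = new TreeMap<>(Map.of(100L, 7L, 101L, 5L, 161L, 9L));
    System.out.println(toPerMinute(perSecond)); // {1=12, 2=9}
  }
}
```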
The other thing is that metrics are often near real time, which is another one of those phrases: if you have a problem with somebody being unspecific about the word latency, try the phrase near real time, because somebody might think it means a second, or 30 seconds, or 200 microseconds. So again, with all these things about time, always ask for clarification: what do you mean by that? Traces tend to have a similar read-back interval and are expected, in most cases, to be available as soon as a request has happened. When we look at the nitty gritty: if you've used Logstash, you've probably seen something similar to this. We've been parsing logs so long that we actually have tools to help parse them, like grok, which is a way to typify how we store things like IP addresses and numbers and such. You can take these patterns, these coordinated things, whether they're intentionally or unintentionally coordinated, and create things that we can roll up later on. And as I said at the beginning, things are derived from events; you may actually have metrics that are produced from your log output, and that may in fact be what you're doing here. Bucketing can be done in many ways. We hear a lot of things in statistics about percentiles and histograms and buckets. This is an example of a way to classify requests that goes with my earlier diagram: we have some boundaries, maybe coordinated up front, as they certainly are in this image, so we know that anything below a millisecond is super fast and anything beyond 50 seconds is super slow. Then, as data goes into the system, we just increment a number according to that classification. And that has a neat side effect, because it dramatically reduces the amount of data we're sending out of process. Why? Because shipping the number one isn't a heck of a lot different from shipping the number one million; it's 64 bits, or 32, depending on how you ship it. At any rate, it's not a direct function of the count of requests, and that's a powerful tool we can use. Spans, on the other hand: in order to build these graphs that show us what a request looked like, we do have to retain a lot more information, the parent-child relationships, at least the duration, and usually some lookup tags and things. So how does this impact our data on disk, or not on disk? Logs, we know, grow. That's the first thing we know about logs: they grow, they fill up disks. They grow with traffic, although they also grow with other things, like errors that have nothing to do with traffic. They also grow with verbosity, because logging usually has something like a debug level or a trace level, and if a developer is trying to discover more, they'll usually turn up the verbosity, which can end up adding more data to the system, not just locally to that process. Metrics are neat because they're fixed size with respect to traffic; that's why I was saying that shipping the number one isn't that much bigger than shipping the number one million. Your data grows based on the number of things you're measuring, like your endpoints or how many types of data you collect per endpoint, but the traffic itself isn't a primary driver of size. There's definitely nuance around that: if you have a lot of customer dimensions, you can definitely end up with more metrics, but it's nowhere near as directly linked to traffic as logs are. And this tells us how we can reduce this volume in our system.
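Going back to the bucketing idea for a moment, here's a minimal sketch with made-up boundaries rather than the ones on the slide: every observation only ever increments a counter, so what leaves the process stays the same size no matter how much traffic there is.

```java
// Minimal sketch of fixed-boundary latency bucketing; the boundaries here are invented,
// not the ones from the slide. Whatever the traffic, what leaves the process is just one
// counter per bucket.
public class LatencyBuckets {
  // upper bounds in milliseconds; the last bucket is "anything slower"
  static final long[] BOUNDARIES_MS = {1, 5, 10, 50, 100, 500, 1_000, 50_000};
  static final long[] counts = new long[BOUNDARIES_MS.length + 1];

  static void record(long latencyMs) {
    for (int i = 0; i < BOUNDARIES_MS.length; i++) {
      if (latencyMs <= BOUNDARIES_MS[i]) { counts[i]++; return; }
    }
    counts[BOUNDARIES_MS.length]++; // super slow
  }

  public static void main(String[] args) {
    long[] observedMs = {0, 3, 95, 120, 60_000}; // illustrative observations
    for (long ms : observedMs) record(ms);
    for (int i = 0; i < counts.length; i++) System.out.println("bucket " + i + ": " + counts[i]);
  }
}
```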
So if we're looking at logs, the first thing anybody will tell you is to stop logging irrelevant things, because if you have irrelevant data going into the system, you're just adding weight to it. That's why a lot of these tools have filters and other controls at the collection and ingress points: we know that often we have data that we can't even control being placed into logs, and handling that is a normal function of logging pipelines now. For metrics, when I was on the observability team at Twitter, we had a kind of read-your-writes initiative. It's very easy and interesting to put metrics on everything, but you can end up writing a thousand times more information than you're ever going to read, and that just adds load and cost to your infrastructure. If you're trying to control the volume of metrics, you can use a coarser granularity, aggregating on minutes instead of seconds, or maybe keeping second granularity for the last five minutes but only minute granularity for the last day. But also, more simply: read your writes. One thing I tend to do in the open source projects I'm on is ask why when people add metrics: are you anticipating someone actually using this? In a lot of cases you can get by with what frameworks do, and frameworks often give you the ability to mute metrics a lot more easily than custom code does. Tracing's primary way of reducing volume is sampling, which means that if you have, say, 100 requests going through your system, maybe you choose to keep a trace for five of them. That would be probabilistic, and there are choices to make about the rate. If you have a front end like Twitter's, you can get by with a very low sample rate, maybe even a thousandth of a percent of requests, but if you have a less used endpoint, you might just take 100% of those, because you're not going to overload your system. If you do sampling with traces, you have to be consistent. Logs are sometimes shipped over a UDP transport, so you can have some accidental sampling because of packet loss; you might even lose 10% of your data, and that can be OK depending on your use case. With traces, lost data can actually be disruptive, because you're using them to tell what actually happened in the system, so the sampling should be consistent. A lot of the tools, for example the one I work on, Zipkin, make the sampling decision up front and then carry it through, to make sure a trace is either all there or not there at all. And then finally, on this volume topic, there's how long data is kept. With logs and metrics, we tend to see policies of 30 days, or much longer for logs. Traces are interesting; we had, I think, a three-day retention policy at Twitter on trace data, because it was primarily used for triage. So if you put them all in the same system, having different retention policies per type of data can be helpful, because in some cases, for audit requirements, you may have to keep certain log data a lot longer, but that doesn't apply to metrics, and maybe it doesn't apply to your trace data. So you have a lot of tools here. If observability is one big classification and everything in it has to be stored for a year, you're looking at a very expensive retention policy.
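Here's a minimal sketch of that consistent, up-front sampling decision, in the style of a Zipkin B3 sampled flag; the rate and the header handling are simplified for illustration.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of consistent, up-front trace sampling. The decision is made once, at
// the edge, and then propagated (here in the style of a B3 "sampled" flag) so every hop
// keeps or drops the same trace. Rates and header handling are simplified.
public class Sampler {
  final float rate; // e.g. 0.05f keeps roughly 5 in 100 traces

  Sampler(float rate) { this.rate = rate; }

  boolean decide(Map<String, String> incomingHeaders) {
    String upstream = incomingHeaders.get("X-B3-Sampled");
    if (upstream != null) {
      return "1".equals(upstream);           // honor the upstream decision: all or nothing
    }
    return ThreadLocalRandom.current().nextFloat() < rate; // we're the edge: decide once
  }

  void propagate(boolean sampled, Map<String, String> outgoingHeaders) {
    outgoingHeaders.put("X-B3-Sampled", sampled ? "1" : "0");
  }
}
```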
So hopefully some of these things I've mentioned, and of course I'll post the slides so you can look at them later, can help you make good decisions based on your environment, what advantages you can take from each of these, and what retention would be useful. And here we are: we have these systems, we know a lot about them, and I haven't really said much about individual implementations; you could be using commercial tools, or Prometheus, or Logstash, or Kibana, whatever. They do work together, and it's not just that they do similar things. Often, if you have correlated tags or lookup keys or ways of classifying the data in the same way, indexes, then you can use them together. You'll find that in tracing, if people are doing tracing, they usually also drop the trace ID into the logs. That way you can correlate them and find things, like exceptions, that you might not happen to have in your trace data. If you're looking at metrics, you may have an RPC name or an endpoint name that's captured commonly between those two tools, and that way you can easily transition from metrics that show something is awry to representative traces in that exact same time period, or at least look them up somewhere else. And as you get to wider and wider scopes, you have a higher and higher chance of a granularity match between the three systems: cluster, host, data center, those tend to be the type of tag you can use across all of the sources. So when you stitch these things together, you can use them together, even if they have different retention policies, and see broader or more constrained contexts of the same events. I've talked about a lot of things, but if I try to distill down what I hope you come away with, it's that logging, metrics, and tracing are different tools that all help us understand what our apps are doing. If we know what they are, we can leverage their strengths and understand their weaknesses, as opposed to pitting them against each other, and that's going to help us with the actual environments we have. Many of us still have monoliths; logs are great there. That doesn't mean they're bad for microservices; it just means there are different ways we can use each of these tools, and you wouldn't necessarily turn one off to turn another on. And through the ability to identify patterns and exception cases and to understand individual requests and systems, we have a lot of power to answer questions, not limited to why something is slow, but what is it actually doing. In microservices, we're getting larger and larger architectures, and we're giving a lot more flexibility to developers. In some cases we may not be able to read their code, but we can at least identify what's happening at a system level, and as requests go across the system, that gives us a lot of power for triage and for getting people on the same page. If you felt this was helpful, one thing I'd encourage you to do is read Peter Bourgon's blog in general, but he also has a post that inspired this talk, which is about metrics, tracing, and logging.
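One small, concrete example of that correlation idea: a hedged sketch that puts the current trace ID into the logging context using SLF4J's MDC, so log lines and traces share a lookup key. The key name and how you obtain the ID depend on your tracer, so treat both as assumptions.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Hedged sketch: put the current trace ID into the logging context (SLF4J MDC) so log
// lines and traces share a lookup key. The key name "traceId" and the way the ID is
// obtained are assumptions; they depend on which tracer and log pattern you use.
public class TraceIdLogging {
  private static final Logger log = LoggerFactory.getLogger(TraceIdLogging.class);

  static void handleRequest(String traceId) {
    MDC.put("traceId", traceId);           // a log pattern can include %X{traceId}
    try {
      log.info("handling profile lookup"); // this line is now correlatable with the trace
    } finally {
      MDC.remove("traceId");               // don't leak the ID onto unrelated requests
    }
  }
}
```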
And also, a lot of folks reviewed this content, people I work with in tracing land, including Bas, who works on this thing called Go kit; Bogdan from Google, who is working on Census, which is like the successor to Dapper; a colleague from Criteo, which is a pretty high-volume service over there in Paris; Nick from Nike, who has some interesting open source libraries; Coda Hale as well; Felix, who runs a small open source APM called stagemonitor; and Abhishek from AWS X-Ray. But if you didn't like it, then please don't share that around; just give it to me right here, and then maybe the next talk will be even better. But thanks for your attention, and if you have any questions, I'm ready for them.

So we have lots of questions, but we don't have a lot of time, so I'm going to try and just pick one or two. This is a good one: someone asked, when the tracing is hidden, i.e. it's not done explicitly in the code, how do you automatically verify with a test that all the things are configured correctly, that the tracing is happening, all of that sort of stuff?

Yeah, so if you don't trust your framework, then you treat it like other things you don't trust, which is as a black box. You could, for example, write a small integration test which hits a traced endpoint and then verifies that you can read back the trace. I do that in my apps, actually. Guys, if you're leaving, can you not talk, please? So one thing is to look at the tests that exist for whatever you're using, but also, with any suspect system, you can always test it as a black box.

There's another question about the W3C's proposed Server Timing spec. I don't know if you're familiar with it. Do you think it will change the way you monitor or report on performance?

So, interestingly, there's more spec work going on now. In fact, even in tracing, I've just gotten more information about a trace spec for headers and things like that. I think one of the important things I try to work on with groups is making sure that folks collaborate, even if they don't share a single spec, because it's helpful to do that. There have been some unfortunate cases where, for example, in your browser toolkits, there's a lot more going on, and also at the kernel level there are perf events and things like that. I think over the next year or two you'll find more links between these types of specs. And it's up to us to ask, if we see a tool that isn't using something we're interested in, whether it's a spec or otherwise, because not everybody is aware of everything.

And then two Zipkin-specific questions. Is there a hosted, managed SaaS version of this at all? And were there New Relic plugins, things like that?

So, two questions there. Zipkin was originally inspired by a Google technology called Dapper, which is now called Census, and the hosted service of that is called Stackdriver. There is actually a proxy that you can use to send Zipkin data to Stackdriver, and it's actually a free service. And there are also others that are accepting Zipkin data too. If you have existing monitoring agents, I would say it's more or less a work in progress for those to send to a private Zipkin install. In fact, I'm going next to Linz to talk with Dynatrace, who are working on something like that so their stuff can actually report into a Zipkin system. So just keep an eye out.
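For the black-box check described in that first answer, here's a hedged sketch: it forces a known B3 trace ID onto a request and then reads the trace back over Zipkin's HTTP API. The service URL, the endpoint, and the assumption that the service honors incoming B3 headers are all illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

// Hedged sketch of a black-box trace smoke test: hit an endpoint with a known B3 trace ID,
// then read the trace back from Zipkin's HTTP API. URLs, the endpoint, and the assumption
// that the service honors incoming B3 headers are illustrative; adjust for your setup.
public class TraceSmokeTest {
  public static void main(String[] args) throws Exception {
    String traceId = String.format("%016x", ThreadLocalRandom.current().nextLong());
    HttpClient client = HttpClient.newHttpClient();

    // 1. Make a request that should be traced, forcing the trace ID so we can find it later.
    HttpRequest call = HttpRequest.newBuilder(URI.create("http://localhost:8080/api/health"))
        .header("X-B3-TraceId", traceId)
        .header("X-B3-SpanId", traceId)
        .header("X-B3-Sampled", "1")
        .GET().build();
    client.send(call, HttpResponse.BodyHandlers.ofString());

    Thread.sleep(1_000); // give the tracer's reporter a moment to flush

    // 2. Read the trace back from Zipkin; no result means tracing isn't wired up.
    HttpRequest lookup = HttpRequest.newBuilder(
        URI.create("http://localhost:9411/api/v2/trace/" + traceId)).GET().build();
    HttpResponse<String> res = client.send(lookup, HttpResponse.BodyHandlers.ofString());
    System.out.println(res.statusCode() == 200 && !res.body().equals("[]")
        ? "tracing works" : "no trace found: check configuration");
  }
}
```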
There are a few more questions, but Adrian will be on Slack afterwards to answer them, I'm sure. Let's give him a round of applause. Thank you. Thank you.