Hello, good afternoon. Does anyone actually get the reference in my title? I see one or two of you. Yes, it's a reference to a really awesome book, Thinking, Fast and Slow, which I highly recommend, though it has no relation to this actual talk. So yes, my name is Lynn Root. I'm a site reliability engineer at Spotify, I also do a lot of open source evangelism internally, and you might know me from PyLadies as well. Unfortunately, I'm going to take up the whole time, so if you have questions or want to chat, you can join me for a convenient coffee break right after this.

Okay, another quick question: has anyone read the Site Reliability Engineering book, a.k.a. the Google SRE book? I think I see a few hands. I highly recommend that book, but the TL;DR of every chapter seems to be "use distributed tracing." With the prevalence of microservices, where you may or may not own all the services that a request flows through, it's imperative to understand where your code fits into the grand scheme of things and how everything operates together. There are three main needs that drive tracing a system: performance debugging, capacity planning, and problem diagnosis, although it can help address many other issues as well. While this talk has a slight focus towards performance debugging, these techniques are certainly applicable to other needs.

I have a bit of a jam-packed agenda today. I'll start off with an overview of what tracing is and the problems we can try to diagnose with it. I'll also talk about some general types of tracing we can use and what key things to think about when scaling up to larger distributed systems. The inspiration for this talk stemmed from me trying to improve the performance of one of my own team's services, which sort of implies we don't really trace at Spotify. So I'll be running through some questions to ask and approaches to take when diagnosing and fixing your service's bottlenecks. And finally, I'll wrap up with some tracing solutions for profiling performance. As I mentioned before, I won't have time for questions, so you can catch me right outside afterwards.

All right. In the simplest terms, a trace follows the complete workflow from the start of a transaction or request to its end, including the components it flows through. For a very simple web application, it's pretty easy to understand the workflow of a request. But then add some databases, separate the front end from the back end, maybe throw in some caching, have an external API call, put it all behind a load balancer, then scale that up tens, hundreds, or thousands of times, and it gets quite difficult to piece together the workflow of a request.

Historically, we've focused on machine-centric metrics: system-level metrics like CPU, disk space, and memory, as well as app-level metrics like requests per second, response latency, database writes, et cetera. Following and understanding these metrics is quite important, but they give no view into a service's dependencies or its dependents. It's also not possible to get a view of the complete flow of a request, nor to develop an understanding of how one's service performs at scale. A workflow-centric approach allows us to understand the relationships of components within an entire system; we can follow a request from beginning to end to understand bottlenecks, hone in on anomalous paths, and figure out where we need to add more resources.
So when looking at even a very simplified system, where we have a load balancer, front end, back end, database, maybe an external dependency on a third-party API, and where we have redundant systems, it gets particularly confusing to follow a request. How do we debug a problem in a rare workflow? How do we know which component of the system is the bottleneck? Which function call is taking the longest? Is there another app on my host distorting machine-centric or performance metrics, something like the noisy neighbor problem? With so many potential paths that a request can take, and with potential for issues at each and every node and edge, this can be mind-numbingly difficult if we continue to be machine-centric. End-to-end tracing allows us to get the bigger picture to address these concerns. And looking at the magnitude of what we're operating at Spotify, you can see that tracing, if we did it, would help us a lot.

So real quickly, there are a few reasons why we trace a system. The one that inspired this talk is performance analysis. This is trying to understand what happens at the 50th or 75th percentile, the steady-state problems, which helps us identify latencies, resource usage, and other performance issues. We're also able to answer questions like: did this particular deploy of the service have an effect on the latency of the overall system? Tracing can also clue us in on anomalous request flows, the 99.9th percentile. These issues can still be related to performance, or tracing can help identify problems with correctness, like component failures or timeouts. Profiling is very similar to the first, but here we're just interested in particular components or aspects of a system; we don't necessarily care about the full workflow. Fourth, we can answer questions of what a particular component depends on and what depends on it, which is particularly useful for complex systems. With dependents identified, we can also attribute particularly expensive work, like component A sending a significant workload of disk writes to component B, which is helpful when attributing costs to teams and service or component owners. Then finally, we're able to create models of our entire systems that allow us to ask what-if questions, like: what would happen to component A if we did a disaster recovery test on component B?

So there are various approaches to tracing; I'll only highlight three of them here. The first is manual, and also very simplistic: you generate your own trace IDs and add them to your logs. It's a very simple thing to add to your web service, especially one that doesn't have dependent or depending components you don't have access to. You won't get any pretty visualizations or help with centralized collection beyond what you typically have with your logs, but it can still provide insight. This is a Flask example, super simple, using a decorator. You simply add a UUID to each request, received as a header, then log at particular points of interest, like at the beginning and end of a request, and in any other in-between components or function calls where you want to propagate headers. This is exactly what I ended up doing for my service, which made me wish for a better way, hence this talk. I must admit I do a lot of conference-driven development.
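The slide code isn't in the transcript, but a minimal sketch of the decorator approach described here might look like this; the `traced` decorator name, logging setup, and route are my own placeholders, not the original slide:

```python
import logging
import uuid
from functools import wraps

from flask import Flask, g, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tracing")


def traced(view):
    """Attach an X-Request-ID to each request and log its start and end."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        # Reuse an incoming ID (e.g. one stamped by a proxy) or generate a fresh one.
        g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        log.info("start request_id=%s path=%s", g.request_id, request.path)
        try:
            return view(*args, **kwargs)
        finally:
            log.info("end request_id=%s path=%s", g.request_id, request.path)
    return wrapper


@app.route("/playlist")
@traced
def playlist():
    # Downstream calls made here would forward g.request_id as a header
    # so the same ID shows up in every component's logs.
    return "ok"
```

Any downstream HTTP calls inside the view would carry `g.request_id` in an `X-Request-ID` header, which is the "propagate headers" part she mentions.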
So if your app is behind an Nginx instance that you're able to manipulate, you can also turn on its ability to stamp each request with an X-Request-ID header, as you see here with the add_header and proxy_set_header directives. You can simply add the request ID to Nginx's logs as well.

Next up is black box tracing. This is tracing with no instrumentation across the components; it tries to infer the workflows and relationships by correlating variables and timing within already-defined log messages. From there, relationship inference is done via statistical or regression analysis. This is easiest with centralized logging, and with a somewhat standardized schema for log messages that contain an ID or a timestamp. It's particularly useful if instrumenting an entire system is too cumbersome, or if you can't otherwise instrument components that you don't own. As such, it's quite portable, and there's very little to no overhead, but it does require a lot of data points in order to correctly infer relationships. It also lacks accuracy, since the components themselves aren't instrumented, and it struggles to attribute causality in the face of asynchronous behavior and concurrency. Another approach to black box tracing is network tapping, using sFlow or nfdump or iptables packet data, which I'm sure the NSA is quite familiar with.

And then the final type of tracing is through metadata propagation. This approach was made popular by Google's research paper on Dapper. Components are instrumented at particular trace points to follow causality between functions, components, and systems, or even via common RPC libraries like gRPC, which will automatically add metadata to each call. The metadata that is tracked includes a trace ID, which represents one single trace or workflow, and a span ID for each and every point in a particular trace, like a request sent from the client, a request received by the server, the server responding, along with each span's start and end time (a rough sketch of propagating this metadata between services follows below). This approach works best when the system itself is designed with tracing in mind, but not many people do that, right? It avoids the guesswork of inferring causal relationships; however, it can add a bit of overhead to response time and throughput. Sampling traces limits the burden on the system and on data point storage: sampling anywhere between 0.01% and 10% of requests is often plenty to get an understanding of a system's performance.

So when you start to have many microservices and scale out with many more resources, there are a few points to keep in mind when instrumenting your system, particularly with the metadata propagation approach. In terms of what to keep in mind, and I'll go into detail about each in a second: what relationships to track, essentially how to follow a trace and what is considered part of a workflow; how they are tracked, since constructing metadata to track causal relationships is particularly difficult and there are a few approaches, each with their own fortes and drawbacks; how to reduce the overhead of tracking, where the approach one chooses for sampling is largely defined by what questions you're trying to answer with your tracing, and there may be a clear answer, but not without its own penalties; and finally, how to visualize, where the visualizations needed will also be informed by what you're trying to answer with tracing.
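As a minimal sketch of that metadata propagation, assuming plain HTTP headers rather than any particular tracing library; the header names, downstream URL, and helper function are illustrative, not from the talk:

```python
import uuid

import requests
from flask import Flask, request

app = Flask(__name__)


def incoming_trace():
    """Read Dapper-style metadata from the incoming request, or start a new trace."""
    trace_id = request.headers.get("X-Trace-Id", uuid.uuid4().hex)
    parent_span_id = request.headers.get("X-Span-Id")  # who called us, if anyone
    span_id = uuid.uuid4().hex  # a new span for the work this service does
    return trace_id, span_id, parent_span_id


@app.route("/recommendations")
def recommendations():
    trace_id, span_id, parent_span_id = incoming_trace()
    # A real implementation would also record this span's start/end times and
    # its parent_span_id, and report them out of band to a collector.
    headers = {"X-Trace-Id": trace_id, "X-Span-Id": span_id}
    # Forward the metadata so the downstream service records itself as a child span.
    resp = requests.get("http://user-profile.internal/profile/123", headers=headers)
    return resp.text
```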
All right, so what to track. When looking within a request, we can take two points of view: the submitter point of view or the trigger point of view. The submitter point of view focuses on one complete request and doesn't take into account whether part of that request is caused by another request or action. So for instance, the cache eviction here that was actually triggered by request two is still attributed to request one, since its data came from the first request. The trigger point of view focuses on the trigger that initiates the action, where in the same example, request two evicts the cache from request one, and therefore the eviction is included in request two's trace. Choosing which to follow depends on the answers you're trying to find. For instance, it doesn't really matter which approach is chosen for performance profiling, but following trigger causality will help detect anomalies by showing critical paths.

All right, how to track, which is essentially what is needed in your metadata. This boils down to the fact that it's very difficult to reliably track causal relationships within a distributed system. The sheer nature of a distributed system implies issues with ordering events and traces that happen across many hosts, and there might not be a globally synchronized clock available, so care must be taken when deciding what goes into the metadata that is threaded through an end-to-end trace. Using a random ID, like a UUID or the X-Request-ID header, will identify causally related activity, but then the tracing implementation must use some sort of external clock to collate traces. In the absence of a globally synchronized clock, or to avoid issues like clock skew, looking at network send and receive messages can be used to construct causal relationships, because you can't exactly receive a message before it's sent; a lot of tracing implementations use this very simplistic approach. However, this approach lacks resiliency: there's potential for data loss from external systems, or an inability to add trace points to components that are owned by others. Tracing systems can also add a timestamp derived from a local logical clock to the workflow ID, where this isn't exactly the local system's timestamp but either a counter or a sort of randomized timestamp that is paired with each trace message. With this approach, the tracing system doesn't need to spend time ordering the traces it collects, since the order is explicit in the clock data, but parallelization and concurrency can complicate understanding these relationships. And then one can also add the previous trace points that have already been executed to the metadata itself, to understand all the forks and joins. That also makes the tracing data available immediately, as soon as the workflow ends, because there's no need to spend time collating or establishing the order of causal relationships. But as you can imagine, the metadata will only grow in size as it follows the workflow, adding to the payload.
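To make the logical-clock idea a bit more concrete, here is a minimal Lamport-style counter; this is my own illustration rather than code from any particular tracing system. Each process bumps its clock at local trace points and jumps ahead of any clock value it receives with incoming metadata, so collected trace messages can be ordered without a synchronized wall clock.

```python
import threading


class LogicalClock:
    """A minimal Lamport-style clock for ordering trace points across hosts."""

    def __init__(self):
        self._time = 0
        self._lock = threading.Lock()

    def tick(self):
        # Called at a local trace point; the returned value is stamped onto the span.
        with self._lock:
            self._time += 1
            return self._time

    def observe(self, received_time):
        # Called when metadata arrives from another service: jump past the
        # sender's clock so a "receive" always orders after its "send".
        with self._lock:
            self._time = max(self._time, received_time) + 1
            return self._time
```

A trace point would then be stamped with a (trace ID, logical time) pair rather than relying on each host's wall clock.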
So it basically boils down to this: if you really care about the payload of requests, then a simple unique ID is your go-to, but at the expense of needing to infer relationships. You can then add a timestamp of sorts to help establish explicit causal relationships, but you're still susceptible to potential ordering issues if traces are lost. Now, you may add the previously executed trace points to avoid that data loss and to understand the forks and joins of a trace, while gaining immediate availability of trace data since causal relationships are already established, but then you suffer in payload size. And then there's also the fact that there are no open-source tracing systems that actually implement this last one.

End-to-end tracing will have an effect on runtime and storage overhead no matter what you choose. For instance, if Google were to trace all web searches, despite its intelligent tracing implementation it would impose a 1.5% throughput penalty and add 16% to response time. I won't go into very much detail, but there are essentially three basic approaches to sampling. First is head-based, which makes a random sampling decision at the start of a workflow and then follows it all the way through to completion. Next is tail-based, which makes the sampling decision at the end of the workflow, implying some caching going on; tail-based sampling needs to be a little more intelligent, but it's particularly useful for tracing anomalous behavior. And finally, unitary sampling, where the sampling decision is made at the trace point itself, which therefore prevents the construction of a full workflow. Head-based is the simplest and probably most ideal for performance profiling, and both head-based and unitary are most often seen in current tracing implementations; I'm not quite sure there's a tracing system that actually implements tail-based.

All right, which visualizations you choose to look at depends on what you're trying to figure out. Gantt charts are popular and definitely quite appealing, but they only show a single traced request; you've definitely seen this type before if you've looked at the network tab of your browser's dev tools. When trying to get a sense of where a system's bottlenecks are, a request flow graph, a.k.a. a directed acyclic graph, will show workflows as they are executed, and unlike Gantt charts, it can aggregate information from multiple requests of the same workflow. Another useful representation is a calling context tree, for visualizing multiple requests of different workflows; this reveals both valid and invalid paths that a request can take, and is best for building a general understanding of system behavior.

So the takeaway here is that there are a few things to consider when we trace a system. You should have an understanding of what you want to do, of what questions you're trying to answer with tracing. Certainly there will be other realizations and questions that come out of a traced system; for example, with Dapper, Google is able to audit systems for security, asserting that only authorized components are talking to sensitive services. But without understanding what you're trying to figure out, you may end up approaching your instrumentation incorrectly. The answer to this question will help identify the approach to causality, whether from the trigger point of view or from the submitter point of view. Then another important question: how much time do you want to put into instrumenting your system? Or can you even instrument all the parts? This will inform the approach you take to tracing, be it black box or not. If you can instrument all the things, or at least some of them, it then becomes a question of what data you should propagate through an entire flow. And finally, how much of the flows do you want to understand?
Do you want to understand all the requests? Then you should be prepared to take the performance penalty on the service itself, and then you can have fun storing all that data. Or is a percentage of the flows okay? If so, how do we approach sampling? And that goes back to the question of what we want to know; for understanding performance, head-based sampling is certainly fine. You also need to think about whether you want to capture the full workflow of a request or only focus on a subset of the system, and this will also inform your sampling approach, be it unitary or not.

So in terms of performance and understanding where bottlenecks are: you want to try to preserve trigger causality rather than submitter causality, as it shows the critical path to the bottleneck. Head-based sampling is fine, as we don't need intelligent sampling, and even with very low sample rates we can get a good idea of where our problem lies, since we essentially care about the 50th or 75th percentile. And finally, a request flow graph is ideal here, since we don't care about anomalous behavior; we want the big picture rather than looking into particular individual workflows.

Most often, once you are tracing a system, the problem will reveal itself, as will the solution, but not always. So I do have a few questions to ask yourself when figuring out how to improve a service's performance. First: are you making multiple requests to the same service? Round-trip network calls are expensive, and perhaps there's a way to send batch requests, or accept batch requests on your end. Perhaps your service doesn't need to be synchronous, or it blocks unnecessarily; for example, if you're some big social networking site, can you grab a user's profile photo, pull up their timeline, and fetch their messages all at the same time? Is the same data being repeatedly requested but not cached? Or maybe you're caching too much, or not the right data, or the expiration is too high or too low? What about your site's assets: could they be better ordered to improve loading time? Can you minimize the amount of inline scripts, or maybe make your scripts async? Are there a lot of distinct domain lookups that add time waiting on DNS responses? How about decreasing the number of files referenced, or minifying and compressing them? There's a bunch that can be done on the front-end side. And then finally, perhaps you can use chunked encoding when returning large amounts of data: are you otherwise able to have your servers produce elements of the response as they are needed, rather than trying to produce all elements as fast as possible?

All right, now probably the most interesting part: the current tracing systems that are out there. There is OpenTracing, an open standard for distributed tracing, allowing developers to instrument their code without vendor lock-in, and it does this by standardizing the trace and span API. One criticism I have of OpenTracing is that it doesn't prescribe a way to implement more intelligent sampling beyond a simple percentage and setting a priority. There's also a lack of standardization for how to track relationships, whether submitter or trigger; it's pretty much all submitter. And it's mainly just a standardization for managing the span itself. But mind you, it's a very young specification that's evolving and developing as we speak.
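As a rough idea of what that span API looks like, here's a small sketch of my own using the opentracing Python reference library; the operation names and tags are placeholders, and the global tracer is a no-op unless a concrete implementation (a Zipkin or Jaeger client, say) is plugged in:

```python
import opentracing

# The global tracer does nothing unless a concrete tracer implementation
# has been installed in its place.
tracer = opentracing.tracer

with tracer.start_span("fetch-playlist") as span:
    span.set_tag("component", "playlist-service")

    # A child span for a downstream call within the same trace.
    with tracer.start_span("query-db", child_of=span) as child:
        child.set_tag("db.type", "postgres")
        # ... do the actual work here ...
        child.log_kv({"event": "rows_fetched", "count": 42})
```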
There are a few popular self-hosted solutions that do support the OpenTracing specification. Probably the most widely used is Zipkin from Twitter, which has implementations in Java, Go, JavaScript, Ruby, and Scala. The architecture is basically that the instrumented app sends data out of band to a remote collector, which accepts a few different transport mechanisms, including HTTP, Kafka, and Scribe. For propagating data from a Python service, all of the current Python libraries only support HTTP; there's no RPC support. Zipkin does provide a nice Gantt chart, or waterfall chart, of individual traces, and you can view a tree of dependencies, but it's essentially only a tree with no information like latencies or status codes or anything else. Using py_zipkin, on which the other libraries are based, you define a transport mechanism, like I did here with an HTTP transport, which can simply be posting a request with the content of the trace; you could otherwise make one for Kafka or Scribe. Beyond that, it's just a simple context manager placed wherever you want to trace (a sketch of this is below, after the rundown of the other systems).

Jaeger is another self-hosted system that supports the OpenTracing specification; it comes from Uber. Rather than the application or client library reporting to a remote collector, it reports to a local agent via UDP, which then sends traces on to a collector. Unlike Zipkin, which supports Kafka and Elasticsearch and MySQL, Jaeger only supports Cassandra for its storage. The UI is very similar to Zipkin's, with really pretty waterfall graphs and a dependency tree, but again, nothing to help aggregate the performance information we're interested in. Their documentation is also horribly lacking, unfortunately, but they do have a pretty decent tutorial to walk through. Their client library for Python is a bit cringe-worthy. This is a trimmed example from their docs, just meant to give the gist: basically, you initialize a tracer that the OpenTracing Python library will use, and create a span and a child span with context managers. But their use of time.sleep at the end, for yielding to the IO loop, is a bit of a head-scratcher. Their docs also mention support for monkey-patching libraries like requests and redis and urllib2, so all I can say is: use at your own risk. After I presented this at PyCon a couple of months ago, the day after, they created an issue and added a comment in their code explaining the reasoning, but I still don't get why.

There are a couple of others I'm not familiar with, including Appdash and LightStep, and there are a few more that don't have Python client libraries yet. And in case you don't want to host your own system, there are a few services out there to help. There's Stackdriver Trace from Google, not to be confused with Stackdriver Logging. Unfortunately, Google has no Python or gRPC client libraries to instrument your app with, but they do have a REST and RPC interface if you feel so inclined. They do support Zipkin traces, though: you can set up a Google-flavored Zipkin server, either on their infrastructure or on yours, and have it forward traces to Stackdriver. And they actually make it pretty easy; I was able to spin up a Docker image and start collecting traces within a couple of minutes. Annoyingly, they have a storage limitation of 30 days, same as their logging.
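Coming back to the py_zipkin usage mentioned above, here is a minimal sketch, assuming a local Zipkin collector on its default port; the service and span names are placeholders rather than the actual slide code:

```python
import requests
from py_zipkin.zipkin import zipkin_span


def http_transport(encoded_span):
    # Ship the encoded span to a Zipkin collector over HTTP; a Kafka or
    # Scribe transport would be written the same way.
    requests.post(
        "http://localhost:9411/api/v1/spans",
        data=encoded_span,
        headers={"Content-Type": "application/x-thrift"},
    )


def get_album(album_id):
    # The context manager opens a span, times the block, and hands the
    # encoded result to the transport handler when it exits.
    with zipkin_span(
        service_name="album-service",
        span_name="get_album",
        transport_handler=http_transport,
        sample_rate=10.0,  # trace roughly 10% of calls
    ):
        ...  # the actual work being traced
```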
And my last criticism is their UI. They have simple plots of response time over the past few hours and a list of all traces automatically provided in the UI, but you have to manually create analysis reports for each time period you're interested in to get all the fancy distribution graphs; those aren't automatically generated, unfortunately.

And then finally, Amazon also has a tracing service available called X-Ray. I only set up their demo app, but it looks like they do not explicitly support Python, only Node, Java, and .NET apps. However, the Python SDK boto has support for sending traces to a local daemon, which then forwards them to the X-Ray service. What's nice about X-Ray, despite it being proprietary and not OpenTracing compliant, is that you're able to configure sampling rates for different URL routes of your application, based on either a fixed number of requests per second or a percentage of requests. However, it's not possible to configure these rules with boto. Almost redeeming, though, are their visualizations: alongside the typical waterfall chart, they also have a request flow graph where you can see average latencies, captured traces per minute, and requests broken down by response status. So basically, AWS X-Ray seems pretty cool and probably the most useful out of all of these, but it'll take some time to instrument your app, and it introduces vendor lock-in. And some honorable mentions that do application performance monitoring: I don't have personal experience with these, but Datadog and New Relic might be of interest to some of you.

All right, a quick opinionated wrap-up; I've got about a minute here. If you run microservices, you should be tracing them; otherwise, it's very difficult to understand an entire system's performance, anomalous behavior, and resource usage, among many other aspects. However, good luck. Whether you choose a self-hosted solution or a provided service, documentation is all-around lacking. Granted, it's a very young space, very much growing as the OpenTracing standard develops. And as I mentioned, language support isn't 100%, if it's even there at all, and there's a lack of configuration for relationship tracking, intelligent sampling, and available visualizations. But it is indeed an open spec that can be influenced, or you might feel so inclined to implement your own, to which I say: good luck. And finally, all of this, plus some pretty graphs and such, is up in a blog post of mine if you're interested. Thank you.