I'm sorry, this is a very technical talk in a very sleepy talk slot, so if you fall asleep in the middle, I will be super offended, but I won't call you on it too hard. So yeah, I'm Stella Cotton, if you don't know me. I'm an engineer at Heroku, and today we're going to talk about distributed tracing. Before we get started, a couple of housekeeping notes. I'll tweet out a link to my slides afterwards, so they'll be on the internet. There'll be some code samples and some links, so you'll be able to check those out if you want to take a closer look. And then I also have a favor. If you have seen me speak before, I have probably asked you this favor. So Ruby Karaoke last night, anybody go? Yeah, totally destroyed my voice. So I'm going to need to drink some water, but doing that in silence makes me really awkward, and I don't like that. So to fill the silence, I'm going to ask you to do something that my friend Lillie Shalene came up with, which is each time I take a drink of water, just start clapping and cheering. All right, so we're going to try this out. I'm going to do this, yeah? All right, so hopefully that happens a lot during this talk so that I won't lose my voice. So back to distributed tracing: I work on a tools team at Heroku, and we've been working on implementing distributed tracing for our internal services there. And normally I don't do this whole Brady Bunch team thing with the photos, but I just wanted to acknowledge that a lot of the trial and error and discovery that went into this talk was really a team effort across my entire team. So the basics of distributed tracing: who knows what distributed tracing is? Okay, okay, cool. Who has it at their company right now? Aw, I see you, Herokai. So if you don't actually know what it is or you're not really sure how you would implement it, you're in the right place. This is the right talk for you. It's basically just the ability to trace a request across distributed system boundaries.
And so you might think, Stella, we're Rails developers, this is not a distributed systems conference, this is not Scala or Strange Loop, you should go to those. But really there's this idea of a distributed system, which is just a collection of independent computers that appear to a user to act as a single coherent system. And so if a user loads your website and more than one service does some work to render that request, you actually have a distributed system. And technically, because somebody will definitely "well, actually" me on this, if you have a database and a Rails app, that's technically a distributed system. But to simplify things, I'm really going to talk more about just the application layer today. So a simple use case for distributed tracing: you run an e-commerce site, and you want users to be able to see all of their recent orders. Monolithic architecture: you've got one web process, or multiple web processes, but they're all running the same kind of code, and they're going to return information. Users and orders; users have many orders, and the orders have many items. Very simple Rails app. We authenticate our user, our controller grabs all of the orders and all of the items, and we render it on a page. Not a big deal. Single app, single process. Now we're going to add some more requirements. We've got a mobile app or two, so they need authentication. So suddenly it's just a little more complicated. There's a team dedicated to authentication, so now maybe we have an authentication service. And they don't care at all about orders, so it makes sense: they don't need to know about your stuff, and you don't need to know about theirs. So it could be a separate Rails app on the same server, or it could be on a different server altogether. It's going to keep getting more complicated. Now we want to show recommendations based on past purchases. So the team in charge of these recommendations is a bunch of data science folks; they only write Python, bunch of machine learning.
So naturally the answer: microservices, obviously. But I mean, seriously, it might be services. Your engineering team and your products grow. You don't have to jump on the microservices bandwagon to find yourself supporting multiple services. Maybe one is written in a different language. It might have its own infrastructure needs, like, for example, our recommendation engine. And as our web apps and our teams grow larger, these services that you maintain might begin to look less and less like a very consistent garden and more like a collection of different plants in different kinds of pots. And so where does distributed tracing fit into this big picture? So one day, e-commerce app, you go to your website, and it starts loading very, very slowly. And if you look in your application performance monitoring like New Relic or Skylight, or use a profiling tool, you can see the recommendation service is taking a really long time to load. But with these single-process monitoring tools, all of the services that you or your company own are going to look just like third-party API calls. You're getting as much information about their latency as you would about Stripe or GitHub or whoever you're calling out to. And so from that user's perspective, you know there are 500 extra milliseconds to get the recommendations, but you don't really know why without reaching out to the recommendations team, figuring out what kind of profiling tools they use for Python, who knows, and digging into their services. And it just gets more and more complicated as your system gets more and more complicated. At the end of the day, you cannot tell a coherent macro story about your application by monitoring these individual processes. And if you have ever done any performance work, people are very bad at guessing where bottlenecks are. So what can we do to increase our visibility into the system and tell that macro-level story? Distributed tracing, that can help.
It's a way of commoditizing knowledge. Adrian Cole, who's one of the Zipkin maintainers, talks about how in increasingly complex systems, you want to give everyone tools to understand the whole system without having to rely on these experts. So cool, you're on board, I convinced you, you need this, or it at least makes sense. But what might actually be stopping you from implementing this at your company? A few different things make it tough to go from theory to practice with distributed tracing. First and foremost is that it's kind of outside the Ruby wheelhouse. Ruby is not represented in the ecosystem at large. Most people are working in Go or Java or Python. You're not going to find a lot of sample apps or implementations that are written in Ruby. There's also a lot of domain-specific vocabulary that goes into distributed tracing, so reading through the docs can feel pretty slow. And finally, the most difficult hurdle of all is that the ecosystem is extremely fractured. It's changing constantly, because it's about tracing everything everywhere, across frameworks, across languages, and it needs to support everything. So navigating the solutions that are out there and figuring out which ones are right for you is not a trivial task. So we're going to work on how to get past some of these hurdles today. We'll start by talking about the theory, which will help you get comfortable with the fundamentals, and then we'll cover a checklist for evaluating distributed tracing systems. All right. So let's start with the basics: black box tracing. The idea of a black box is that you don't know about, and can't change, anything inside your applications. So an example of black box tracing would be capturing and logging all of the traffic that comes in and out at a lower level in your application, like at your TCP layer. All of that data goes into a single log and gets aggregated.
And then with the power of statistics, you just kind of get to magically understand the behavior of your system based on timestamps. But I'm not going to talk a lot about black box tracing today, because for us at Heroku, it was not a great fit, and it's not a great fit for a lot of companies, for a couple of reasons. One is that you need a lot of data to get accuracy based on statistical inference. And because it uses statistical analysis, it can have some delays returning results. But the biggest problem is that in an event-driven system, like Sidekiq or a multi-threaded system, you can't guarantee causality. And what does that mean exactly? So this is a sort of arbitrary code example, but it helps to show that if you have service one kick off an async job and then immediately synchronously call out to service two, and there's no delay in your queue, your timestamps are going to correlate correctly: service one, async job, awesome. But if you start getting queuing delays and latency, then the timestamps might actually make it consistently look like your second service is making that call. So white box tracing is a tool that people use to help get around that problem. It assumes that you have an understanding of the system and that you can actually change your system. So how do we understand the path that a request makes through our system? We explicitly include information about where it came from, using something called metadata propagation, which is a type of white box tracing. That's just a fancy way of saying that we can change our Rails apps, or any kind of app, to explicitly pass along information so that you have an explicit trail of how things go. And finally, another benefit of white box tracing is real-time analysis: it can be almost real time to get results. A very short history of metadata propagation: the example that everyone talks about when they talk about metadata propagation is Dapper, and the open source library that it inspired, called Zipkin.
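To make that causality problem concrete, here's a tiny sketch. This is my own illustration, not code from any tracing library; the services, event names, and timestamps are all invented:

```ruby
# Event log from two services: service one enqueues an async job, then
# immediately makes a synchronous call to service two. Timestamps are
# arbitrary units.
events = [
  { service: "service-1", action: "enqueue async job", at: 100 },
  { service: "service-1", action: "call service two",  at: 101 },
  { service: "service-2", action: "handle request",    at: 102 },
  # With no queue delay, the job would run around t = 103 and the order
  # would look right. With queuing delay, it runs much later:
  { service: "async-job", action: "run",               at: 250 },
]

# Sorting by timestamp alone puts the async job *after* service two's
# work, so the log is equally consistent with service two (not service
# one) having triggered the job. Timestamps can't disambiguate.
ordered = events.sort_by { |e| e[:at] }.map { |e| e[:service] }
# ordered == ["service-1", "service-1", "service-2", "async-job"]
```

That ambiguity is exactly what the explicit parent IDs of white box tracing resolve.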
So the Dapper paper was published by Google in 2010, but Dapper's not actually the first distributed systems debugging tool to be built. So why is Dapper so influential? Well, honestly, it's because in contrast to all of these other systems that came before it, whose papers were published pretty early in their development, Google published this paper after Dapper had been running in production at Google scale for many, many years. And so they're not only able to say that it's viable at a scale, like Google scale, but also that it was valuable. And so next comes Zipkin. That's a project that was started at Twitter during their very first hack week, and their goal was to implement Dapper. They open sourced it in 2012, and it's currently maintained by Adrian Cole, who is not actually at Twitter anymore. He's at Pivotal, and he spends most of his time working in the distributed tracing ecosystem. So from here on out, when I use the term distributed tracing, I'm going to mean Dapper- and Zipkin-like systems, because "white box metadata propagation distributed tracing systems" is not quite as catchy as Zipkin. And if you want to read more about things beyond just metadata propagation, there's a pretty cool paper that gives an overview of tracing distributed systems beyond this. So how do we actually do this? I'm going to walk us through a few main components that power most systems of this kind. First is the tracer. That's the instrumentation that you actually install in your application itself. There's the transport component, which takes the data that the tracers collect and sends it over to the distributed tracing collector. That's a separate app that runs, processes the data, and stores it in the storage component. And then finally, there's a UI component, typically running inside of that, that allows you to view your tracing data. So we'll start with the level closest to your application itself: the tracer.
It's how you trace individual requests, and it lives inside your application. In the Ruby world, it's installed as a gem, just like any other performance monitoring agent that would monitor a single process. And a tracer's job is to record data from each system so that we can tell a full story about your request. You can think of the entire story of a single request lifecycle as a trace: this whole system here, captured in a single trace. Next vocab word: span. Within a single trace are many spans. A span is a chapter in that story. So in this case, our e-commerce app calling out to the order service and getting a response back, that's a single span. In fact, any discrete piece of work can be captured by a span; it doesn't have to be a network request. So if we want to start mapping out the system, what kind of information are we going to start passing along? You could start with just a request ID, so that you know every single path this request took: you query your logs and you can see that it's all one request. But you're going to have the same issue we had with black box tracing. You can't guarantee causality just based on the timestamps. So you need to explicitly create a relationship between each of these components, and a really good way to do this is with a parent-child relationship. The first request in the system doesn't have a parent, because somebody just clicked a button loading a website, so we know that's at the top of the tree. And then when your auth process talks to the e-commerce process, it's going to modify the request headers to pass along a randomly generated ID as a parent ID. Here it's set to one, but it could really be anything. And it keeps going on and on with each request. So a trace is ultimately made up of many of these parent-child relationships, and it forms what's called a directed acyclic graph.
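Here's a minimal sketch of that ID propagation. The header names and helper functions are mine, purely for illustration; real Zipkin uses its own B3 header names, but the shape is the same: one trace ID shared by everything, plus a fresh span ID per unit of work that records its parent's span ID.

```ruby
require "securerandom"

# The first request in the system: it has a trace ID and a span ID,
# but no parent, because the user just clicked a button.
def start_trace
  { trace_id: SecureRandom.hex(8), span_id: SecureRandom.hex(8), parent_id: nil }
end

# When one service calls another, it passes the shared trace ID along
# and names its own span as the parent of the downstream work.
def child_headers(current_span)
  {
    "X-Trace-Id"  => current_span[:trace_id],
    "X-Parent-Id" => current_span[:span_id],
    "X-Span-Id"   => SecureRandom.hex(8)
  }
end

root    = start_trace            # top of the tree, no parent
headers = child_headers(root)    # e.g. auth process -> e-commerce process
```

Because every span records its parent, the collector can later reassemble these links into the directed acyclic graph without relying on timestamps.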
And by tying all of these things together, we're able to understand this not just as an image, but as a data structure. We'll actually talk in a few minutes about how the tracer accomplishes that in our code. So we've got our relationships. If that's all we wanted to know, we could stop there. But that's not really going to help us in the long term with debugging. Ultimately, we want to know more about timing information, and we can use annotations to build a richer set of information around these requests. By explicitly annotating with timestamps when each of these things occurs in the cycle, we can begin to understand latency. And hopefully you're not seeing a second of latency between every event, and these would definitely not be human-readable timestamps, but this is just an example. So zoom in on our auth process and how it talks to the e-commerce process. In addition to passing along the trace ID, parent ID, and child span ID, we'll also annotate the request with a tag and a timestamp. And by having our auth app annotate that it's sending the request and our e-commerce app annotate that it received the request, this will actually give you the network latency between the two. So if you see a lot of requests queuing up, you would see that time go up. On the other hand, you can compare the two timestamps between the server receiving the request and the server sending back the information, and if your app is getting very slow, you'll see latency increase between those two things. And then finally, you're able to close out that full cycle by indicating that the client has received the final response. Let's talk about what happens to that data. Each process is going to send information via the transport layer to a separate application that's going to aggregate that data and do a bunch of stuff to it. So how does that process not add latency to your system?
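The arithmetic on those four annotations is worth spelling out. Zipkin's conventional names for them are "cs" (client send), "sr" (server receive), "ss" (server send), and "cr" (client receive); the timestamp values below are invented for illustration:

```ruby
# One client/server round trip between the auth app and the e-commerce
# app, annotated at all four points. Units: microseconds, illustrative.
annotations = {
  "cs" => 0,        # auth app sends the request
  "sr" => 150,      # e-commerce app receives it
  "ss" => 10_150,   # e-commerce app sends the response
  "cr" => 10_320    # auth app receives the response
}

# Time spent on the wire (or queued) in each direction: if requests
# start queuing up, these gaps are what grows.
network_time = (annotations["sr"] - annotations["cs"]) +
               (annotations["cr"] - annotations["ss"])

# Time spent inside the e-commerce service itself: if the app gets
# slow, this is the number that grows.
server_time = annotations["ss"] - annotations["sr"]

# The full span, closing out the cycle at the client.
total_time = annotations["cr"] - annotations["cs"]
```

With these values, network_time is 320, server_time is 10,000, and total_time is 10,320, so you can tell at a glance whether latency lives in the network or in the service.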
First, it's only gonna propagate those IDs in-band, by adding information to your headers. Then it's gonna gather that data and report it out of band to a collector, and that's what actually does the processing and the storing. For example, Zipkin uses SuckerPunch to make a threaded async call out to the Zipkin server. And this is gonna be similar to things that you would see in metrics tools like Librato, or any of your logging and metrics systems that use threads. So our data is collected by the tracer, transported via the transport layer, collected, and finally ready to be viewed in the UI. Now, this graph that we're viewing here is a good way to understand how the request travels, but it's not actually good at helping us understand latency, or even the relationship between calls within systems. So we're gonna use Gantt charts, or swimlanes, instead. The opentracing.io documentation has a request tree similar to ours. Looking at it in this format, you'll be able to see each of the different services in the same way that we did before, but now we're able to better visualize how much time is spent in each sub-request, and how much time that takes relative to the other requests. You can also, like I mentioned earlier, instrument and visualize internal traces that are happening inside a service, not just service-to-service communication. Here you can see the billing service is being blocked by the authorization service. You can also see that we have a threaded or parallel job execution inside the resource allocation service. And if there started to be a widening gap between these two adjacent services, it could mean that there's network request queuing. I still can't help myself, I have to do a little dance when I say that. All right, we know what we want, so how are we gonna get it done? At the minimum, we wanna record information when a request comes in and when a request goes out. How do we do that programmatically in Ruby?
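The out-of-band reporting idea can be sketched in a few lines. This is a toy, not Zipkin's actual reporter (which, as mentioned, uses SuckerPunch): the hot path only pushes onto an in-memory queue, and a background thread drains it and ships spans to the collector.

```ruby
# A minimal out-of-band reporter: report() is called on the request
# path and returns immediately; a worker thread does the slow part.
class AsyncReporter
  def initialize(&sender)
    @queue  = Queue.new
    @sender = sender
    @worker = Thread.new do
      # Runs outside the request cycle: pop spans and ship them until
      # we see the nil stop signal.
      while (span = @queue.pop)
        @sender.call(span)   # e.g. an HTTP POST to the collector
      end
    end
  end

  # Hot path: just enqueue. No network call, no added request latency.
  def report(span)
    @queue << span
  end

  # Flush remaining spans and stop the worker.
  def shutdown
    @queue << nil
    @worker.join
  end
end
```

Usage would look like `reporter = AsyncReporter.new { |span| post_to_collector(span) }` and then `reporter.report(span)` from the middleware, where `post_to_collector` stands in for whatever transport you've chosen.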
Usually with the power of Rack middleware. If you're running a Ruby app, the odds are that you're also running a Rack app. Rack is a common interface for servers and applications to talk to each other; Sinatra and Rails both use it. It serves as a single entry and exit point for client requests coming into the system. The powerful thing about Rack is that it's very easy to add middleware that can sit between your server and your application and allow you to customize these requests. A basic Rack app, if you're not familiar with it, is a Ruby object that responds to call, takes one argument, and in the end returns status, headers, body. That's the basics of a Rack app, and under the hood, Rails and Sinatra are doing this. And the middleware format is a very similar structure. It's going to accept an app, which could be your app itself or another piece of middleware, respond to call, and call app.call at the end so the request keeps flowing down the chain, and finally return the response. So if we wanted to do some tracing inside of our middleware, what might that method look like? Like we talked about earlier, we're gonna want to start a new span on every request. It's gonna record that it received the request with a "server received" annotation, like we talked about earlier. It's gonna yield to our Rack app to make sure that it executes the next step in the chain and actually runs your code. And then it records that the server has sent the response back to the client. This is just pseudocode, not an actual running tracer, but Zipkin has a really great implementation that you can check out online. So then we could just tell our application to use our middleware to instrument our requests. And you're never gonna want to sample every single request that comes in, because that is crazy and overkill when you have a lot of traffic. So tracing solutions will typically ask you to configure a sample rate. So we've got our requests coming in.
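Here's what that pseudocode could look like as a runnable sketch. This is hand-rolled for illustration, not Zipkin's middleware; the header name and log format are mine, and a real tracer would report structured spans out of band instead of writing a log line.

```ruby
require "securerandom"

# Tracing middleware in the standard Rack shape: wrap an app, respond
# to call, return [status, headers, body].
class TracingMiddleware
  def initialize(app, sample_rate: 1.0, recorder: $stdout)
    @app         = app
    @sample_rate = sample_rate
    @recorder    = recorder
  end

  def call(env)
    # Only trace a configurable fraction of requests.
    return @app.call(env) unless rand < @sample_rate

    # Reuse an incoming trace ID, or start a new trace at the root.
    trace_id = env["HTTP_X_TRACE_ID"] || SecureRandom.hex(8)

    server_received = Time.now               # the "sr" annotation
    status, headers, body = @app.call(env)   # run the next step in the chain
    server_sent = Time.now                   # the "ss" annotation

    @recorder.puts(
      "trace=#{trace_id} duration=#{server_sent - server_received} status=#{status}"
    )
    [status, headers, body]
  end
end
```

Wiring it up is the usual `use TracingMiddleware, sample_rate: 0.1` in your Rack config.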
But in order to generate that big relationship tree that we saw earlier, we're also gonna need to continue to record information when requests leave our system. These can be requests to external APIs like Stripe, GitHub, whatever. But if you control the next service that it's talking to, you can keep building up this chain, and we can do that with more middleware. If you use an HTTP client that supports middleware, like Faraday or Excon, you can easily incorporate tracing into the client. I'll use Faraday as an example because it has a pretty similar pattern to Rack. So we match the method signature just like we did with Rack, and honestly Faraday's is very similar to Rack's. If you're using something like Excon, it's gonna look a little bit different, but this is just an example. So we pass in our HTTP client app, we do some tracing, and we keep calling down the chain; it's pretty similar. But the tracing itself is gonna be a little bit different, because we actually need to manipulate the headers to pass along some tracing information. That way, if we're calling out to an external service like Stripe, they're gonna completely ignore these headers, because they don't know what they are. But if you're calling another service that's in your purview, you'll be able to see further down the chain. So each of these colors represents an instrumented application. We want to record that we're starting a client request, and that we've received the client response. Add in the middleware just like we did with Rack; it's pretty easy. With some HTTP clients you can even do it programmatically, automatically for all your requests. So we've got some of the basics for how distributed tracing is implemented. Let's talk about how to even choose, in this ecosystem, which system is right for you. So the first question is, how are you gonna get this working? I'm gonna give a caveat that this ecosystem is ever-changing.
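A sketch of the outgoing side, in the Faraday-style shape (wrap an app, mutate the request env, call down the chain). This is standalone illustration code, not a real Faraday::Middleware subclass, and the `X-*` header names are mine; Zipkin's actual convention is the `X-B3-*` family of headers.

```ruby
require "securerandom"

# Client-side tracing middleware: before the request goes out, stamp
# the tracing headers onto it so the downstream service can continue
# the trace. Services that don't know these headers just ignore them.
class ClientTracingMiddleware
  def initialize(app, current_span)
    @app          = app
    @current_span = current_span
  end

  def call(request_env)
    request_env[:request_headers] ||= {}
    # Same trace ID all the way down; our span becomes the parent of
    # whatever work the downstream service does.
    request_env[:request_headers]["X-Trace-Id"]  = @current_span[:trace_id]
    request_env[:request_headers]["X-Parent-Id"] = @current_span[:span_id]
    request_env[:request_headers]["X-Span-Id"]   = SecureRandom.hex(8)
    @app.call(request_env)   # keep calling down the chain
  end
end
```

If the next hop is your own instrumented service, its Rack middleware reads these headers back out and the tree keeps growing; if it's Stripe or GitHub, the headers are harmless noise.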
So this information could actually be incomplete right now, and it could be obsolete, especially if you are watching this at home on the web. But let's talk about whether or not you should buy a system. Yes, if the math works out for you; it's kind of hard for me to really say whether you should do that. If your resourcing is limited, and you can find a solution that works for you, and it's not too expensive, probably, unless you're running a super complex system. LightStep and TraceView are examples that offer Ruby support, and your APM provider might actually have it too. Adopting an open source solution is another option; for us, the paid solutions just didn't work. So if you have people on your team who are comfortable with the underlying framework, and you have some capacity for managing infrastructure, then this could really work for you. For us, a small team of just four engineers, we got Zipkin up and running in a couple of months, while also doing a million other things, partially because we were able to leverage Heroku to make the infrastructure components pretty easy. And if you wanna use a fully open source solution with Ruby, Zipkin is pretty much your only option, as far as I know. So you may have heard of OpenTracing, and you might be like, Stella, what about this OpenTracing thing? That seems cool. A common misunderstanding: OpenTracing is not actually a tracing implementation, it is an API. Its job is to standardize the instrumentation, like we kinda walked through before, so that all of the tracing providers that conform to the API are interchangeable on your app's side. So if you wanna switch from an open source provider to a paid provider, or vice versa, you don't need to re-instrument each and every service that you maintain, because in theory they're all being good citizens and conforming to this one consistent API. So where is OpenTracing at today?
They did publish Ruby API guidelines back in January, but only LightStep, which is a product in private beta, has actually implemented a tracer that conforms to that API. So existing implementations like Zipkin are gonna need a bridge between the tracing implementation they have today and the OpenTracing API. And the other thing that is still just not clear is interoperability. So for example, say you have a Ruby app on the OpenTracing API, everything's great, but your paid provider doesn't support Go: you can't necessarily use two providers that both use OpenTracing and still send them to the same collection system. So it's really only interchangeable at that app level. Another thing to keep in mind is that for both open source and hosted solutions, "Ruby support" means a really wide range of things. At the minimum, it means that you can start and end a trace in your Ruby app, which is good. But you might still have to write all of your own Rack middleware and your HTTP library middleware. It's not a deal breaker, and we ended up having to do that for Excon with Zipkin, but it may be an engineering time commitment that you are not prepared to make. And then unfortunately, because this is tracing everywhere, you're gonna need to rinse and repeat for every language that your company supports. So you're gonna have to walk through all of these thoughts and these guidelines for Go or for JavaScript or for any other language. So some big companies find that with the custom nature of their infrastructure, they need to build out some or all of the elements in-house. Etsy, and obviously Google, are running fully custom infrastructure. But other companies are building custom components that tap into open source solutions. Pinterest's, for example, is just an open source add-on to Zipkin, similar to Yelp's.
So if you're really curious about what other companies are doing, large and small, Jonathan Mace at Brown University published a snapshot of 26 companies and what they're doing. It is already out of date, like one of those things is already wrong, even though it was literally published a month ago. So 15 are using Zipkin, and nine are using custom internal solutions. But yeah, most people are actually using Zipkin. Another component of this is: what are you running in-house? What does your team, or your ops team, want to run in-house, and are there any restrictions? There's this dependency matrix of the tracer and the transport layer, which need to be compatible with every one of your services. So JavaScript, Go, Ruby: both the tracer and the transport layer need to be compatible across the board. For example, for us, HTTP and JSON is totally fine as a transport layer; we literally just call out with web requests to our Zipkin collector. But if you have a ton of data and you need to use something like Kafka, you might think, that's cool, it's totally supported. But then you look at the documentation, and it's gonna say Ruby, and you're gonna be like, wait, no: if I dig four layers deep into this documentation, it's only JRuby. So that's a total gotcha, and so for each of these, you really should just build a spreadsheet, because it's pretty challenging to make sure you're covering everything. The collection and the storage layers aren't really tied to the services that you run, but they might not be the kind of apps that you're used to running. For example, Zipkin is a Java app, which is totally different from the apps that my team runs. Another thing you need to figure out is whether or not you need to run a separate agent on the host machine itself.
So for some solutions, and this is why we had to exclude a lot of them, you actually need to install an agent on each host for each service that you run. And because we run Heroku on Heroku where we can, we can't really do that, because we can't just give root-level privileges to an agent that's running on a dyno. Another thing to consider is authentication and authorization: who can see and submit data to your tracing system? For us, Zipkin was missing both of those components, and it makes sense, because it really needs to be everything for everybody, and adding authentication and authorization on top of that for every single company that uses the open source library is not really reasonable. So you can run it inside of a VPN without authentication. The other option is using a reverse proxy, which is what we ended up doing. We used two buildpacks: apt, and then the runit buildpack. apt is just a package manager for Linux, and with it we're able to get nginx onto our Heroku slug, which is just a bundle of your code and its dependencies. So we can download and install a specific version of nginx to run as a reverse proxy. runit allows us to run our Zipkin application and nginx alongside each other on the same host. And we didn't want anybody on the internet to just be able to send data to Zipkin; if you just suddenly started sending data to our Zipkin instance, that would be pretty weird. So we wanted to make sure that only Heroku applications were interacting with it, and we decided to use basic authorization for that. We used htpasswd to set some team-based credentials in a flat file, because we only had about 25 different basic auth configurations that we thought we'd be using. And it ends up looking like this from an architecture diagram standpoint: the client makes a request, and nginx is going to intercept it and check it against the basic auth credentials to make sure it's valid.
And then if it is, it's forwarded along to Zipkin; otherwise it returns an error. Adding authentication on the client side itself was as easy as going back to that Rack middleware file and updating our host name with the basic auth credentials. So that was a really good solution for us. We also didn't want any of y'all to be able to see our Zipkin stuff on the internet, and right now, if you just run a Zipkin instance, there is nothing to keep anybody from seeing it, because there's no authorization. So we use Bitly's OAuth2 proxy, which is super awesome. It allows us to restrict access to only people with Heroku.com email addresses. So if you're on a browser and you try to access our Zipkin instance, we're going to check to see if you're authorized; otherwise this OAuth2 proxy is going to handle the full authentication. It's configurable with different load balancers slash reverse proxies and OAuth providers, so it's actually really cool if you need to run any kind of OAuth in front of a process. But even if you're going the hosted route and you don't need to handle any of this infrastructure, you're going to need to ask how you're going to get access to the people who need it. Because you don't want to be the team who has to manage this handoff of, like, sign-ons and sign-ins and oh, you need to email this person. You don't want to manage all that. So just make sure it's clear with your hosted provider how you're going to manage access security. If you have sensitive data in your systems, which a lot of people do, there are two places specifically where we had to really keep an eye out for security issues. One is custom instrumentation. For example, my team, the tools team, added some custom internal tracing of our own services, using prepend to trace all of our Postgres calls. And like we did with the middleware earlier, we're wrapping that behavior with tracing.
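Here's a sketch of that prepend pattern. The Database class, method names, and the literal-scrubbing regexes are all mine for illustration; the real instrumentation wraps our actual Postgres adapter. The scrubbing shows one precaution you can take with the SQL before it leaves the app:

```ruby
# Module#prepend lets us wrap an existing method with tracing without
# editing the class itself: our execute runs first, calls super, and
# records timing around it.
module QueryTracing
  def execute(sql)
    started = Time.now
    result  = super
    # Record the *shape* of the query, not its values: raw SQL can
    # carry PII, so replace string and numeric literals before the
    # query text goes anywhere near the tracing system.
    scrubbed = sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")
    traced_queries << { sql: scrubbed, duration: Time.now - started }
    result
  end
end

# Stand-in for a real database adapter.
class Database
  def traced_queries
    @traced_queries ||= []
  end

  def execute(sql)
    "result of: #{sql}"   # pretend this hits Postgres
  end

  prepend QueryTracing
end
```

So `Database.new.execute("... WHERE email = 'a@b.com'")` still returns its normal result, but the recorded span only ever sees `... WHERE email = ?`.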
But the problem here is, if you're calling sql.to_s and that SQL statement has any kind of private data, you want to make sure that you're not just storing that blindly into your system, especially if you have PII or any kind of security-compliance information that you're storing. And the second thing is that you need to talk through, before it happens, what to do when your data leaks. For us, running our own system is a benefit, because if we accidentally leak data into the tracing system, it's easier for us to validate that we've wiped that data when we own it than it would be to coordinate with a third-party provider. It doesn't mean you shouldn't use a third-party solution, but you should ask them ahead of time: what do you do when data leaks? What's the turnaround? How can we verify it? You don't want to figure that out when you're in the middle of a crisis. The last thing to consider is the people part. Is everybody on board for this? The nature of distributed tracing is that the work is distributed. Your job is probably not going to end when you get the service up and running; you're actually going to need to instrument apps. And there's a lot of cognitive load, as you can see from the 30 minutes we've talked about this, in understanding how distributed tracing works. So set yourself up for success ahead of time by getting it on teams' roadmaps, if you can. Otherwise, start opening PRs; that's the other option. Even then, you're probably gonna need to talk through what it is and why you're adding it, but it's a lot easier when you can show people code and how it actually interacts with their system. So here's the full checklist for evaluation. We'll cover one last thing before I let y'all go. If you're thinking, this is so much information, where do I even go next from here?
My advice is, if you have some free time at work, like 20% time or a hack week, start by trying to get docker-zipkin up and running. Even if you don't plan to use Zipkin at all, it includes a test version of Cassandra built in, so you just need to get the Java app itself up and running, and you don't have to worry about all of these different components right off the bat. If you're just instrumenting Ruby apps, then Zipkin is compatible. You can even deploy this onto Heroku. So once you're able to get this deployed and the UI loaded, just instrument one single app. Even if the only thing that app does is make a third-party Stripe call, it'll help you turn some of these really abstract concepts into concrete ones. So that's all I've got today, folks. If you have any questions, I'm actually heading straight to the Heroku booth after this, in the big expo hall. So stop by, I'll be there for about an hour; come say hi, ask me any questions, talk about Heroku, or get some stickers. See ya.