Hi. Thanks very much for joining the distributed tracing and deploys talk. This is the session where we'll be talking about how to use distributed traces to ship quickly and confidently. This is a really short presentation. It's going to be mostly a demo, but we have a couple of slides that are just going to set things up and introduce the concepts. The session is being recorded, and at the very end there will also be some time to answer questions.

Just by way of introduction, my name is Clay Smith. I'm a partner engineer at LightStep, and I've been working with continuous integration and continuous deployment for a long time. It's a topic I'm passionate about, and I'm excited to talk to you today about how distributed traces fit into deploying software.

To introduce LightStep, the company I work for: one slide we like to show people a lot has two images contrasting the popular perception of modern software with how it actually works. For practitioners, the people who are deep into actually building complicated systems, we know from experience that, unfortunately, the reality is much closer to the image on the right. LightStep in particular is a software company that aims to help SREs and platform engineers understand and optimize the software running in that kind of environment. And ideally, in doing so, we can help people build software more quickly and more reliably.

With that, observability obviously comes up, and there's been a lot of marketing around observability and what it means. There are a lot of different ways to look at it, including the Wikipedia definition, which comes from control theory. But a really simple and straightforward way to think about it is: what caused that change? That question becomes especially interesting with service deployments, making changes in code and deploying them, because a deploy is an intentional change that hopefully goes well, but we know from experience that's not always the case. With complicated systems, and distributed systems in particular, getting to the root cause after a deploy goes out, figuring out what broke or what's causing that regression, can be really tricky. The demo and the talk today are going to go deeper into that use case in particular.

We know from surveys and from talking to a lot of people that the story of answering what caused that change, after you send a code change out to production, isn't always a good one. Polls and surveys have shown that, more often than not, people fear this process, and it's often slow and sometimes fragile. Beyond that, particularly with distributed systems and many teams owning different services, the coordination it takes in a large organization just to figure out who broke what can be really complicated by itself. So we think there's a better way to do this, and that's the purpose of this quick session.

So we've set up a demo environment, which we'll go into shortly, as a project in GitLab. It's a Docker-based microservice environment in a mono repo. We set up a GitLab CI/CD pipeline to deploy it to a Kubernetes cluster, in this case running in Google Cloud using GitLab's managed Kubernetes integration. And then we've connected that to LightStep to actually understand what's happening during each deployment and how to make sense of those changes.
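To make the instrumentation side of that setup concrete, here is a rough sketch of what basic tracer setup might look like in one of the Python services. It's a minimal illustration rather than the actual demo code; the service name and access token are placeholders, and it assumes the LightStep Python tracer together with the OpenTracing API.

```python
import lightstep
import opentracing

# Minimal tracer setup for one service (names and token are placeholders).
# The component name is what shows up in the service directory; the access
# token ties the telemetry to your LightStep project.
tracer = lightstep.Tracer(
    component_name='inventoryservice',
    access_token='<your-lightstep-access-token>')
opentracing.set_global_tracer(tracer)

# Wrap a unit of work in a span so each request emits trace data.
with tracer.start_span('update-inventory') as span:
    span.set_tag('span.kind', 'server')
    # ... handle the request ...

tracer.flush()  # send any buffered spans before the process exits
```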
The idea is that when you're using all of these pieces together, you can get back to the code and fix problems faster, and we're going to show that shortly. So I'm going to switch over to a demo to show this in action. I'll stop my screen share right now and we'll switch windows.

All right. We're back in the demo environment, and we've pulled up the Hipster Shop repository here in GitLab. This is the microservice repository I was talking about earlier; it's based on a pretty popular demo environment from Google. As we can see in the service diagram, we've got about eight or nine different services powering a customer experience, in this case an e-commerce app that sells products for hipsters.

There's nothing too unusual about this, but there are two things we've done in this environment that I need to call out before we jump into the observability piece. The first is that all of these services have been instrumented and emit telemetry, in particular traces, and we're going to use those traces for root cause analysis and to understand what changed during deploys. The other thing, in addition to generating traces, is that for each of these services we've made a change in the pipeline. Here is a configuration file for Skaffold, which is a tool that manages Kubernetes deploys, and what we've done is inject the version from a GitLab environment variable into the instrumentation code. So when these services are running in production and emitting telemetry that gets collected and analyzed by LightStep, we know exactly what version was running when we observed those changes. This is somewhat of a technical note, but I need to call it out because it enables the full end-to-end workflow: we're able to go from code to the telemetry data we see in LightStep because we've added this tag. There are different ways to do this, and there's more documentation on the LightStep site, but this is the important piece that ties it all together.

Through the GitLab continuous deployment pipeline, we've deployed to our Kubernetes cluster running in Google Cloud. Here it is: you can buy various hipster products. In this case, let's say we've started to get some customer complaints the morning after we know we've made some changes. As I said earlier, the key question is: what caused that change? What's actually causing the slowness behind the customer complaints?

Historically there have been various ways to answer that, but I'm going to show what it looks like in LightStep, in particular on the service directory page. This will look familiar if you've used metrics dashboards before, but there are a couple of important differences I want to highlight. First of all, you see all of your services, so far so good; these are all the microservices in the environment. In the middle of the page we see key operations on each service, with latency, errors, and operations per second. Because the complaint was around slowness, the idea is to look at latency: what got slow, and when? And because we've instrumented the code in this environment to include the version number, these version markers appear, each one marking the first time LightStep saw that version tag in the production environment.
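As a rough sketch of how that version wiring can work inside a service, the snippet below reads a version value from an environment variable that the deploy pipeline injects (Skaffold passing through a GitLab CI variable, in this demo) and attaches it to every span. The variable name, tag key, and helper function are illustrative assumptions, not the demo's exact code.

```python
import os
import opentracing

# The CI/CD pipeline injects the running version into the container
# environment (for example, from a GitLab variable such as
# CI_COMMIT_SHORT_SHA). The variable name and default here are illustrative.
SERVICE_VERSION = os.environ.get('SERVICE_VERSION', 'dev')

def start_versioned_span(operation_name):
    # Tag every span with the version that produced it, so the tracing
    # backend can break latency down by version and mark when a new
    # version first shows up in production.
    span = opentracing.global_tracer().start_span(operation_name)
    span.set_tag('service.version', SERVICE_VERSION)
    return span

# Example usage inside a request handler:
with start_versioned_span('update-inventory') as span:
    pass  # ... handle the request ...
```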
What that marker does is make it really easy to see the before and after of a change for that service. In this case, something already looks super suspicious: in the inventory service, in the update inventory operation, after the deploy of version 1.14.187, latency jumped from a few hundred milliseconds to over a second. If we click on that, we can compare it to one hour prior and get a before-and-after picture of what happened during that change. What we just did in a single click is set up a baseline window, right here in blue, and a regression window, right here in yellow. We want to understand why it's so much slower in yellow versus blue, and we're going to go through this page and drill down.

Immediately we see the histogram of latency: in yellow, on the right-hand side, it's a lot slower, and in blue it's a lot faster. So what actually changed? Because we're collecting tags with every request through these services, we're able to bubble up almost immediately what's in the baseline versus what's in the regression. A couple of things stick out. One, the version is different: in the regression, version 1.14.187, it got slower. But also, and this is interesting, there's a tag that says large_batch equals true. That indicates some code change is behind this rather than CPU, memory, or other resource exhaustion. By clicking on that, we narrow our field of focus.

We can then see the service diagram with all the services, and something immediately stands out: in the inventory service, when it writes the cache, we see in yellow that there's a lot of latency contributed there. We can also see the upstream services that are affected, the web app, the Android app, the iOS app, so we know it's having customer-facing impact too. If we click on the inventory service, we can narrow down into the write cache operation and actually see correlated logs, again with the baseline in blue and the regression in yellow. Right here we see that in the regression it's writing between 1,300 and 13,000 items to the cache, and that's not happening in the baseline window.

So we've gone all the way from seeing the latency spike, to understanding the service and the version, and finally down to the operation. In a few clicks we've gone from "things seem slow, customers are complaining" to the individual correlated log lines indicating that this is a code change related to the cache write operation. That's more than enough information to open a JIRA ticket or a bug report, roll back the change, go back to your GitLab project, see what happened in that version, and make the appropriate fix.

That's the entire workflow. At this point we'll go back to the deck, take a few questions, and I'm happy to answer those and get into more detail. Thanks very much.
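As a closing illustration of where those correlated tags and logs come from, here is a rough sketch of how an instrumented cache-write operation could attach them. The operation name, tag key, and batch threshold are assumptions for illustration, not the actual Hipster Shop code.

```python
import opentracing

def write_cache(items):
    # Hypothetical cache-write path in the inventory service.
    with opentracing.global_tracer().start_span('write-cache') as span:
        # A tag like this is what lets the tracing backend surface
        # "large_batch equals true" as correlated with the slow requests.
        span.set_tag('large_batch', len(items) > 1000)
        # A span log carries the item count, which is the kind of
        # correlated log line the before/after comparison showed.
        span.log_kv({'event': 'cache_write', 'items': len(items)})
        for item in items:
            pass  # placeholder for the real cache client write
```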