Well, everyone, welcome. I'm John Howard. I'm a software engineer at Google, and I've been working on Istio for about five years now. I'm going to be talking about tracing CI/CD with OpenTelemetry, and if you don't understand what those words mean, hopefully I'll cover them in the next 10 minutes. I only have 10 minutes, so things are going to move pretty quickly, but we're going to cover why observability into a CI/CD pipeline is important, how tracing can help there, and then the actual concrete steps you need to take to achieve this in your own infrastructure.

You may notice that I haven't actually said Istio yet, and this is IstioCon. This talk is related for two reasons. One is that Istio has done this in our own CI, so that we got visibility into our CI/CD pipelines. The other is that Istio itself can help you use tracing for your own applications. So this is an indirect way to learn more about tracing that you can apply to your CI/CD pipelines, but that you can also apply to your microservices using Istio. So it's indirectly related at best.

Okay, so hopefully everyone here who works on code has some form of CI/CD. It may not be the most sophisticated, but at some point you are building your code and testing your code. That may be locally on your machine, or it may be in some cloud runner — I think this is a picture of GitHub Actions — going through the usual flow of checking things out, installing dependencies, running tests, whatever.

Now, the dream here is that this is fast, right? If I can run my test suite in one millisecond, I can run it all the time. I can run it every time a file changes. I don't even have to think about running tests; I just run them and know pass or fail instantly. If they take an hour, on the other hand, now I run the tests before I go to lunch, come back, get a coffee, sit around for a bit, and finally get a result. These are wildly different experiences. So whether it's local or in a pull-request flow, the faster we can make our CI/CD pipelines, the faster we can iterate and the better the experience.

Now, the issue is that in order to make something faster, we need to understand why it's slow, so we can see where it's spending time and where we can improve. Otherwise we're going to waste a lot of effort optimizing things that don't really matter. And the current state of CI/CD observability, at least in the open source projects on GitHub that I'm familiar with, is pretty terrible. You have a few options. You can get the list view: a list of tests, and whether they passed or failed. That's kind of useless for optimizing them, right? It's great to know they passed, but I don't know how long they took, and I don't know why they took that long. If you don't like that, you can get a grid view. This one is also not very useful, for the same reasons. If you don't like a grid, you can get a graph — I think this one is GitLab, maybe. Again, I can see what's going on, but I can't optimize it, because I don't know where the time is being spent. So none of this is particularly useful if I want to optimize my CI/CD. I don't really have the observability; I can get surface-level stuff, but nothing in depth.

So that's where tracing comes in. Tracing is really this thing that's mostly associated with microservice architectures, where it's often called distributed tracing, right?
And the idea is that in a single monolith, sure, you can get by with logs and metrics and whatnot, but with microservices everything is distributed, and we need a way to join these together so we understand what's going on in the system. This shows a made-up example of a trace: a client calling some API, and under the hood the API calling a bunch of other services, some of which call other services in turn. So we have some gateway that calls a database, and we can see all these things together — how long each action took, what the flow of dependencies is, et cetera. It's a very rich view of the data.

In Istio, we support this as well. Here's an example trace of a real-world service — the productpage service from the Bookinfo sample — going through the ingress gateway, and we can see where time is spent on each microservice call. I think this is one of the most powerful tools in the observability toolkit, and I want to apply it to CI/CD.

So I'm going to start at the end, where we've actually done this in Istio, explain some of the benefits, and then talk about how to get there. Here's an example trace of one of our end-to-end test jobs. I've collapsed a lot of the spans, because there are actually 10,000 of them — we go into really fine detail — so this just gives an overview. You can see we get a lot of understanding of what's going on in the test. We can see this test takes five minutes, almost six, and a minute of that is spent setting up the kind cluster — kind is Kubernetes in Docker, which is where we run our tests. That's kind of slow; maybe we could focus there for optimization. I will note that this is after a lot of optimization, so I'm going to show some traces from before we optimized that make for better examples. We can also see that each test, in purple here, has a lot of setup time: 40 seconds here, and then a minute and five seconds there. That's another area of potentially wasted time. We could also see that maybe this build-images task could run in parallel with the kind cluster setup. So there's a lot of information we get immediately, and this is just the high-level view, right? We also have a low-level view of each actual HTTP call made in each request, and we make thousands of HTTP calls in our end-to-end tests.

That was the optimized view, which makes it look a little less interesting, because there are no obvious areas to improve — we've already tackled a lot of the low-hanging fruit. But here's an earlier example, somewhat simplified: we basically just run `go test ./...`, and we have a span for the entire test run, which takes about 10 minutes. You can see we tested two packages, the security package and the pilot package, which each took about two minutes. So the question is: what is going on in this five-minute gap where nothing is happening, right? This is from our real test infrastructure. We had this for many, many years and didn't notice, and once we turned on tracing it immediately stood out like a sore thumb. It took 10 seconds to recognize this once we had tracing, and we had spent years not knowing about it. Now, the actual reason why this happens is way beyond the time I have here, but I have a massive 25-page blog post that goes into depth on this problem and many other problems. So if you're interested, feel free to check it out.

Now, I think the biggest counter-argument to using tracing for CI/CD is: well, I have logs, right? The logs are good enough. And I would argue that the logs are actually not sufficient. If you go look at the logs from any Go test — I'm familiar with Go, but this probably applies to other languages as well — I promise you, you will get logs that make it look like your tests are running in sequence: package A, then B, then C, then D, in that order. You'll get a view that looks like this. Now, that's not actually what's happening. In reality, a lot of the packages are running in parallel. You may have something that looks more like this, and they may not be running in order at all. See, one thing that Go does is sequence the order in which it prints out the logs; that's not the order in which they're running. It batches up all the log output and writes it out in order, so that it looks sequential. It's completely lying to you, right?

Now, you may also find something more interesting than just "oh, sure, they're running in parallel," which doesn't help much on its own. You may find, for example, a huge gap between packages actually running. This is not hypothetical; this is a real-world trace, again from Istio's test suite. We can see this package takes about 50 seconds — this long blue one. Now, the really interesting part: this package has zero tests. So how can a package that has zero tests take 50 seconds to run, right? It is a mystery that's also discussed in my blog post. That's not really the point, though. The point is that if you look at the logs, they say the test executes in zero seconds, and if you look at the traces, you'll find that an entire thread is blocked for 50 seconds on this test. So again, logs are helpful, even with tracing, but in my opinion they're not sufficient on their own.

So now, hopefully, you're convinced you want tracing, and I'm going to give a lightning-fast introduction to how to get there. The first thing you need to do is instrument your code. If you have a function called build, for example, what we need to do for tracing is pretty minimal: we start a span, which is one unit of a trace, and eventually we end the span and report it. OpenTelemetry has docs on how to do this with pretty much every single language out there, so go check that out for more information.
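As a rough sketch — this isn't Istio's actual code, and the function name, tracer name, and stdout exporter are just placeholders for illustration — here's what that looks like with the OpenTelemetry Go SDK. A real pipeline would typically export spans to a collector rather than stdout:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// build wraps one unit of pipeline work in a span. Passing ctx down lets
// sub-steps create child spans that nest under this one in the trace.
func build(ctx context.Context) error {
	ctx, span := otel.Tracer("ci-pipeline").Start(ctx, "build")
	defer span.End()

	_ = ctx // hand ctx to sub-steps (compile, test, push, ...) here
	return nil
}

func main() {
	// Wire up a TracerProvider. Exporting to stdout keeps this sketch
	// self-contained; swap in an OTLP exporter for a real pipeline.
	exporter, err := stdouttrace.New()
	if err != nil {
		log.Fatal(err)
	}
	provider := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer provider.Shutdown(context.Background())
	otel.SetTracerProvider(provider)

	if err := build(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```

Every function you wrap this way shows up as one bar in traces like the ones shown earlier.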
The next thing we need to do is propagate context. In order to link up the spans, each parent needs to tell its child which span and trace it should be a part of. Traditionally, this is done through HTTP headers, because we're talking about microservices calling each other over HTTP. In CI/CD, you may also use HTTP headers, but oftentimes we have different processes calling each other — maybe some shell script executing `go test`, then `go build`, then `docker push`, whatever. So in CI/CD, it's often done through a TRACEPARENT environment variable instead: the same format, just a different way to convey it. This is not technically part of the OpenTelemetry standard, but it's becoming a de facto standard across a bunch of CI/CD tools.
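To make that concrete, here's a rough sketch of both sides of the handoff in Go. The value is the same W3C `traceparent` format used in the HTTP header (e.g. `00-<trace-id>-<span-id>-<flags>`), just carried in an environment variable; the helper names and the `go test` command here are made up for illustration:

```go
package main

import (
	"context"
	"os"
	"os/exec"

	"go.opentelemetry.io/otel/propagation"
)

var propagator = propagation.TraceContext{}

// runStep launches a child process with TRACEPARENT set, so any spans the
// child creates join the trace that ctx is part of.
func runStep(ctx context.Context, name string, args ...string) error {
	carrier := propagation.MapCarrier{}
	// Fills in the "traceparent" key, if ctx carries an active span.
	propagator.Inject(ctx, carrier)

	cmd := exec.CommandContext(ctx, name, args...)
	cmd.Env = append(os.Environ(), "TRACEPARENT="+carrier.Get("traceparent"))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// restoreContext is the child-side half: it rebuilds the parent's span
// context from the TRACEPARENT environment variable at startup.
func restoreContext() context.Context {
	carrier := propagation.MapCarrier{"traceparent": os.Getenv("TRACEPARENT")}
	return propagator.Extract(context.Background(), carrier)
}

func main() {
	// Continue the trace we were handed (or start fresh if TRACEPARENT
	// is unset), then run a step under it.
	ctx := restoreContext()
	if err := runStep(ctx, "go", "test", "./..."); err != nil {
		os.Exit(1)
	}
}
```

The same pattern works for any parent/child pair: a script invoking `go build`, a Makefile invoking `docker push`, and so on.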
Last thing on instrumentation: I mentioned that OpenTelemetry has info on instrumenting a bunch of languages. CI/CD uses bash pretty extensively, in my experience, and you can also do tracing with bash — it's pretty simple. Here's a blog post that shows a bit more about how to do that.

Finally, you may want to instrument the CI/CD platform itself, so you get the full picture. You can get a lot of useful info, like finding out, hey, the git clone of my repo is taking two minutes — what's going on? That's a lot of wasted time, or other things of that nature. Now, most people probably can't go modify the CI/CD platform, but you can at least open an issue, as I've done here for Prow — Prow is our CI/CD provider. A lot of the CI/CD providers out there, like Buildkite and GitLab, I think, are adding or have added tracing, so you may not need to do this step.

And that's it. Thanks, everyone, for coming to my talk. I hope that you've seen the value of adding tracing to CI/CD and are able to use it in your environment. If you have any questions, feel free to ask them in the chat or find me on Slack. Thanks.