Hi, thanks for your time. We run a platform with a lot of services, and over the past while we've been rolling out distributed tracing across it with Istio and OpenTelemetry. And today I want to share a few key insights that may help you if you're thinking about doing the same.

OK, so let's imagine it's 2 AM. You run a web store and suddenly your users can't check out; they're getting those dreaded 500s. Now, if you had no tracing at all, maybe you'd fall back to logs and some metrics, like the old days. Let's say you've started tracing and you've got the web store instrumented, but you don't have Istio. Then you start to see something like the left-hand side: we can see that we've got something going wrong at the web store, but we're not entirely sure what. Perhaps you're new to the team and you don't even know that there's a checkout service. Perhaps there are legacy services and you're trying to work out what's going wrong. Now let's say we've instrumented the web store, we haven't instrumented everything else, but we have all services on the Istio mesh. Then you can immediately see that there's a checkout service, that something is going wrong there, and perhaps you can focus your investigation there.

So that's one area where we found Istio particularly helpful. We've got some big, critical targets instrumented, like the front door to our web application, and there are dozens of services underneath. Some of them are instrumented, some are not, but Istio can help point us in the right direction.

So what do you need for this? One thing you need to understand is that, unfortunately, Istio is not psychic, and this applies to any tracing solution you're using. As a request comes into your system and passes through all the services, you're going to need to propagate context. That typically takes the form of HTTP headers, and there are various formats. Generally it involves extracting the headers from inbound requests and then injecting them into outbound requests. But frankly, developer time is better spent on feature work. So for that we use the OpenTelemetry auto-instrumentation, which automatically propagates context for many common libraries and frameworks. We've instrumented eighty-plus initial services with the Java agent, injected with the operator, and we've barely changed a single line of code.

But there was one challenge we encountered with headers in the first phases: some pieces of our infrastructure did not support the latest standard, the W3C trace context headers. Again, OpenTelemetry and Istio can help you solve this. The only thing that needs to remain constant is the headers throughout the request. The OTel collector can receive trace spans in many different formats, probably including the formats you rely on. We configured Istio to use the Zipkin provider, which uses B3 headers, and the OTel Java agent and many of the SDKs have a propagators config option, so you can set that to use B3 headers and they'll propagate the W3C ones as well. We did that, and we get that end-to-end request flow even though we're sending spans in several different formats. Perhaps you have a reliance on some older tracing formats, but you can still use OpenTelemetry, and Istio supports many formats as well. So survey your infrastructure, work out which formats are appropriate for you, and then go ahead and configure them.
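A rough sketch of what that propagator setup can look like, assuming the OpenTelemetry Operator's Instrumentation resource for the Java agent and a collector that accepts both OTLP and Zipkin spans; the names, namespaces and endpoints below are placeholders, not the exact setup described in the talk:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-tracing              # placeholder name
  namespace: web-store            # placeholder namespace
spec:
  propagators:                    # emit W3C trace context and B3 headers side by side
    - tracecontext
    - baggage
    - b3multi
  exporter:
    endpoint: http://otel-collector.observability:4317   # placeholder collector address
  java: {}                        # use the operator's default Java agent image
---
# Collector side (fragment of the collector config): accept spans in several formats
receivers:
  otlp:
    protocols:
      grpc:
      http:
  zipkin:
    endpoint: 0.0.0.0:9411

Workloads then opt in to the agent with the instrumentation.opentelemetry.io/inject-java: "true" pod annotation.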
Okay, so in terms of config: we googled around and found various different ways to configure Istio tracing, but we did settle on the de facto way you should do it now, which is mesh config plus the Telemetry API. In your mesh config you define your extension providers, which backend and format you're using, like OpenTelemetry or Zipkin; there are various options. Telemetry is then where you pick which providers are the default for each workload. You can also do things at the namespace or workload level, so perhaps you want to roll things out on a gradual basis, you can do that; or you might want to disable span reporting for particularly noisy services, which we did for the OTel collector itself. And one thing to call out with the random sampling percentage: typically this is parent-based, so if any tracing decision has already been made before your mesh, Istio is just going to respect that decision, not change it. It only really takes effect when Istio receives a request for the first time and there are no tracing headers on it.

Okay, one thing that will take a little bit of time to change is user habits. We've got a runbook which is pretty much muscle memory to seasoned responders. It says check the logs, and it now also says check the traces, but people will default to looking at the logs. If you can put those trace IDs in the logs, you'll help people discover that the traces exist and see what's really happening under those requests. That's quite easy to do with the Envoy access logs, and you should put them into your other logs where possible as well; the OTel Java agent will help out with that too.

Okay, so quick recap. The key message here is really: enable it and then get iterating. Getting full end-to-end visibility across everything from day one will be quite tough, especially if you've got a lot of services. You can get a head start with the OTel auto-instrumentation, and if you need to mix tracing formats, you've just got to make sure the headers are the same throughout. For config, give the Telemetry API a go, it should sort you out; I've linked a tutorial down there which I found recently and which is really helpful, so you can give that a go. And don't forget about the other observability signals: add your trace IDs into logs and you'll help your users discover what's going on under those requests. Another thing you can do is add exemplars to metrics, but unfortunately I don't have time for that today. Thanks for listening, and if you have any feedback or questions, please reach out.
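A rough sketch of the shape of that config, with a placeholder provider name, collector address and sampling value rather than the exact settings described in the talk:

# Mesh config (for example under the IstioOperator spec): declare a tracing extension provider
meshConfig:
  extensionProviders:
    - name: zipkin-tracing          # placeholder provider name
      zipkin:
        service: otel-collector.observability.svc.cluster.local   # placeholder collector service
        port: 9411
---
# Telemetry API: make that provider the mesh-wide default, with a low random sampling percentage
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: zipkin-tracing
      randomSamplingPercentage: 1.0   # only applies when Istio starts the trace itself
---
# Telemetry API, workload-scoped: disable span reporting for a noisy service such as the collector
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: otel-collector-no-tracing
  namespace: observability            # placeholder namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector   # placeholder label
  tracing:
    - disableSpanReporting: true

A namespace-scoped Telemetry resource works the same way if you want to roll tracing out namespace by namespace.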
Cool, and we have just a few minutes for questions as well. If anybody has any, go ahead and raise your hand and one of the two of us will meet you with a mic. I just wanted you to share your experience on the sampling rate: what sampling rate have you configured? And I'm assuming you're running on some cloud, maybe Azure or AWS, so if you compare their in-built distributed tracing systems, how do you choose whether to go for Istio, comparing the sampling rate and the cost if you configure this and send everything there, against their default logging costs? Okay, yeah. So we're still experimenting with the sampling rate; we really want to try out tail sampling, but we've not quite got there yet, and I'm hoping to learn a bit more about that this week. We actually sample our staging environment, where developers are testing and where they replicate problems, at 100% so that they can definitely get their traces. Our production default is 1%, but we do configure that a bit higher on some workloads when we've been trying to debug particular problems. We did find that resource usage generally goes up a bit as you raise the tracing rate, but the long and short of it is that you need to experiment quite a lot; there's no one answer for all your workloads. The other thing you have to do is work out where your traces are actually starting, because that's where you'll need to adjust the rate. We have some cases where the requests come in encrypted, so the trace actually starts at the OTel Java agent rather than at Istio.

Alrighty, any others? Great, awesome, thank you Chris. Oh, great. So the question is: where do you keep your traces? We tried various vendors; right now we're storing them in open-source Grafana Tempo. How about retention? For how long do you retain them, and how do you, for example, deal with things like GDPR or PII in traces? So we generally retain our traces for 14 days, and we generally find that's enough time to look back, but we do recommend people export the traces and save the file for postmortem documents and that sort of thing. Alrighty, awesome, thank you Chris.
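For reference, the per-workload sampling override described in the first answer is just another Telemetry resource with a selector; a rough sketch with made-up names:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: checkout-tracing            # placeholder name
  namespace: web-store              # placeholder namespace
spec:
  selector:
    matchLabels:
      app: checkout                 # placeholder workload label
  tracing:
    - randomSamplingPercentage: 10.0   # temporarily higher than the 1% mesh default while debugging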