Good afternoon, everyone. I hope you're all caffeinated and ready for more content. "Enabling effective observability by making high-quality, portable telemetry ubiquitous." That is OpenTelemetry's mission statement, but what does effective observability mean to you as a service owner? You're probably thinking about the metrics, logs, and traces that you export into one or many observability platforms, which let you answer questions like: is my service functioning as expected? And when it's not, why not?

But the reality is that your service is normally deployed in a complex distributed system. What you see here, for example, is a representation of the service mesh at Skyscanner: each of those dots is a service, each of those lines is a dependency between services, and your service is deployed here. So how do you know how your changes affect the whole system? How do you know how other changes affect your service? Most importantly, how do you understand what your users are actually receiving as a user experience when there is a whole Internet between you and those end users? For that, you can't just look at your own service; you need to look at the whole system holistically. And to do that, you need more context: a bigger context than your own service. That is what this talk is about.

My name is Daniel Gomez Blanco. I'm a Principal Engineer at Skyscanner, and I've been leading observability there since 2020, with a focus on adopting OpenTelemetry and open standards and simplifying our observability infrastructure. I originally joined in 2018 to work on client-side performance and then quickly moved to Kubernetes resource optimization, so I've looked at both sides of the stack. I've been a platform engineer for the last 13 years, in organizations from the very small to the very big. Since last year I've served as an elected member of the OpenTelemetry Governance Committee, and I've also published a book that covers some of the topics I'll be discussing in this talk.

So, going back to the concept of context, and what it means to debug with or without it. Without context, you've got intuition: you think about how something that failed before could be failing again now, and it's normally driven by runbooks. Hopefully you've got some client-side monitoring, some real-user monitoring data, telling you how your users are experiencing a particular regression. That tells you to follow a runbook and go check, perhaps, a backend service and a metric, maybe a custom metric counting the number of 500s. You may start to see a correlation; manually, by intuition, you suspect they could be related. As a backend service owner, you then follow that runbook in the middle of the night, trying to find the logs that correlate with that regression. You do that manually, following perhaps some queries you've run before. The runbook may also tell you to go and look at particular dashboards, a panel for memory consumption, maybe, that could be correlated. You do all of that manually, and when you don't find the actual root cause, you get someone else out of bed and start looking into another service, one that didn't get alerted but is one of your dependencies.
So we're looking at the same system, but through three independent, isolated views of a service: the client side, our service, and someone else's service. This is all built on intuition: we've seen something before, so it may happen again. What that leads to is quite a lot of finger pointing, with your frontend team, your backend team, and your platform team all wondering how to actually connect these dots.

Now, when we think about context, and about standards to define, process, and transport telemetry data, then we've got evidence. We're no longer basing things on intuition. That same end user who is experiencing a regression will be propagating an OpenTelemetry trace context in a standard way, and exporting their traces and spans in a standard way. That allows you to see, in one single view, how that end-user experience correlates with backend performance; you can tell in one go whether your problem is in one dependency or another. With context and correlation, we can all start to speak the same language and work with the same context. We've got one standard set of HTTP metrics, and those can correlate via exemplars to a particular trace, so the frontend team and the backend team are talking the same language, able to see, through exemplars, individual traces that were sampled at the time a particular data point was recorded.

It works the other way as well. You can start from traces and use the semantic conventions we've got around resource information to ask, when a particular operation took longer than expected, whether any system metrics, memory utilization for example, correlate with that particular experience. And you can use that same trace context and those conventions to correlate with telemetry data that you didn't even instrument with OpenTelemetry. With logs, for example, you can inject the trace ID and span ID (we saw that this morning in an earlier talk) and correlate those errors, in context, back to the traces. Soon we'll be able to correlate profiles as well, so you can look all the way down into the code-level performance that was behind a bad user experience.
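To make that log-to-trace correlation concrete, here's a minimal sketch using the OpenTelemetry Python API. It's illustrative only: the logger name and log format are hypothetical, and in practice the OpenTelemetry logging appenders give you this out of the box. The idea is simply to stamp the active trace and span IDs onto every log record so a log line can be joined back to its trace.

```python
import logging

from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Copy the active span's trace_id/span_id onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            # Render the IDs in the hex format trace backends expect.
            record.trace_id = trace.format_trace_id(ctx.trace_id)
            record.span_id = trace.format_span_id(ctx.span_id)
        else:
            record.trace_id = record.span_id = None
        return True


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"msg": "%(message)s", "trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'
))
logger = logging.getLogger("checkout")  # hypothetical service logger
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("process-order"):
    logger.warning("payment retry exhausted")  # carries trace_id/span_id
```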
Now, you're probably thinking: this is a lot of data, and it could get expensive. The reality is that most telemetry data we gather for debugging purposes is never used. Data that corresponds to successful transactions, or to transactions that completed within an expected duration, isn't actually interesting from the point of view of debugging or optimizing our systems. OpenTelemetry gives you the tools to tackle this, and one of the most important ones is trace sampling. If you've not heard about it before, there are two main types: head sampling and tail sampling. Both allow us to keep the most useful, most important data and discard the rest.

In head sampling, the decision about which traces to keep is made at span creation time, so you have to use the information available at that moment to decide whether to keep or discard a particular span. This is normally done probabilistically: you take a trace ID, run it through an algorithm, and that gives you a decision to keep or discard. Say, for example, you want to keep 20 percent of your traces. The power of trace context is that you can propagate that decision to your dependencies and your child spans. The service that decided to sample a trace propagates that decision, and a dependency can then honor it and also sample that trace. What you end up with is a complete view of the transaction as it went through the whole system, rather than the completely disjointed views you'd get if every service sampled 20 percent independently.

This is the simpler form of sampling: it lets you keep a particular percentage of traces, it's easy to configure, easy to maintain, and it doesn't require extra resources, because the decision is made when the span is created. But it's not as powerful, because it doesn't let you look at the whole trace.
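Here's what that head-sampling setup looks like as a minimal sketch with the OpenTelemetry Python SDK (the service and span names are hypothetical): a parent-based sampler wrapping a 20 percent trace-ID-ratio sampler, so root spans are sampled probabilistically and child spans honor the decision propagated from their parent.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# TraceIdRatioBased derives the decision from the trace ID itself, so
# every service using the same ratio decides consistently; ParentBased
# makes child spans follow the decision carried in the incoming context.
sampler = ParentBased(root=TraceIdRatioBased(0.2))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

tracer = trace.get_tracer("my-service")
with tracer.start_as_current_span("handle-request"):
    # Roughly 20% of root traces are recorded and exported end to end.
    pass
```

Because the decision travels with the W3C trace context, downstream services configured the same way keep or drop the same traces, which is what gives you the complete end-to-end view.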
With tail sampling, on the other hand, we do get to look at the whole trace before making a decision. When we receive the first span for a trace at one single, central point, we start buffering its spans, and at some point we decide whether to keep the trace or not. That is quite powerful, because it allows us to sample only the traces that matter most to us. For example, you can keep any trace that contains an error in any of the services, or the slowest traces, or the ones that go over a particular latency threshold. The downside is that it's more complex to operate: it requires all spans of a trace to go through one single point. OpenTelemetry Collectors give you the tooling you need to route all the spans for a trace to a particular replica and then make those sampling decisions, and there are also vendor-specific features that provide this as a service. Using tail sampling at Skyscanner, we tend to keep around 5 percent of the millions of spans and traces we produce every minute, which is quite powerful. We've seen teams move from very verbose logging to tracing with tail sampling, turn that verbose logging off, and save about 80 to 90 percent of their telemetry costs.
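To illustrate the buffer-then-decide idea, here's a toy sketch in Python. This is not production code and not how you'd actually run it (in practice you'd use the Collector's tail-sampling processor or a vendor feature, as I said); the span representation and threshold are hypothetical.

```python
from collections import defaultdict

LATENCY_THRESHOLD_MS = 500  # hypothetical "slow trace" cutoff


class ToyTailSampler:
    def __init__(self):
        # trace_id -> list of buffered spans
        self._buffers = defaultdict(list)

    def on_span(self, span: dict) -> None:
        """Buffer every incoming span under its trace ID at one point."""
        self._buffers[span["trace_id"]].append(span)

    def decide(self, trace_id: str):
        """Called once a trace is considered complete (e.g. after a
        decision window): keep traces with errors or slow spans."""
        spans = self._buffers.pop(trace_id, [])
        interesting = any(s["status"] == "ERROR" for s in spans) or any(
            s["duration_ms"] > LATENCY_THRESHOLD_MS for s in spans
        )
        return spans if interesting else None  # None means discard
```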
Another way to think about the return on investment of your telemetry data is to use each signal for its intended purpose. We've got metrics, traces, and logs, for now. Metrics, for example, give you a stable signal: you produce metric data at a regular interval, and whether a particular replica is receiving 5,000 or 500 requests per second, you're going to produce a roughly constant amount of telemetry data from those metrics. That lower volume actually allows you to make exports more resilient: you can retry your pushes or pulls of metric data more robustly than you can with more voluminous data like spans or logs. It is important to keep cardinality low with metrics, and then correlate from metrics to traces when you need extra granularity. It's quite easy for developers to add unbounded-cardinality attributes to their metrics and basically DDoS their own infrastructure with the resulting explosion in cardinality.

So if metrics, with low cardinality, drive alerting and long-term trend analysis, we can turn to traces for the rest. As I said, exemplars and semantic conventions let us correlate to traces, which give us the high granularity, the context, the backbone of correlation for debugging, optimization, short-term analysis, and all the good things we get from tracing. It's a bit more expensive to queue and retry when you're producing gigabytes of data per minute, but it does give us that backbone of correlation. It's also possible with OpenTelemetry Collectors to generate metrics from spans, so you don't always have to decide everything ahead of time.

Then we've got logs. Logs are high volume and low context, so they're not the best return on investment, but they do have their use cases: background tasks, startup, shutdown, and legacy libraries that may not be instrumented with tracing but into which you can inject trace context and correlate with your traces. Just structure your logs, otherwise they're pretty much useless, or use the OpenTelemetry appenders, which give you a lot of this out of the box: the resource information and the correlation with trace data. Logs are also the backing data model for the events API, which will let us produce things like infrastructure events that are not related to application logs.

Another way to control the data production from your applications is the concept of metric views. If you've never used metric views before, they're a powerful way to define the resulting metric streams from your application; those of you coming from OpenCensus will be familiar with this. The metrics API gives us a way to decouple a measurement from how those measurements are actually aggregated. What that results in is the ability to tell the SDK how to aggregate measurements, and to change what the original instrumentation author intended for a metric. On the right-hand side, for example, we're seeing a view configured to take one of the auto-instrumented metrics, request duration, which is a histogram by default. But maybe I'm only interested in a sum of all the requests for each of the routes that hit my application. We can do that at runtime, configuring it without changing any code and without touching the API layer of how that's instrumented. That is quite powerful: instrumentation libraries let you control the metrics that come out of them, and as we see OpenTelemetry integrated natively into more and more libraries, you'll be able to control the metric production for your use case.
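As a sketch of that example, this is roughly how such a view looks in the OpenTelemetry Python SDK. Treat the instrument name as an assumption: it's the current HTTP semantic-convention name, but the exact name depends on the semantic-convention version your instrumentation emits.

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import SumAggregation, View

# Re-aggregate the auto-instrumented request-duration histogram into a
# plain sum per route, at runtime, without touching instrumentation code.
view = View(
    instrument_name="http.server.request.duration",  # histogram by default
    aggregation=SumAggregation(),                     # collapse to a sum
    attribute_keys={"http.route"},                    # keep only the route
)

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[view],
)
```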
OK, so as an observability engineer you're thinking: this is amazing, but how do I get my company involved? How do I get people to change their mindset? There are two areas to cover. The first is how we communicate value to leadership, and there are two avenues to explore there. The first avenue is simplification and future-proofing. OpenTelemetry decouples the API layer from the SDK, and that API is a future-proof, stable API you can rely on. Many companies, Skyscanner for example, used to maintain their own abstraction to decouple themselves from implementation details; that's what OpenTelemetry now provides out of the box. Once you've decoupled from the SDK, you can use all the standard tooling, like OpenTelemetry Collectors, for the transport and processing of telemetry data. That lets you simplify your processing and transport pipelines into Collectors, handle everything in a standard way, and then connect to observability vendors or to open-source solutions you run in-house. Collectors not only ease the migration towards open standards; they also ease migration between vendors or particular open-source platforms. And there's more and more integration from vendors and open-source tooling with OpenTelemetry and its semantic conventions as the default standard, so you get that as well.

The other avenue is to correlate operational and product health with business outcomes. If you've started your SLO journey, you're thinking about how to communicate the value of SLOs, and it's really important to measure service level indicators (SLIs) as close as possible to the user experience; your client could be another service or an end user. With OpenTelemetry we can do that: we can measure things as they matter to us, correlate that with business KPIs, and see how a regression in one affects the other. Everything we've said about propagating context also helps you understand how a target you set on one service affects its dependency chain. If you understand that dependency chain, you can set realistic and sustainable SLO targets, and that ultimately means meaningful reliability for your users, because you can relate it to your business outcomes. And when something does fail, you can use all the power of observability to reduce your time to resolution and, in general, make everyone happy, including the business people.

Now, it's sometimes easy to convince leadership, but harder to convince engineers in other teams, because not everyone cares about observability. One thing that has worked, at least in our case, is achieving cross-organization alignment, and there are two groups of engineers that, to me, are critical on this path. The first is observability engineers. They're normally part of a single team; they're your observability and telemetry experts, in charge of implementing company-wide standards: deciding which protocols to use, what the SDK configuration should be, whether to use an existing OpenTelemetry distro or create your own. They're normally also in charge of maintaining the infrastructure, if you run your own backend, and of making sure that the path of least resistance is the golden path for everyone to follow, so adopting best practices is super easy.

The second group I like to call observability ambassadors, though you could call them observability champions or other names: people working across the organization to deliver those best practices. Their work on adoption is as important as the work on enablement, and it's crucial, because they understand the domain in which they operate, and they also understand the usage of the API. They don't need to be telemetry experts; they don't need to understand how their telemetry is passed down to Collectors and all of that. But they understand the API layer and the instrumentation packages they use, and they're in charge of adapting those standards to their domain. Bring both groups together, observability engineers and observability ambassadors, and they can deliver that cross-organization alignment.

One more thing that's important to understand: we all learn by doing, not by seeing, so you want to make it fun to discover the value of observability from an engineering perspective. The best way to do that is to train with hands-on exercises and hands-on labs, because debugging is a skill, and it can be trained; if people have never used tracing before, an incident shouldn't be the first time they use it. Something that has worked quite well in our case at Skyscanner has been gamifying root cause analysis. We recently published a blog post on the OpenTelemetry website that covers this in more detail, but we took the "Wheel of Misfortune" popularized by Google and used the OpenTelemetry demo to simulate incidents in a system instrumented with OpenTelemetry. That allows us to put several teams into the same observability game day and let them play against each other, to see who actually debugs an issue the fastest.
That also helps them understand OpenTelemetry concepts and the platform they work with, and in my experience, the teams that use context are normally the ones that win.

So, the takeaways. The first: use context over intuition. If you're working on a complex distributed system, you can no longer pretend that the debugging and monitoring practices of five or ten years ago will work on the complex systems of today. If you've got context, you've got evidence; without context, you just have "whatever happened before may happen again" intuition. Second: use the right tool for the right job. We've got multiple signals in OpenTelemetry for a reason, and each has its pros and cons; understand them, make sure they work as expected, and correlate between signals. And finally, understand that observability is a cross-functional discipline. It's not something only observability engineers care about. You need to roll out that adoption of best practices across your organization, from engineers all the way to business analysts, if they want to understand the implications for user experience, because that is what actually makes observability effective.

That's it for now. I think we've got some time for questions. Otherwise, you can catch me at the OpenTelemetry Observatory in the expo hall from tomorrow. We've also got a feedback form if you'd like to leave feedback on this session. Thank you.