Hello everybody, and thank you for joining my session today. I know the amount of content out there can be overwhelming, and trying to consume it all virtually is especially challenging, so I appreciate you taking the time to view my session. Thank you as well to everybody at the summit for putting this event together; it's a tremendous amount of work, and I appreciate it.

Before we begin, I want to briefly introduce myself, because I think it's important to understand who's up here talking to you and why they feel they have, I guess, the right to do so. My name is Chris Riley. I'm a DevOps advocate in developer relations at Splunk. A lot of you are familiar with Splunk as a logging tool, but we are much more than that: we have a suite of tools in the DevOps area called our observability suite. My job is to represent the practitioner, both externally and internally, in the area of DevOps. If you scan that QR code, you'll get access to my three podcasts, which is a crazy number of podcasts to be doing, as well as other ways to connect with me. I always love to connect with people who've joined my sessions, because it's good to hear what you thought, what you learned, maybe what you didn't learn, and to continue the conversation.

This session is all about OpenTelemetry, and more specifically about how OpenTelemetry builds pipeline resilience for enterprises, which is critical in the modern DevOps environment. That's where I want to start: the business side, the business benefits of this technology. It's important to realize that a lot of times when we talk about DevOps, the conversation is all about releasing code faster and better. But the mechanisms for releasing those applications are also important. The framework, the guardrails, and all the tooling, which I refer to as the management plane, for managing your cloud, doing CI/CD, doing pipeline analytics, getting data from your tooling for monitoring, and so on, are just as important as the velocity of the applications themselves.

Our delivery chains are always changing, and the nature of our applications is always changing. We'd like to believe that at some point the change will stop and we can say we have built the DevOps environment. But the reality is that new technologies come out constantly; being in this market, I see new approaches and new tooling released on a month-by-month basis. The changes in our application architectures and in how we deploy our applications directly impact how we operate them. So even if you don't want to think about how we manage these applications, you really have no choice. What used to be deploying code onto static infrastructure, where a lot of times we would even name the servers we were deploying to, has transformed into distributed systems built on microservices and Kubernetes. What once was slow-moving deployments to infrastructure that never changed is now rapid deployments to infrastructure that is always changing. As a matter of fact, the boundaries between your application and your infrastructure have blurred. When you deploy applications now, you're actually deploying infrastructure along with them, and it's ephemeral and stateless; you tear it down rapidly with every single deployment.
The relationships between the components of the application are no longer contained in a single binary or a single piece of code. They're distributed across services, each service has its own stack, and each service has its own team building it. The relationships between those services, also called contracts, are somewhat unknown. A single operator can't intuit the relationships between all these services; starting at around 15 microservices, it's impossible for anybody to really know all of the dependencies and relationships between them. So identifying issues and getting data into the monitoring platform, the management plane, for this is a new challenge. It's different. Back when we had static infrastructure, we would start with the log. Now you can't start with the log. The log is the destination; hopefully you never have to get there, but the log is the endpoint, not the starting point.

So, inevitably, monitoring has to evolve, and it has to evolve into this new industry term we use called observability. The first question I get when I use the term observability is: is observability just monitoring? Yeah, ultimately, it is. The way we look at it, observability is the data, and the practices around that data are your modern monitoring practices. If you're a good techie curmudgeon like me and you hear a new term, you probably just roll your eyes. But the reason the term observability matters, given everything we're going to talk about around OpenTelemetry, is that the landscape of monitoring has changed. Monitoring distributed, modern, cloud native applications is a new problem space, and that problem space is significant enough that we need a term to be efficient about discussing it.

Before I go on, we really need to hone in on some key terms. The first is observability, and I've defined that for you; I'm going to use it to talk about how we leverage OpenTelemetry in our modern applications. Then we need to talk about the data sources. I said observability is essentially data. What does that mean? Well, we have two new key sources of data: traces and spans. Spans are a component of a trace, and there is span context that is also part of a trace. Most of us are used to metrics, which are measurements, and we're also used to logs, which are traditionally what we think about when we think about monitoring. All three of these are pillars of observability.

Perhaps traces are new to you, so let's dig into them a little. They are a standard. A tracer is how you get context from a trace. Spans are the time spent in any one call that is part of a trace. So if you think about it, the trace is the transaction from the point a user enters your application to when they leave it. Inside that transaction, they spend time in specific calls, and those are your spans: the amount of time spent there and what is done there. There are different types of spans: client, server, producer, consumer, and internal to the application. Spans have attributes, which are key-value pairs, and tags, which are critical metadata you add to spans to enhance their value. One thing I will say here is that tags are an information architecture challenge, not a technology challenge.
So don't expect the technology to decide tags for you. That is something your team needs to decide from an information architecture and application architecture perspective. You also have events and links as part of spans. Then we have span processors: these are what send spans along, maybe in batches, maybe one-in-one-out. And we have exporters of spans: the tools that take the spans and traces from your infrastructure or your application, correlate them together, and send them to your monitoring tool.

Now let's talk about metrics, because this is another key aspect of observability and cloud native applications. A metric is a time-series-based measurement of an element in your application that you choose to measure. A lot of times in the context of metrics, you'll hear about RED metrics. RED is generally used for application performance monitoring, and it stands for rate, errors, and duration. These are key components for quickly visualizing your application so the team understands its health and performance, and in troubleshooting, for understanding how these metrics relate to the key thresholds and SLOs your application should be meeting. So the metrics themselves, RED in this example, are the SLIs: the indicators you're using. The raw aspects of a metric are the name of what you're measuring, like rate, and the actual measurements: the values over a period of time. You're going to aggregate these, and most monitoring tools will keep some span of time of them. What is important to consider in your monitoring tool, for both traces and spans, is fidelity. A lot of tools will sample; some will keep 100%. Understand the impact of that in your environment.

Now, you may be asking why I haven't talked about OpenTelemetry yet. Well, all of that foundation is important for explaining the value of OpenTelemetry and why you should consider it as part of your environment. We use this term a lot: GDI, getting data in. This is how you extract traces, spans, metrics, and logs from your application and transport them into your monitoring plane. In most instances, we're used to looking at this from a very proprietary standpoint, where the monitoring tool also dictates the implementation of GDI. It's usually in the form of an agent. Those agents are proprietary; as they update, you update them, and you're deploying them across your infrastructure. If you have multiple monitoring tools, you might have multiple types of agents. But GDI is something every organization needs to deal with. We're all getting the same kinds of telemetry from our applications and our infrastructure, so it doesn't really make sense to approach GDI differently depending on the monitoring tool, whether it's infrastructure or application, or whether it's agent-based or in-code instrumentation. By treating it as a proprietary thing, you've actually limited yourself from growing in the future, because now, if you want to change your monitoring plane, whether reconfiguring it, scaling it, or actually changing the tooling itself, you also have to change how you collect the data.
And it's not just a trivial change of replacing one agent with another, because those agents start to determine and impose practices upon you. Really, your process of getting data from your application should be completely unshackled. It should be decoupled from the tools you use to monitor that application, and it should not rule your infrastructure, because once it does, it actually dictates how you modernize your application.

Also, as you go faster, you need to onboard more applications into your environment. Without a standard, open way to collect data from those applications, it's very hard to onboard: your service onboarding process now has a whole slew of considerations around the agent, the configuration of the agent, and so on. Which means you simply can't scale, because you might have snowflake agents across your application, different types of agents for different use cases and different stacks, or, even if it's the same type of agent, snowflake configurations in those agents to support new services. None of this enables scale. You need to think about enabling scale, decoupling your monitoring plane from the application and how you instrument it, and building in pipeline resilience. Resilience is a mindset, and embracing open instrumentation and open data collection from your applications supports that mindset, so that you can be better at building more resilient pipelines and delivery chains. The reason is that your instrumentation to get that data out is standardized, no matter the application or infrastructure and no matter the tools you're using to monitor it. That is the goal, and that is why we really want to think about changing how we get data into our tooling.

But there's another element to this: a missed opportunity. There is a lot of data coming out of your infrastructure and your application, and that data is powerful, in some ways negatively. For example, private information, PII: getting that data into your monitoring tool could be risky. Perhaps you don't want that data in your monitoring tool at all. You need to massage it and make sure your operators don't have access to that information, even though it is present in the application's data itself. The other missed opportunity in all this data is that it would be useful to apply business logic to your GDI, to your telemetry, so you can be more efficient and effective with the data coming in, maybe manipulating it or splitting it into multiple streams to better leverage it in various ways.

And that gets to the flexibility of data flow. Instead of having proprietary agents, which a lot of the time are dumb agents that just take whatever they get and throw it into your monitoring plane, you get to manipulate that data. You want to be able to split it into multiple paths: some information goes one way and other information goes another, or all the information goes to multiple tools, say one copy to a logging tool and one to an observability tool. You want that control, because one-for-one gets you into these weird scenarios where you're actually passing data from one monitoring tool to another monitoring tool, sometimes through three; I've seen that type of configuration as well.
All of this leads to the idea that we need technology and standards that help us embrace the business value of the telemetry coming from our applications and infrastructure and maximize it, as well as unshackle us from the GDI aspects of getting that data into our monitoring plane. That's where OpenTelemetry comes in. I put OTel up there because, like all good industry terms, we like to abbreviate. We do it with Kubernetes in the form of K8s. We do it with observability in the form of o11y, or "olly" if you're really trendy. We do the same with OpenTelemetry, and even within the OpenTelemetry project, we call it OTel. So if you see OTel, it means OpenTelemetry. And yes, I know this can get frustrating. It's useful for efficiency, but I generally won't use it in conversation, because it's better to stick with the full OpenTelemetry name so people understand what you're talking about.

The OpenTelemetry project is a CNCF open source project for GDI: for getting data from your infrastructure and your application into your management plane or other tools. You have a lot of control here. It's built on open standards. First come the standards; then you have the collector and the services implementing those standards. The implementation comes in the form of a collector and an agent. And, as always with open source, there's a community. It is important to have a community of people contributing to the project and engaging to help each other be as effective as they possibly can, and it is a tremendous community.

Here is where OpenTelemetry fits in the nature of applications. We know the typical enterprise is going to be at various stages of its cloud journey and its DevOps journey, and that's okay; that probably won't change for a long time. An organization is typically going to have some monolithic applications, where you use proprietary agents. You'll have hybrid applications, where some components are cloud and some are monolithic, and those two may even need to talk to each other. And then you might have cloud native applications: applications born in the cloud. These are really what we're talking about; they necessitate observability, and they especially necessitate an open way to collect data from the applications. Where OpenTelemetry fits best is the cloud native application. When I say cloud native, I don't just mean cloud. I mean a cloud-based application that generally leverages microservices and usually Kubernetes. So the nature of the application has changed in addition to the destination where it runs.

Now let's dig into what OpenTelemetry really is from a technology standpoint. Think about the key signals from your application: traces, metrics, logs. Those are the vertical aspects; they are the data, essentially. The way you work with that data could be through APIs in various formats. It involves architectural choices, like whether you're using the sidecar pattern and how you're using collectors. And there's a wide range of data formats, just a tremendous number: JSON, key-value pairs, et cetera. OpenTelemetry is meant to encompass all of those. Logs are a feature on the roadmap; they're not there yet today. But traces, spans, and, as of very recently, metrics are supported.
But the idea is that OpenTelemetry is going to encompass all the layers and all the pillars of telemetry coming out of your infrastructure and application. As I mentioned, community is a tremendous part of this, and of any open source project. There is tremendous backing behind OpenTelemetry from major cloud providers and top industry vendors, including Splunk; this project is very important to us, and we have dedicated a lot of resources to contributing and making it successful. There are also large cloud native application vendors like MailChimp, Spotify, and Postmates who are leveraging this at scale in production, which is very important. There are even other open source projects involved, because they come into play as exporters and providers that get data into or consume data from OpenTelemetry. I know those URLs are small; I'll give you a chance to screenshot this quickly, because I'm not sure you'll be able to get the slides. There's a tremendous amount of information about the project on OpenTelemetry.io.

A few things about the state of the project I want to clarify. It's not yet generally available; the project itself is still beta, but it is very widely used, and companies have been very successful with it. And like I said, the community is tremendous. Just go and look at the dev stats and the roadmap for it. It's fantastic, and it makes clear that the project is going to be adopted widely, and that if you adopt it, you're going to have a lot of utilities out there in the community to help you with implementation.

So let's go back and reiterate: why is OpenTelemetry so important? Well, it decouples how you get data into your monitoring tools: the tooling that collects the data from the tooling that makes use of that data. That's very important because it's vendor agnostic, which means that if you change your monitoring tools, you don't have to change the way you collect the data. It allows you to centralize and be consistent in what you do with tags. If you want to implement business processes, like redaction of PII before it gets into the monitoring tool, you get to centralize that logic, so that as you scale and add new services, it's consistent across all your services and all your infrastructure with no additional effort. It also brings capabilities you might not find in proprietary agents, like compression, encryption, and logic around high performance and availability. Really, it reduces time to value: the amount of time from onboarding a new service to getting value and utility from that service's data, because you don't have to think about how you're going to instrument. You're going to set one standard for your organization that will be used no matter the infrastructure and no matter the application. There might be variations of that, but even the variations are standardized. So there's no question, when you onboard a new service or make a change to a service, whether you have to change the data collection as well.

The objectives should be clear by now, but these are also the objectives of the project: to offer a vendor agnostic implementation for receiving, processing, and exporting telemetry, from application to monitoring. We want to make it very easy to use.
Everybody in the project is committed to increasing the ease of deploying the OpenTelemetry collector, working with the processors and the data, and exporting that data to your management plane. We want consistent formatting across all metadata, across all traces, spans, logs, and metrics. And all of this is to improve observability by being built by a community of industry-leading vendors as well as practitioners; they are building their best practices into the technology itself. There is also extensibility: the platform for building plugins into OpenTelemetry, because there is no one-size-fits-all application or environment, is tremendous for reducing the time to value of creating custom integrations. And there is a single code base: there is the contributor code base, plus any extensions of it, but it is a single code base for the collector to support traces, spans, and metrics today, and logs in the future.

The components of the project start with the specification, the standard. This is very important because it is what everything is built on. The specification defines how traces, spans, metrics, and logs are sent and structured. So it's very much about the data, and about semantic conventions around the data. Specifications like this can be hard to consume, but it's important to understand that the specification is there, and that the collector and the client libraries are implementations of it. It's useful to review the specification so you understand the why behind how these collectors and client libraries work, and why it matters to your organization to standardize on it.

Then, of course, we have the collectors themselves, which can be agents, or implementation in code with client libraries. That's an important distinction to make. There are benefits to implementing in code and benefits to using agents, and which one you choose will depend on several aspects, such as whether the language you're writing your applications in has auto-instrumentation. A lot of times organizations have varied stacks, where services are written in multiple different languages. In that case, in order to embrace ease of use, where you might have auto-instrumentation for one stack and not another, maybe you use agents across all of them. On the agent side, you have to think about things like scalability; a lot of times it's easier to scale agents than to teach your development and engineering teams best practices around implementing with client libraries. So there is a lot to consider here, and a lot of it is not necessarily technological. A lot of it has to do with the application architecture itself, the size of your engineering teams, and how the application is evolving over time. You might have a combination, but normally you see organizations standardizing on either in-application or agent-based implementation of the OpenTelemetry collector.

The components of the collector are fairly straightforward. For both traces and metrics, you have receivers and exporters. There are a lot of standard receivers out there, like Jaeger and Zipkin, but there is also OTLP, which is a standard format for both a receiver and an exporter.
On the traces side, that's typically where it ends, because these are very common ways of getting traces and spans out of your application. On the metrics side, you might have other receivers, such as Prometheus and the host itself, so data around CPU, disk, memory, et cetera. As for OTLP, not to be confused with OLTP, which is another term out there, the hope is that it becomes the common case: that organizations embrace it as best of breed, so that your receivers and your exporters are based on open standards, just like the collector itself.

Then you have processors. These are what manipulate the data, essentially in-stream within the collector. You can do simple one-at-a-time processing or batch-based processing. There's logic here for high availability, to make sure there are retries and the transaction doesn't fail inside the processor. You can implement sampling here if you choose to. As for best practices, we tend to take the approach that you should not sample: you should have full fidelity in traces and spans, as well as in cardinality, which is essentially all the variations of metadata around the traces and spans. But for different reasons you may choose to implement sampling, and you can do that in the processor. Processors today are available for traces but not yet for metrics.

There are other important aspects of the project. We've already hammered on the idea that there is a huge community, but there is also a really strong governance board, because we do have a lot of vendors involved. It implements the code of conduct, acts as the steering committee driving the technical direction of the project, and makes sure that things like documentation, which is such a huge component of any platform or technology, are kept current and communicated to the community. And then there are protocols: a lot of work is being done around the OTLP protocol to create more standards around getting data into the processors from a receiver and out through an exporter.

Let's look briefly, again, at what the collector is. It's a vendor agnostic implementation of the OpenTelemetry standards. It is a single binary that can be deployed in different ways, as an agent and/or a gateway. The client libraries are what get implemented in the application, and the collector is the default destination of the client library data.

So let's look at the architecture of the collector. As I said, you have receivers. You're going to define your receivers: how you're getting data into the platform, so the protocol, the type of receiver, and any configuration attributes. You're also going to have your exporters. This is the destination of the data: it is going to be some sort of endpoint, with some sort of authentication to that endpoint. And then you have everything in between, where a lot of magic can happen: your processors. This is the relationship between the receiver, your processors, and the manipulation you do on top of the data, all prior to it reaching its destination via the exporter. The cool thing is that it's not one-for-one. You can wire multiple receivers into one destination, or one receiver into multiple destinations.
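To make that concrete, here's a minimal sketch of what a collector configuration might look like, with multiple receivers feeding a traces pipeline that fans out to multiple exporters. This is illustrative only: the endpoint, the user.email attribute key, and the exporter choices are placeholders, and the exact processors available depend on your collector version.

```yaml
# Hypothetical collector config: receivers in, processors in the
# middle, exporters out, wired together under service.pipelines.
receivers:
  otlp:
    protocols:
      grpc:
  zipkin:

processors:
  batch:                 # forward data in batches rather than one-in-one-out
  attributes:
    actions:
      - key: user.email  # placeholder: strip a PII attribute in-stream
        action: delete

exporters:
  otlp:
    endpoint: "observability-backend:4317"  # placeholder endpoint
  logging:               # a second destination, handy for debugging

service:
  pipelines:
    traces:
      receivers: [otlp, zipkin]        # multiple receivers in
      processors: [attributes, batch]  # in-stream manipulation
      exporters: [otlp, logging]       # one stream, multiple destinations
```

Notice that the pipeline is just wiring: swapping a receiver or adding an exporter is a config change, not a change to how the application itself is instrumented.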
Like I said, the hope is that there's standardization around OTLP, but organizations need to adapt, and this is exactly the point. You might be changing your receivers, but you don't have to change how you collect the data. You might be changing your exporters, but you don't need to change how you collect the data, because that is an immutable element of your environment and the infrastructure you're managing. And as I said before, there is the capability of building extensions, or add-ins, that might be specific to what you're doing. The community has built a lot of these; they're always interesting to look at, especially for business logic and manipulation of the data in-stream.

So let's take another look at how this all comes together. You have the agent; you have client libraries at the application level; or you have tooling at the infrastructure level. We haven't yet talked about serverless: there may be certain wrappers around serverless so you can get telemetry data from serverless applications as well. All of those go through a gateway and eventually make it to your observability tool.

How you get started is more or less the same whether it's traces or metrics. In your client library, or via an agent that handles the instantiation for you, you instantiate a tracer or a meter. Your infrastructure or application creates metrics or spans. You enhance those with metadata, potentially tags, et cetera. And then you have some sort of observer of that data prior to sending it to your monitoring plane. The binaries are available in all the forms you would expect, Linux and Windows, but there are also YAML and Helm charts for Kubernetes. And you can imagine all of these packages expanding as the project moves into other areas. It is written in Go, so it is a compiled technology, and that's important to understand because it creates a lot of stability and supports the performance of the application.

Now let's take a quick look at how this is implemented, and I'll say this part is somewhat trivial. This example is in JavaScript. Basically, the process is: you create your tracer, as we said. You start creating span data from your application: you start your span and you close your span. Inside your span, you apply attributes and the values associated with them. I want to highlight that aspect, because once this data gets to the monitoring plane, these attributes and the relationships between them are a key element in reducing MTTR for your application. When you're triaging and troubleshooting, you need to be able to use these attributes as dimensions for looking at a service map, which, as I showed before, is the relationship between all your services, as a way to identify the root cause of issues; in a microservices-based application, the service that delivers the alert is often not the source of the alert. That's exactly why you need traces and spans. Within those traces, you want to be able to look at the dimensions to identify which service the issue is happening in, whether it's an application issue or an infrastructure issue, and then double-click into things like the customers being impacted, via tenants, or the version, because you might have multiple versions of a service deployed across your application. A very, very important element.
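As a rough illustration, here's what that flow might look like in JavaScript. Treat this as a sketch rather than copy-paste code: package names and APIs differ between OpenTelemetry JS releases, and the service name, span name, and attribute keys are hypothetical.

```javascript
// A minimal sketch of manual trace instrumentation in Node.js.
// Package names and APIs have shifted between OpenTelemetry JS
// releases; the shape of the flow is the point, not the specifics.
const { NodeTracerProvider } = require('@opentelemetry/node');

// Register a tracer provider (in real use you'd also configure a
// span processor and an exporter here so the data goes somewhere).
const provider = new NodeTracerProvider();
provider.register();

// Create the tracer; 'checkout-service' is a hypothetical name.
const tracer = provider.getTracer('checkout-service');

// Start a span for one unit of work inside the trace.
const span = tracer.startSpan('process-order'); // hypothetical span name

// Attributes are the dimensions you slice by when troubleshooting:
// tenant, version, and so on. These keys and values are hypothetical.
span.setAttribute('tenant.id', 'acme-corp');
span.setAttribute('service.version', '1.4.2');

// ... the work the span measures happens here ...

span.end(); // close the span so its duration is recorded
```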
And again, how you define these attributes is an information architecture challenge, not a technology challenge. You need to think about this in advance for successful troubleshooting. You can always update it later, of course, but it's better to have a good understanding up front.

So, to go back: that was a manual implementation in JavaScript. It's nice not to do it manually, so seeking out auto-instrumentation for your stack, for your programming language, is very helpful. This is how it's done with Java, and it of course makes everybody's lives easier; especially when you want to do this at scale, auto-instrumentation is critical. Always keep your eye on the roadmap for it. If auto-instrumentation is available, leverage it. If it's not, maybe you default to agents instead. The other important thing is not to mishmash your auto-instrumentation; use one and only one.

The same is true with metrics. They're easy to implement today. The auto-instrumentation doesn't exist yet, but the implementation is not difficult; you'll see a quick sketch of it below. You instantiate the meter; within the meter, you define your units of measurement; and then you create an observer to get that data out.

Now, I want to reiterate a few things, and I'm going to leave this screen up for a few seconds so you can screenshot it, because there are a lot of great resources here. But before we go through those: OpenTelemetry is a CNCF open source project, which is critical to unshackling how you get data and telemetry from your infrastructure and applications into your management plane. It brings the additional benefit of processors and logic that you can apply to that data in-stream, and the ability to create multiple paths for that data. The reason we do this, though, is not just to streamline and create more efficiency in our environment. There's also the business side: making sure that as your delivery chain changes and as your management plane changes, you do not have to change how you think about collecting that data. That is absolutely critical, because if a proprietary agent determines how you manage your operating plane, it can be the difference between saying yes or no to something that could be extremely useful in your environment. OpenTelemetry starts with the standards; the collector and the agent are implementations of those standards. They are composed of receivers, which take metrics, traces, and spans, and in the future logs, from your application and infrastructure; exporters, which deliver that data to your management plane; and, in the middle, a bunch of amazing technology for processing, manipulating, and creating resilience around that data. We firmly believe in this project, the community believes in this project, and I hope you consider it as the foundation of the relationship between how you monitor and operate applications and the applications themselves.

There's a ton of information out there, and I've given you a ton of information, I understand that. But here are some great resources around the specification and the collector. There is a fantastic CNCF demo from Steve Flanders, somebody on my team who has taught me pretty much everything I know about OpenTelemetry and is a big contributor to the project; it covers everything I showed today, including a demo.
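And here's the metrics sketch I promised, under the same caveats as the tracing example: package and method names vary by OpenTelemetry JS release, and the meter name, instrument name, unit, and getQueueDepth helper are hypothetical stand-ins.

```javascript
// A minimal sketch of metric instrumentation in Node.js: instantiate
// the meter, define what you measure and its unit, then register an
// observer to pull values out of your application.
const { MeterProvider } = require('@opentelemetry/sdk-metrics'); // package name varies by version

// Instantiate the meter; 'checkout-service' is a hypothetical name.
const meter = new MeterProvider().getMeter('checkout-service');

// Define the measurement, with a unit of measurement attached.
const queueDepth = meter.createObservableGauge('queue.depth', {
  description: 'Orders waiting to be processed', // hypothetical
  unit: '{orders}',                              // hypothetical
});

// Stand-in for your own application code.
function getQueueDepth() {
  return 42;
}

// The observer callback pulls the current value out of your
// application each time the SDK collects, prior to export.
queueDepth.addCallback((result) => {
  result.observe(getQueueDepth());
});
```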
There's also a webinar on tags: I told you that tags are a critical aspect of getting value and utility from the data once it's in your monitoring tool, and that's something the team needs to determine from an architecture standpoint, so there is a webinar for that here. I also wrote a blog post on the business value of OpenTelemetry on devops.com; you can view that, along with a lot of information on statistics around the project, more detail on resources, and learning more about observability, traces, and spans in general.

Thank you for your time. Please reach out and connect. Let me know if you have any questions, what you thought of the session, or just that you saw it. I always love connecting with people who have joined my talks. Thank you, have a great day, and enjoy the rest of the event.