So first, what is observability and why are we talking about it? Let's start with a little bit of level setting. Systems, whether cars or software applications, have functional requirements. A functional requirement specifies what a system is supposed to do. A car, for example, transports people from one location to another. Systems also have non-functional requirements that specify how that system is supposed to be, in other words, defining its qualities. Take comfort or safety for this example: protecting the occupants from harm during transport is one of the qualities you want from a car. And these are just a few of the many "ilities" of modern software and engineering applications: usability, scalability, testability, readability, maintainability, availability. Observability is now one of these ilities. And different applications need different amounts of each quality; maybe the blue app needs twice the scalability of the green one in this example.

Werner Vogels, the CTO of AWS, actually explained observability really well in one of his keynotes for re:Invent in late 2020, so I'll play his video and let him describe observability:

"When we talk about the history of dependability, I mean systems theory. The related field is systems control theory, which has been crucial in building many dependable industrial and other systems. The most important pioneer in this field was Professor Rudolf Kalman. In 1960 he defined concepts such as whether a system was controllable and observable, or actually unobservable. To be observable is something we know as software engineers all too well: how can you infer the internal state of a system from its outputs? This can be functional outputs like the voltage and amperage of the turbines, or non-functional outputs like the turbine temperature sensors or rotation speed measurements. And this is what we try to achieve with the observability property of a dependable system. How can we infer the internal state of our digital systems from their outputs? This can include both functional outputs, responses to your requests for example, or requests to other parts of the system, and non-functional information that we collect for other needs."

So: inferring the operational state of a system based on its outputs. You've got your inputs, your system, which is opaque, and then you have outputs, which you can see, measure, and analyze. And observability has a quality scale itself, just like scalability or maintainability. On one end, your outputs leave the system completely opaque and you can't see into it. On the other end, you have excellent observability: your outputs render the system completely transparent, and you can see everything going on within your application and system.

To illustrate this, we're going to time travel back to the early 2000s with a very, very simple PHP app. Maybe this PHP app just took an image, extracted URLs or metadata from the image, and output a simple page that let you see those metadata and attributes of that image.
Well, if you had just server monitoring of the CPU and memory of the host where your application was running, and then maybe you added a little bit of Apache application log monitoring to see the amount of traffic, the requests per second, and the number of errors, you were actually doing pretty well for this simple PHP app. Your observability quality scale might be toward the high end for something this simple. Why? Because while this was a very simple application, back then, 10 or 15 years ago, there was a mostly identical user experience through the entire application stack. All of your visitors could essentially be thought of as having the same experience; there was no individual in-application experience that differed from one visitor to another. And if you threw in a little bit of testing above and beyond this simple server and Apache log monitoring, you were doing pretty well and seeing very high coverage of your stack, because there were limited pathways for interacting with your application. Traditional monitoring is very low dimensional: we treated all of these experiences as the same. We could measure, but we lacked the contextual differences that separate one app experience from another.

And now let's talk about why monitoring is not observability, again with Werner Vogels, the CTO of AWS:

"Classical monitoring deals with two questions: what is broken, and why is it broken? Monitoring uses a predefined set of metrics and logs to determine normal values. In general, with monitoring and alarming, you can't predict when things will fail; you can only take action when they do. It is why we used to call the people who managed these systems operators, and not engineers. The generators here at Sugar City are extremely complex. It is very likely that the operator of the dashboard knew how to repair them when something went wrong, but they only knew when they saw the alarms and the gauges move away from where they were supposed to be. As systems continue to increase in complexity, it is impossible to put every important metric for that system on a dashboard, let alone watch them all. Think about everything that goes into a modern application. There are metrics for the servers, containers, and functions that you're managing. Your application has counters and logs for all the work it's doing. You may have anywhere from thousands to millions of customers, all of which have data about what they're doing and how they interact with your application. It is impossible to put all of this on a dashboard that a human watches, or to define alerts for each of these metrics to tell you when something is out of spec. At Amazon, we've been on a 25-year journey to improve the processes of managing our systems, and we've long left the notion that just monitoring was sufficient to manage them. We've embarked on a holistic approach to operations, from collecting massive amounts of data and logs, to how we analyze them, to how we solve and talk about problems when we do have them. And this is what observability is all about. How can we make sure we have the data, the tools, and the mechanisms to quickly resolve problems in a fundamental way? How can we infer the system's internal state from the data we have, rather than reaching into the system? At Amazon, our most important drivers have always been customer-centric:
Find and solve problems before they impact customers. Understand the impact on your customers when we couldn't prevent it. And fix the problem so that it never happens again."

So he made a really important point here: monitoring is not observability. With applications these days growing more and more complex, with more services, more interoperability, and more unique individual user interactions, you can no longer fit your monitoring, and the amount of data you need for observability, on dashboards. Systems evolve from simple to complex. This is true for applications today, and it's also true for, say, a telephone. Think about the candlestick telephone of the 1900s: it had different requirements and different complexities than the modern smartphone of the 2020s. Different usability, different scalability; all of the ilities were different through these generations of telephone evolution. It's true for automobiles and for software. You can't imagine software like Facebook or Google Calendar or Twitter existing today as they were 10 or 15 years ago, given the amount of features and interactivity, infinite scrolling, AI applied to feeds, all of these things that make these applications more and more complex. And complexity, like entropy, only increases. It's true for phones, and it's true for applications.

So monitoring is actually now a subset of observability. With that early candlestick telephone and traditional, basic monitoring of the system, you might actually have been doing pretty well on observability as we talk about it today, covering all of the performance qualities of that system. But that's just not true today, not for smartphones and not for applications with traditional monitoring. Again, using the example of the PHP app from the early 2000s: server CPU and memory plus Apache logs might have been pretty excellent back then, but that limited set of throughput and server metrics doesn't cover your observability needs for a modern application today. Everybody visiting your PHP app 10 or 15 years ago never had exactly the same user experience, but it was often a reasonable approximation. Now it's not reasonable at all. Modern apps are much richer and offer a far wider variety of user experiences; it's different for essentially everybody using applications these days. Once you log in or start interacting with a personal account on these modern applications, there are so many ways the application can behave, respond, execute different things, and call various APIs or external services that are totally unique to each individual's situation.

And so monitoring really is the tip of the observability iceberg. Traditional monitoring is low dimensional and low cardinality: aggregated metrics and summary statistics are what we used for monitoring until recently. And that tip sits on a huge iceberg of contextual, high-dimensional, high-cardinality data that you need in order to achieve true observability. For example, here's a request latency histogram showing two peaks, one around the one-second mark and another around the seven or eight-second mark.
So you get the distribution of this aggregated request latency metric with its two peaks. Well, if you attach some contextual data, tags on this metric, we can break it out and get extra dimensions: whether the client is desktop or mobile, or whether the location is a certain city. If you attach that geographical and platform information, now you can break the chart out and start to see the composition of that overall distribution. And if we go down through the desktop breakdown, and lower to the bottom charts of those metrics, we can see that the second peak actually matches New York City, Boston, and D.C. desktop users. So we know that's likely the problem area on the higher end of this latency histogram. The good thing is that modern monitoring is starting to bubble a little bit of this contextual data up, to give you a bit more automatic insight into what those metrics may mean.

And that's just a few dimensions. So people have been taking the approach of: well, we can handle that, we'll just add this contextual data to our metrics, make more dashboards, and add more monitoring for these dimensions. But that's just more monitoring and more dashboards, and your dashboards keep multiplying as you add more context. As Werner said, it's impossible to observe your systems using only the monitoring metrics and dashboards we've used up to this point.

So how do we achieve the excellent side of observability, making our systems and applications as transparent as possible? When you read about observability, or the traditional pillars of monitoring, you'll read about metrics, traces, and logs. We like to think instead of four constituent elements of observability, which we define as logs, metrics, spans, and events. Logs are your traditional logs, but they include attributes and context identifiers so you can correlate these highly specific debugging tools with the metrics, spans, and events that happen elsewhere. Metrics are the aggregated summary statistics we're used to from traditional monitoring, again including attributes and context identifiers; these are the threads that sew the logs, metrics, spans, and events together into your observability system. Spans are timed, tagged sections within your services, code, or stack that let you know where a request was during a specific period of time. And events are the discrete, individual things that happen at a point in time within a span, again with as many attributes and context identifiers as we can attach. The analogy here is that these constituent elements, logs, metrics, spans, and events, aren't necessarily that interesting or useful by themselves, but together they compose the elements and molecules and eventually the planets and the universe. These are the constituent elements of observability, and the context and attributes are the key to observability.
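To make the latency histogram example concrete, here's a minimal sketch using the OpenTelemetry Python metrics API. The metric name and the attribute keys (platform, city) are purely illustrative choices for this talk's example, not anything Scout or OpenTelemetry requires:

```python
from opentelemetry import metrics

# Acquire a meter from the globally configured MeterProvider.
meter = metrics.get_meter("example-service")

# A histogram captures the full latency distribution, not just an average.
request_duration = meter.create_histogram(
    "http.request.duration",
    unit="s",
    description="Time taken to serve an HTTP request",
)

# Recording with attributes is what makes the metric high-dimensional:
# a backend can later slice the single aggregate chart by platform, city, etc.,
# and surface that the slow second peak is desktop users in New York.
request_duration.record(1.1, {"platform": "mobile", "city": "Chicago"})
request_duration.record(7.8, {"platform": "desktop", "city": "New York"})
```

Without those attributes you'd only ever see the combined two-peak chart; with them, the breakdown by platform and city described above becomes a query rather than a new dashboard.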
So here's an extremely simplified example of what observability may look like in the context of application performance tracing. Let's say a user triggers an action on the front end of your application. A trace, or a root span, starts here, automatically generated by your observability instrumentation, and you can attach identifiers, contextual information, to it: the front-end version, the page route the user took, even user metadata like their ID, the product plan they're on, which device they're using, their location. Are they in Chicago? Boston? D.C.? That front-end call then makes a back-end call to your server, hosted on your infrastructure, and you can attach back-end metadata to that: which endpoint was hit, which host it was, which container ID, which Kubernetes cluster, things like that. That back-end API call makes a database call, and we can create a metric off of that: maybe we want to count how many times this database call was made and surface that as a summary statistic later on. Then the database query is executed over the database connection, and we can record when the connection was actually established as a point-in-time event. When the database query returns, we can log that with specific database call information; we could log different details about that call depending on the situation. So now we've made a counter metric, an event, and a log.

And maybe we make an external service call. This could be an API call between microservices within your own infrastructure, or a call to a third-party hosted API, since increasingly SaaS and third-party vendors expose API endpoints for services that are critical for your application to execute. When that external service call returns, we can log that as well. We've built a response and we send it back to the back end, and that goes back to the front end. But maybe the front-end action is not completely done yet and needs to make another back-end call. We've still got that top-level front-end action identifier attached to all of these other calls: the first back-end API call, the database call, the database query, the external services. And now we're making a second back-end call, maybe to load a paginated list of data in your application. Again, we make a database query, record the connection being established, and when it returns the items we make a count metric out of that as well. Then maybe it preloads the second page of paginated results, and we can thread all of this through and connect all of these different events back to that single triggered front-end action. So it's highly contextual and correlated; this is distributed tracing. And we can log the action outcome metadata as well. You can see that context connects everything: the more context and data we can attach to these events, traces, logs, and metrics, the better we can correlate all of this information together and search and surface insights about all of these related events.
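Here's a hedged sketch of that flow using the OpenTelemetry Python tracing and metrics APIs. The span names, attribute keys, and event name are hypothetical stand-ins for this talk's example; the point is how nesting spans propagates context so every child operation is correlated back to the top-level front-end action:

```python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("example-app")
meter = metrics.get_meter("example-app")

# Counter metric: how many times the database call is made, to be
# served as a summary statistic later on.
db_calls = meter.create_counter("db.calls", description="Database calls made")

# Top-level span for the user-triggered action; everything opened inside
# this block is automatically correlated to it through trace context.
with tracer.start_as_current_span("frontend.action") as action:
    action.set_attribute("app.version", "1.4.2")    # front-end version
    action.set_attribute("user.id", "12345")        # user metadata
    action.set_attribute("user.city", "Boston")     # location

    # Child span: the back-end API call, with back-end metadata attached.
    with tracer.start_as_current_span("backend.api") as api:
        api.set_attribute("http.route", "/api/items")
        api.set_attribute("container.id", "abc123")

        # Grandchild span: the database query.
        with tracer.start_as_current_span("db.query") as query:
            db_calls.add(1, {"db.name": "items"})
            # Point-in-time event: the moment the connection was established.
            query.add_event("db.connection.established")
```

A second back-end call made inside the same `frontend.action` block would get the same treatment: another child span, still threaded back to the one triggering action.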
And until recently, you could do this to an extent: you could buy it from a vendor, typically a large one, or you could try to build a lot of it yourself from scratch, and it was expensive either way. There are two key trends changing this. The first is cost-effective, performant disk storage and compute resources for high-cardinality big data. The second is open source observability APIs, SDKs, and tooling, implemented collectively with broad support from the community and vendors.

We'll talk about the second one first. OpenTelemetry is a fairly new project that's been getting a ton of vendor and community support and is emerging as the new lingua franca of observability. OpenTelemetry defines a cross-language specification for how the APIs and SDKs responsible for collecting and configuring telemetry and instrumentation in languages, frameworks, and libraries should behave. It allows the standard to be applied uniformly across all of the major languages, each of which has its own libraries written to adhere to the spec. The API and SDK libraries are written in the native language of each programming language and implement the specification the same way, so we have cross-language APIs and SDKs that all handle the data in a standard manner. And then there's the OpenTelemetry collector and agnostic export formats, which allow these APIs and SDKs to send data to a collector and on to multiple back ends, whether that's your own destination, open source tools like Zipkin or Jaeger, or third parties like Scout. Scout's application performance observability, as we're evolving into it, is built on OpenTelemetry.

Some facts about OpenTelemetry: it's now the most active Cloud Native Computing Foundation project behind only Kubernetes. It's backed by over 130 companies, including big ones like AWS, Google, Microsoft, Facebook, Splunk, and Scout. It's integrated directly into open source projects and cloud native stacks instead of injected into them. And it's simple to use, with extensive ability to customize instrumentation specific to your environment.

And so that's great, right? There used to be a wall between library and framework maintainers and the observability instrumentation that a vendor like Scout would write itself and apply as an outside-in patch to those libraries and frameworks in order to gather telemetry data. We could never have approached a package maintainer like Rails and said, hey Rails, we want to put Scout APM's observability instrumentation into the Rails framework and commit it upstream to the Rails project. There's no way they would accept vendor-specific telemetry into the Rails project. Now that there's a vendor-agnostic, standard way to apply this telemetry, using APIs defined through the OpenTelemetry specification and available to open source library, package, and framework maintainers, that wall is breaking down. It's now more acceptable, and more libraries, packages, and frameworks should be willing to incorporate OpenTelemetry, so that it's built natively into these libraries and frameworks instead of injected by vendors after the fact.
So let's talk a little bit about the components of OpenTelemetry. There's the specification, which defines how things should be laid out; it sets the standards for naming and operations. There's the API, the application programming interface that library and framework maintainers use. There are the SDKs, which are meant to be used by end users within their applications to configure and use OpenTelemetry. And then there's the collector, a means of collecting the data sent from these SDKs, acting as a kind of proxy for further aggregation or processing of the OpenTelemetry data before sending it on to the back ends. The collector, though, is optional.

So, the specification. The specification defines how the API should behave. It defines the standard naming, the stability guarantees, the separation of concerns, the communication protocols and data formats, and the backwards compatibility guarantees it makes for the API and the SDK, et cetera. Semantic conventions are just the way that we name things. These could be keys like container ID or cloud region: standard key-value names for where to find contextual information, so we can connect all of the logs, metrics, spans, and events together, weave them into an observability picture, and slice and dice them through any combination we want to investigate. And OTLP, the OpenTelemetry Protocol, is a transport protocol that defines how the communication between the SDK and the collector should behave.

The purpose of the API component of the OpenTelemetry project is to develop an API library in every major language, to be used by library and framework developers to add instrumentation natively into their library or framework. Library authors won't care whether any of their users use OpenTelemetry or not: they implement the API and put in the hooks to collect telemetry information, and if an end user is not actually collecting OpenTelemetry data from the library or framework, it has no overhead. They won't need to distribute an instrumented version of their library or framework alongside a non-instrumented version; it's completely transparent whether you are or aren't collecting the OpenTelemetry data, with zero overhead if the user does not enable OpenTelemetry in their application. And the API provides very strict stability guarantees, so library and framework authors won't need to worry about whether the API is going to change or break things; it gives them very stable, strict guarantees to ease their concerns about affecting their library or framework.

The SDK piece was intentionally decoupled from the API piece, so they are separate packages in every single language. The purpose is to keep the API super stable while the SDK, with its user-facing interfaces, can evolve and change more quickly as needed. Available in every major language, the SDK is intended to be used by the end users, in the applications that serve your users.
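To illustrate that API/SDK split, here's a minimal sketch in Python, where the split really is two packages (opentelemetry-api and opentelemetry-sdk, with the OTLP exporter in its own package). The library name, span name, and function are hypothetical; the endpoint shown is the conventional local OTLP gRPC port:

```python
# --- Library author's side: depends only on the opentelemetry-api package.
# If the end user never configures an SDK, these calls are cheap no-ops.
from opentelemetry import trace

tracer = trace.get_tracer("my-http-library")

def handle_request(path):
    with tracer.start_as_current_span("my_http_library.handle_request") as span:
        span.set_attribute("http.target", path)
        ...  # the library's actual work

# --- End user's side: the SDK is configured once, in the application,
# wiring the API's hooks to a real exporter that speaks OTLP.
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "my-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```

The library code at the top never imports the SDK, which is exactly why maintainers can ship instrumentation without caring whether their users enable OpenTelemetry.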
You enable OpenTelemetry data collection using the SDK and start consuming the OpenTelemetry data emitted from any of the libraries that you use. You use the SDK to set configuration options for what gets collected, what processing is applied after collection, and where the telemetry is sent. As I just discussed, it's intentionally decoupled from the API for stability guarantees and dependency flexibility. And all languages with an OpenTelemetry SDK can send data via OTLP natively. That's an important point: if you're using OpenTelemetry within your application and you set up the SDK to gather this data, you don't need the separate collector binary in order to send the data somewhere. That makes the collector portion of OpenTelemetry optional. The OpenTelemetry collector is an application written in Go that receives OpenTelemetry data from end-user applications. It can use pluggable modules to further process the payloads, such as generating metrics, doing aggregations, applying transforms, rate limiting, and so on. And it sends the data on, either upstream to another collector, or for further processing, or to data stores and SaaS providers like Scout.

So where does that leave Scout, with observability in mind, and what's our roadmap for 2022? The Scout product roadmap for 2022 is transforming from application performance monitoring to application performance observability, and eventually into full stack observability. In Q4 of 2021, we released the external services feature of Scout, which lets you understand and monitor requests to external services, HTTP API calls to third parties or to other microservices. That starts to get at a bit of that observability piece, as far as external services go, within the existing Scout platform. Last year we also started building a new platform specifically geared toward application performance observability utilizing OpenTelemetry in 2022. So we're expanding from application performance monitoring to application performance observability: ingesting gigantic amounts of trace data from applications in order to provide that observability piece, with all of the contextual information gathered from OpenTelemetry instrumentation, and letting you glean insights into your application based on that data. Later this year we'll move beyond application performance observability into full stack observability, things like logging and events: hooking up and gathering OpenTelemetry data from further systems, not just applications but systems and infrastructure that emit OpenTelemetry metrics, logs, and even traces, outside of the traditional APM applications we've focused on so far. And then observability 1.0: we have a very aggressive timeline of coming out of beta by the end of this year.

So, OpenTelemetry and how it's changing data collection in Scout. Right now, the Scout agent sits in your application as an in-process thread that collects telemetry data from instrumentation we've written ourselves. It collects metrics for every single transaction and sends them to Scout. And then, in addition to those aggregated metrics, it picks out interesting, detailed traces, selected by an algorithm, to give you some insight.
That's the beginning of some observability within your application, into what may be slow or where users' pain points may be. But here's the difference between how the Scout agent works now and the shift to OpenTelemetry: with OpenTelemetry, the OpenTelemetry instrumentation will sit within your application, and traces of every single transaction will be sent to Scout. Scout's value add, what we need to surface for application performance observability, lies in the insights we can automatically provide to your developers or performance engineers about your applications, without overloading you or requiring you to build any dashboards to get those insights. That's going to be the challenge of observability in the face of just enormous amounts of data collection.

Just as the Hubble telescope changed the observability of our universe, OpenTelemetry is a very large piece of how we can enable observability within our applications and infrastructure over the next few years. What may be a dark spot, a blind spot, in your application performance now will change with the addition of OpenTelemetry. Just as Hubble zoomed into the darkest spot we could find in the observable sky and showed it wasn't dark or empty at all, but full of whole galaxies we never knew were there, we can now observe with modern instrumentation what was previously invisible. That is what OpenTelemetry is bringing to the observability and monitoring space.

OpenTelemetry is a somewhat new project, around two years old, and the tracing portion of it is stable, being released as stable in the major languages now. JavaScript, Ruby, Java, Go: those languages have, or are approaching, stable releases of the API, the SDK, and the protocol for tracing. OpenTelemetry's timeline has metrics and then logging following within this year and early next. The protocol and API for metrics are already stable, the SDKs are in feature freeze, and you're starting to see development work in these languages on the metrics API and SDK. After that will come logging, so that the whole of the OpenTelemetry project encompasses the three traditional pillars of observability.

Our development follows the OpenTelemetry timeline very closely. We're not trying to build a vendor implementation on top of the OpenTelemetry project, and we're not trying to put any shims between us and OpenTelemetry. We want to be part of the OpenTelemetry ecosystem and contributors to OpenTelemetry, so that it's available for everybody. And we don't want to go from the walled gardens of traditional APM and logging vendors to an open standard, only for it to migrate back to being a walled garden five or so years from now. This OpenTelemetry project is super important, and it is changing the game of monitoring, observability, and logging for all of the larger vendors, so we want to keep it that way.

So what does this mean? If you're a current Scout customer, the choice is yours. You can remain on the existing Scout APM product, and we will continue to support and improve the existing Scout platform in 2022 and beyond. Or you can choose to go hybrid.
You can keep your existing applications on Scout APM and put any new applications that you hook up to Scout on the observability platform, or mix and match your use cases as desired. The great thing is that you can run both the existing Scout agent and the OpenTelemetry instrumentation side by side, at the same time, and you can time the switch to happen when you want. It can be an incremental switchover as our new platform matures and adds features for the things you're already getting in the Scout platform.

So in conclusion: observability is coming soon to a theater near you. If you want to learn more about observability within Scout, sign up for our newsletter at scoutapm.com/observability. Through the newsletter you can stay informed, opt into the beta of the new platform, and stay up to date with Scout's evolution into application performance observability and, eventually, full stack observability.