All right, I want to thank everyone who's joining us. Welcome to today's CNCF webinar, "The Whats and Whys of Distributed Tracing." I'm Libby Schultz and I'll be moderating today's webinar. We'd like to welcome our presenter, Dave McAllister, senior technical evangelist at Splunk. A couple of housekeeping items before we get started. During the webinar, you're not able to talk as an attendee. There's a Q&A box at the bottom of your screen; please feel free to drop your questions in there and we'll get to as many as we can at the end. Not the chat: be sure you're dropping them in the Q&A box. This is an official webinar of the CNCF and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code. Basically, please be respectful of all of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page at www.cncf.io/webinars. And with that, I will hand it over to Dave to kick off today's presentation. Take it away, Dave.

Thanks, appreciate it. Hi everyone, I'm Dave McAllister. I work for Splunk as a technical evangelist. At Splunk we have a saying that every person is made up of a million points of data, so let me share three quick ones with you. I'm owned by three cats, so I'm very good at being ignored. I spent 10 years as a soccer ref, so I'm used to people disagreeing with me; in fact, I can pretty much guarantee that at any given moment, at least half of the players on the field would disagree with any call I made. And finally, I'm married, so I have a witness that I don't read minds. Please make use of the Q&A box that was just pointed out to you so that we can get your questions answered as well.

I'm going to be talking about tracing, but I'll start by talking about this concept of observability that has come into play recently. It's obvious that data is the driving factor for observability, and there are lots of pieces to it, including the three classes of data we talk about: metrics, traces, and logs. But imagine, for instance, if you never got all the data, or you missed ephemeral data that has since disappeared, or your data showed you something you couldn't drill into, or you only saw the good stuff, or only the bad stuff.

So why is that important? Part of it is that our world is changing; our application space is changing. We're now in a microservices style of architecture, where services are independent, usually loosely coupled, and can expand or contract as needed. Certain things have become easier: CI/CD has become easier in some ways, and simple testing has become easier. On the testing side we can bring in synthetics and even look at passive monitoring. But nonetheless, we now have a lot of moving parts. And when we throw this into a cloud, or into a hybrid cloud where part of it lives inside an existing data center as well as in a public cloud, life becomes a little challenging. Each request coming into the front end over HTTP may touch the cart service, may touch all of these places, and so each of these pieces gets touched in each transaction.
Now, the nice thing is that our tools tend to be smart enough to map the transactions for us. But when we get into tracing, we start looking at what that really means. In the simple walkthrough I just did, for instance, the request came in at the front end. Overall, the call through the environment took about one and three-quarter seconds, of which slightly over one second was in the checkout service, which then breaks down into each of these pieces. This is where tracing really excels: showing us what's happening inside the application on a service-by-service basis for a request of interest. Looking a little deeper, the transaction shows us what's happening in the microservice, both the functionality within that microservice and the fan-out into the other microservices. So think of it this way: the trace is all of it grouped together, and the span is each individual portion, or any particular portion that we choose to measure.

And that goes across time; these are all time-series based. So take the same functionality, the same application we were looking at: we see that the CPU is now pegged at 100%, and it's been that way for over five minutes. That starts leading to some necessary questions. CPU at 100% may not necessarily mean something is wrong; there may not be an error, but it can mean a delay, it can mean communications problems. And we notice that at the same point in time, we had a new code push: the front-end version changed. Microservices change all the time. We've seen companies push 100 different versions of different microservices on a daily basis, and I suspect some of you may even be exceeding that. So we've seen an event that is a potential cause. And what's actually showing up is "service not available" at the far end of this thing: HTTP 503 error codes in the shipping service. This is what tracing is extremely good at. We can go from the error all the way back to the underlying causes. In this case, something's wrong with our latest push, so auto-remediate by rolling back to a formerly known-good version, and then start digging into the underlying problem. What this really leads to is that each of these pieces in turn gets impacted, and they will all show up in your overall environment.

So, in shorthand: observability is not the microscope. It's the clarity of the slide under the microscope; it's what the data can tell you. In our old world of black-box monitoring, or far enough back, walking into the data center to make sure all the little servers had their lights blinking at us, that was enough. Now we need more information and more detail underneath. And to do that, our data collection is also beginning to change because of the nature of our underlying structures. We now need standards-based agents; we can't live with data coming in or going out on a proprietary basis. We need cloud integration that understands how to deal with the public clouds, the private clouds, even our legacy data centers.
We need the code to be auto-instrumented. There's been a huge amount of discussion about whether you should hand-code all your instrumentation into an application. But think about how often services get rolled out across these incredibly large environments now: if every time we changed something we had to rebuild manual instrumentation in our code base, we'd have a bit of a headache over time. Likewise, we want to support all the developer frameworks that make sense. We want any code at any time. And honestly, as we expand the scale, we don't want any limits on how we can slice and dice this data to find out what we're interested in.

Now, we've seen a lot of this before. Metrics is a pretty well understood area, and logs have been around forever. If you go far enough back into the history of computing you'll run into a concept called wolf-fence debugging, which fed into the original ideas of isolating where in an application things go wrong. But for tracing, it really started with OpenTracing, another CNCF project, which defined how we could make use of tracing and its information. A second project, OpenCensus, came out of Google in particular; the two were in the same area but not quite the same. Fortunately, those two groups got together and created a new project, OpenTelemetry. It combines both OpenTracing and OpenCensus, including backward compatibility, so you can make use of your existing environments as well as what's coming out next with OpenTelemetry.

Think about the data coming in as telemetry, and the telemetry verticals come up in many places. I mentioned, for instance, these three classes of data, sometimes called the three pillars: tracing, metrics, and logs. We now want those for every language of interest. We want canonical implementations for every language of interest. We want the data structures set up so that they're standard across all these environments and can be pushed in multiple ways. And then the formats, the wire formats, the trace context, the data structures themselves, are all standard so we can interoperate between them. Think of these as the layers of activity for each of the verticals. Each vertical may have a different approach, but each vertical must have each of these layers. Each layer expands out rather large for each language. It's easy to say there are too many languages to do this, but if we say we're going to do the major languages and the major data structures, that's probably a great place to start, because we're also open: if you're interested in doing a particular one, that's available to you as well. The nice thing is that OpenTelemetry is actually addressing these categories and layers, and is working across all of them. Tracing is definitely the most advanced; it started with existing tracing functionality from OpenTracing and OpenCensus, so it has the most behind it. Metrics is a close follower, and logs, which we'll get into a little later, are becoming the next direction. So OpenTelemetry not only supports all of the telemetry verticals, it supports all three classes in one. And OpenTelemetry is the second most active project in CNCF today, behind only Kubernetes itself.
So let's get into the heart of what we're talking about: what do you want to trace? Tracing is incredibly powerful, but it does take a certain amount of effort on your part to make it useful to you. We start by asking what problems we're trying to solve, and by the way, this is a very small subset of potential answers. In the example I showed earlier, we're trying to solve a performance issue (things are running slow), or we're trying to resolve an error issue (why am I seeing errors reported from a service?). And we want to make sure we're responding as quickly as possible: mean time to detection and mean time to resolution need to be as short as possible. Errors need to be solved as soon as possible. A Google study, recently updated for 2020, shows that a wait time of roughly 3.6 seconds will cause a bounce when someone is sitting on your site. Think about it this way: we're an instant-gratification society, and we need to make sure we're getting resolution as fast as possible. The nice thing is that tracing works with the metrics and the logs, so they provide multiple sets of context, and the more data you can throw at this, the better your answers and the faster you're going to resolve the issues.

However, when we look at the methods that prevailed beforehand, we found some gaps, and I'm sure you can add to these gaps yourself. It was difficult to tie what was observed to what was actually running. We were having to collect data in different formats: the metrics would come in on one wire format, the traces on a different one, and the logs on yet another. And even if you think you're fine and have no gaps, you may need to start looking at new things: you're developing new services, developing new code, and there's new functionality coming into play. So we quickly arrive at the question: where am I, and what problem am I trying to solve? Is it that I've got gaps, or do I just want to get to the latest functions and features?

The next question is which teams are going to make use of tracing; not all teams make use of tracing in the same ways. Tracing can be heavily used in dev environments, for instance. Tracing can be a benefit to SRE environments. Tracing may not be as important to your network management, but it's still useful. The next part is: given the teams and components that will make use of this, who's available to do the work? Again, it's been made as simple as possible, and we'll look at auto-instrumentation, but we really need to make the path to getting up and running as short as possible. And then, what's our short-term goal: getting control of our system, or understanding it? What's the long-term goal: continuing to improve the customer experience, for instance? And if you're already using tracing: are people making use of it, and if they're not, why aren't they?
And then finally, why tracing now, which we kind of answered at the start: our applications are changing, the nature of our infrastructure is changing, our environments are changing, and our customers' expectations are all changing.

I want to step a little into the architecture of OpenTelemetry itself. My apologies, I've not updated this for the absolute latest set of meetings, and I'm sure we're going to hear a lot more about what's going on at OpenTelemetry Community Day and KubeCon coming up next week. Breaking it down, the OpenTelemetry environment has a set of components, as you can see listed here: the specifications (how do we deal with this functionality), the collector (where are we getting the information, and how is it coming in), and the client libraries. Traces and metrics are in beta, logging is incubating, and the auto-instrumentation pieces are evolving; in fact, auto-instrumentation is already there for some languages.

Looking at these pieces, it's easy to break this down, so let's take a look. This is an OpenTelemetry reference architecture. An agent on a host talks to a collector, the OTel collector, or the agent can send information, traces or metrics, off to the backends of choice, whatever they are. It's agnostic: it doesn't really care what you're talking to on the back end. The OTel library that can be part of your application can automatically talk to the agent to send that same information, and the OTel collector can bring all that information in and talk to those various backends. So applications can bring data in from multiple sources and multiple places to multiple endpoints. This is a very powerful architecture. It's not one-size-fits-everybody; it's sized to fit you correctly. That's important because tracing is fairly data-intensive. There's a lot of information flowing through your tracing environment, and you really want to keep as much of it as you possibly can, so you can answer questions that you didn't even know were going to be questions when you started.

On the tracing side, which is the most advanced, we have W3C trace context; the concept is to capture the context of what's going on through the tracer functionality. A trace is a series of spans; whether it's a single span or multiple spans is part of the underlying structure. The span can tell us additional pieces of information: what kind of communication is going on; what its attributes are, its key-value pairs, its tags or metadata; events that happened (in the earlier example, the service deployment would be an event); and links that we can drill off to. We can also look at sampling functionality and span functionality, and we can export the data anywhere, in any functional wire format we want, via the OpenTelemetry protocol itself. Jaeger or Zipkin can receive it. It doesn't matter. Those are the advantages we start seeing when we bring this tracing information into play.
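To make that vocabulary concrete, here's a minimal sketch of manual span creation with the OpenTelemetry Java API. The service name, span name, and attribute key are hypothetical, and it assumes an SDK and exporter have already been configured elsewhere in the process:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutHandler {
    // One tracer per instrumented service or library (name is made up)
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("checkout-service");

    public void checkout(String cartId) {
        // Each unit of work we choose to measure becomes a span;
        // spans sharing a trace context make up the trace.
        Span span = tracer.spanBuilder("checkout").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("cart.id", cartId); // key-value attribute
            span.addEvent("payment.authorized");  // timestamped event
            // ... business logic; child spans created here automatically
            // pick up this span as their parent ...
        } catch (RuntimeException e) {
            span.recordException(e);              // attach the error to the span
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();                           // records the duration
        }
    }
}
```

When this service calls another one, the trace context travels with the request as the W3C `traceparent` header, carrying the trace ID and parent span ID, so the downstream spans join the same trace. Note that the attribute key `cart.id` above is one we invented, which is exactly where the next topic comes in.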
There is something very important to understand as we really get into this, and that is semantic conventions. In OpenTelemetry, spans can be created in as many places as you want, and sometimes the amount of granularity can be a bit of an overload; that's an issue where you need to understand what provides the best set of information for you. For well-known and heavily used protocols like HTTP or database calls, we try to unify the structure of how those things get reported, and we can think of that as semantic conventions. The conventions say: this is the standard way of doing this, and therefore the OpenTelemetry product will default to this methodology. Take HTTP: we'll see a method, we'll see a status code, we'll see an error code come out as part of the HTTP semantic conventions. We'll also see databases; databases, no matter where they live, share common pieces of functionality. Messaging systems, whether MQTT or Kafka, you pick it, you've got it there. Even function-as-a-service, the triggering functionality for the various serverless environments. These give us a common vocabulary, so that when we bring all those things together, from multiple environments, the attributes that belong together are grouped together in the same language, and we don't have to do additional work to understand what the data is telling us.

The metrics space is similar, but because metrics themselves are different from tracing, there's a separate set of conventions. Context needs to span and correlate: we want to bring the information together so we can quickly grasp the point of activity, whether over an aggregation across a period of time or at a specific point in time. In this sense it's the analog of spans. At its core is the ability to record a measurement. The measurements themselves can be raw (I want to see a single value of a measure; that's a measurement) or they can carry more: here's the name, here's the descriptor, and here's the unit of the value. A metric instrument can be a counter, it can be a single point again, or it can simply be observing the state of what's happening. And it works in key-value pairs, metadata that goes with the measurement; all of this is time-series dependent. We tend to aggregate metrics over time. Again, it's not unusual for a large environment to have tens of thousands of active measurements coming into the back-end systems for analysis and response, and if we tried to watch that data stream one value at a time, we probably wouldn't be very successful. So we aggregate common functionality together, and aggregate over the time periods that make the best sense for us. That leads to this last point, which is really about time: the time of the measurement, as well as the time window of the aggregation. Metrics derived from the tracing data allow us to look at time as a grouping, as well as dig into any specific point, and OpenTelemetry is incredibly good at this, providing a common basis for all of it.
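As a sketch of how those pieces look in code, here's a counter instrument recording key-value-tagged measurements. The instrument and attribute names are illustrative, and this uses the Java metrics API as it stabilized after this talk was given, so treat it as indicative rather than definitive:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class CheckoutMetrics {
    private static final Meter meter =
            GlobalOpenTelemetry.getMeter("checkout-service");

    // A counter instrument: the name, description, and unit travel with it
    private static final LongCounter requests = meter
            .counterBuilder("checkout.requests")
            .setDescription("Number of checkout requests handled")
            .setUnit("{requests}")
            .build();

    public void record(int statusCode) {
        // Each measurement carries key-value attributes; the backend
        // aggregates these time series over whatever window makes sense.
        requests.add(1, Attributes.of(
                AttributeKey.longKey("http.status_code"), (long) statusCode));
    }
}
```

Because the attribute key follows the HTTP convention, a backend can sum or rate this counter over any window and still correlate it with the 503s we saw in the traces.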
The resource SDK gives us a representation of the entity producing the telemetry. In that, we start looking at things like: where is this running? That's the environment question. Is it running on a host? Is it running inside a container? What defines the computing environment? Then, how did it get deployed: did it get pushed by Kubernetes, or is it a manual process? And what makes up that compute unit? In the Kubernetes world you can go from worker nodes to pods to containers to processes, and we need to be able to drill into each of those pieces as well. The resource semantic conventions mean all of these things get defined when we bring in the tracing. Which of them matter to you is a call you have to make in your environment. If you're not using Kubernetes, for instance, you probably don't care about the Kubernetes attributes, but you probably still care about where it's running and what the running environment looks like at any given moment. That gives us the ability to start tying together the application, the communications, and the infrastructure into one picture. What's unique here is that we start with tracing: we start by looking at the application and the user experience, and then drive down to what's affecting that user experience, without necessarily having to change our environments, our structures, or our underlying language, if you will.
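Here's a hedged sketch of what that looks like with the Java SDK's resource support. The attribute values are made up, and in a real Kubernetes deployment most of them would be auto-detected or injected rather than hard-coded like this:

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.resources.Resource;

public final class TelemetryResource {
    // Resource attributes describe the entity producing the telemetry:
    // what is running, where it runs, and what deployed it. The keys
    // follow the resource semantic conventions; the values are examples.
    static Resource describe() {
        return Resource.getDefault().merge(Resource.create(
                Attributes.builder()
                        .put("service.name", "checkout-service")
                        .put("host.name", "ip-10-0-1-23")
                        .put("container.name", "checkout")
                        .put("k8s.pod.name", "checkout-7d4b9c-x2x1v")
                        .build()));
    }
}
```

Every span and metric exported from the process carries these attributes, which is what lets a backend pivot from "this request was slow" to "it was slow on this pod on this host."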
Most recently, logs have come into play, and logs are definitely at an early, incubating stage, but it's important to know about them. A somewhat unique requirement is that it must be possible to map from existing log formats to this data model. So we've got to make sure we understand the logging structures that exist today, as well as new information that may be coming in. It has to be semantically meaningful; it has to be understandable and relate to the underlying structure. And we have to be able to map between log formats: a log should be able to come in as one format, be converted, and go out as another, just as the collector can bring in data in one wire format and push it out as a different one. We need these conversions so we can continue to support things that exist today, as well as what exists in the future. The log translation principle is simple: log format A should be able to be brought in and go out as log format B, and the result should be no worse than a reasonable direct translation by someone sitting down with log A and log B. We should also be able to go from log A to log B and back to log A without losing semantic meaning.

We look at three types of logs. System formats: honestly, these are the logs and events the operating system puts out to us. We really don't have a lot of control; we can't change the formats, and we can't really affect what information is included. So those are very stable, and we know what they look like. Then there are third-party applications: think standard web servers, Apache or NGINX. There, we may have some control over what the information includes, and we may be able to customize the format. And first-party applications, the ones we are writing ourselves, where we have a reasonable amount of control over the logs and events that are generated and what information we are getting from them. While there are other forms, each of these categories is considered part of the log specification: what the system logs look like and how they'll get transmitted, common third-party applications, and the applications we ourselves are writing. From very stable, to flexible, to a no-limits model, we're covering all of those capabilities for our logs.

When we get into logs, there are really two sets of fields. Top-level named fields are the ones we expect to be there every single time: we're going to see a timestamp, we're going to see the body of the log record, and we're going to see the source of the log record. We've defined these top-level fields so that we can, again, unify our language, and that language becomes part of the overall structure. Because we have a unified language, we can now also relate those logs easily back to our trace data, as well as our metrics data. And this is why we're doing all of it: it's all around evidence-based debugging. More than likely you're going to start with high-level metrics, you're going to notice something looks wrong in your metrics, and you're going to go into your tracing information; from the tracing information you drill down and go to the right places. Tracing is the driving force behind all of these pieces.

Let me touch on the collector a little, since I mentioned it earlier. The collector really is an easy way of getting data in: it's vendor-agnostic, and it receives, processes, and exports data. It has decent default configurations that you might want to tweak, it supports popular protocols, it continues to run even under higher loads, and it is itself observable, so you can see what's going on with the collector. Since it's a single code base, it can be deployed as an agent or as a collector, and it supports all three data classes. It lets us offload things from the application, without changes to the application, things like compression and batching, as well as giving us that common vocabulary, the semantics, coming into play. And since it's language-agnostic, changes are easy: we can simply put the implementation there and deal with it in that manner. It's extensible, and you can find functionality on both sides: things that are in the core, as well as things that are in the community. This matters a great deal, because we don't want data lock-in. By making use of the collector, our traces are not going to be locked in, and neither is the rest of our data.

So the architecture: on the receiver side, data can come in as various wire formats. It could be Jaeger, it could be Zipkin, it could be OTLP coming in from an endpoint, and it can be exported similarly; you choose how you get it out. In between, you can have multiple classes of processing: you can do batch processing, you can tell it to retry if it didn't see the data, you can go into streaming models, and you can build as many pipelines as you want, handling multiple kinds of functionality. So OTLP could come in and go out as Jaeger, or come in, go through a totally different pipeline, and go out as Prometheus. It's your decision how you export your data through this.

And the nice thing is that while we have lots of libraries coming into play, we also have the ability to start automatically. Take the Java example: it instruments known libraries with no code changes, at runtime. It adheres to those semantic conventions I mentioned, it's easily configured, and it can coexist if you already have something implemented. There is a warning that does need to come out here: don't use two different auto-instrumentation solutions on the same service. You'll get conflicts, and your results will be wildly skewed and probably not meaningful, so keep that in mind. But starting with Java: go grab the Java agent, drop it into place, and you automatically start getting the trace information that OpenTelemetry can give you.
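To give a feel for how little is involved, here's a sketch of attaching the Java agent. The jar path, service name, and endpoint are assumptions, and the environment variables follow current OpenTelemetry conventions, which were still settling at the time of this talk:

```bash
# Download opentelemetry-javaagent.jar from the
# opentelemetry-java-instrumentation releases, then attach it at
# JVM startup. No application code changes are required.
export OTEL_SERVICE_NAME=checkout-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317  # assumed local collector
java -javaagent:/path/to/opentelemetry-javaagent.jar -jar checkout.jar
```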
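And if you'd rather wire the export pipeline up by hand, or run it alongside the agent for your own custom spans, a minimal sketch with the Java SDK might look like this; the collector endpoint is again an assumption:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TracingSetup {
    public static OpenTelemetrySdk init() {
        // Export OTLP over gRPC to a local collector agent; the collector
        // then processes the data and fans it out to the backends of choice.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317") // assumed collector address
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                // Batching keeps per-span overhead off the request path;
                // a Resource like the earlier sketch could be attached
                // here via setResource(...).
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        // Register globally so GlobalOpenTelemetry.getTracer(...) finds it
        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}
```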
There's additional functionality coming out. The rest of the client libraries are moving to beta, and a number of them are rock solid; the tracing environments are pretty much rock solid because they came out of existing, production-quality functionality. We're looking at auto-tracing functionality for the rest of the languages that are capable of it. We want to add initial log support, and we want to make it as easy as we can. I don't know if we're going to make it this year, but we're pushing on getting log support in, because we need all three classes of data in here. And we're working on documentation improvements; by the way, we just pushed a whole new series of them. We're pushing to get more and more functionality in, and if you're using OpenTelemetry, we'd love to hear from you about how you're using it. And finally, even though getting started is pretty easy, we want to make it really, really easy for you to get started with OpenTelemetry and start seeing the value of what's going on inside it.

So, why are you even doing this? We talked about the problems you're trying to solve, and I want to bring this back a little. Imagine you're getting paged for an issue. When that page comes in, you go through a series of steps. What questions do you need to ask yourself about what's going on? How do you filter out the noise, so that the data you're looking at actually applies to the situation? How can you determine what isn't causing the issue, because removing pieces is quite often the fastest way to find an underlying cause? And how do you determine impact? Again, we talked a little about how our instant-gratification society means people respond rapidly; they react rapidly to things that are too slow. So keep in mind how you determine impact, so you can get to responding and resolving issues as fast as possible. You may not want to trace everything, and you may not want to span everything, but you do want to end up covering what you might call the necessary services. Which goes back to: why do you want to trace these things? Are you tracing for user happiness? Are you tracing to determine underlying problems? Are you tracing to make sure communications go right?
The easiest place to start is with the service boundaries: not the services themselves, but the endpoints between service calls, as well as calls to third parties, the inferred services you're calling. So you may not be able to instrument your database, but OpenTelemetry can see that you called the database and measure that call. And this is an iterative process; you don't have to do everything immediately. That's the advantage, again, of using the libraries and auto-instrumentation: you can build this up over time to make the use of it you want. Keep in mind that this is a lot of information. OpenTelemetry data is only as good as our ability to monitor, analyze, and respond to it, so it's possible to give yourself too much information. Watch that you don't go into an overload condition, where you spend all your time trying to break out the noise or eliminate what's going wrong. And you want the teams that are going to make use of this to get it as close to free as you can. Looking at tracing should not mean going to spend three weeks in a class to learn what tracing is about. Tracing should be pretty intuitive from that viewpoint, as close to just something that's there as possible. Once you've set up the environment, you should be able to get tracing into the hands of the right people, they can make use of it, and they can quickly understand its value.

So, my next steps are very simple. I'd love for you to go check out more about OpenTelemetry at opentelemetry.io; please take a look at the documentation pages. We also have a Gitter that is very focused on OpenTelemetry, with a lot of subgroups; the community is a great place to start. Finally, there are special interest groups, a ton of them, and you can find a list on GitHub under opentelemetry as well. And feel free to submit a PR: there's a "good first issue" label if you want to get started, or "help wanted" if you want to join in on a project. With that, I'd like to thank you for your time today, and I'm turning it back over. Thanks again.

Thank you, Dave. Does anyone have any questions they want to pop in the Q&A? We have a little time left over, so fire away. You must have answered all their questions, Dave. Do you have anything else you want to add?

I just really strongly encourage everybody to at least go take a look at this. This is probably the biggest single breakthrough in observability, unifying those data streams together. We'd love to have you come and join us; I'd love to see you there, and in fact, I'll look for you on Gitter.

All right. Well, thanks so much for a great presentation. Since we have no questions, we can go ahead and wrap. Thank you everyone for joining us today. The webinar recording and slides will be online later today, and we're looking forward to seeing you at a future CNCF webinar. Have a good one and thanks so much. Thank you. Bye-bye. Bye.