Hello everybody, welcome to episode 2 of Cloud Native Classroom here on Cloud Native TV. My name is Kat Cosgrove, and before we get started, I have to warn you that this is an official CNCF livestream event, which means that the CNCF code of conduct is in force here. So be nice, don't say anything crude, and just generally behave. We are watching the Twitch chat, though, so if you have questions for me or for my guest, or just in general about whatever, let it rip. We will see them, and we will be happy to answer you if we can. I am joined today by Ted, one of the founders of OpenTelemetry, who's here to help me understand what observability means, and also how OpenTelemetry can help. How are you doing, Ted?

Doing great. Nice to meet you, Kat. Great to be on the show.

Glad to have you. It's a big help, because there are so many things in the CNCF sandbox. I'm a CNCF ambassador, and I can't keep track of them all. I don't know if you're doing any better as somebody who actually supports one of these things. But what does observability actually mean in a Kubernetes context? What problem are we solving when we talk about all of these different observability tools?

Yeah, so observability means monitoring, despite all the hype. That part hasn't changed. When you're running your system, you're going to have problems, and then you're going to have to fix those problems. Observability is the piece between hearing that there's a problem, investigating it, and going and solving it. So it's the hearing-about-the-problem and investigating-it part. In order to do that, you need some kind of signal coming out of your system so you can see what it's doing. And I say it hasn't really changed because running your system hasn't really changed; there are the same bugs there ever were. In that sense, observability is not new, but there is some new tooling available, and that's what OpenTelemetry provides.

You can think of an observability system as two parts: generating the data and sending it somewhere, and then analyzing it. The first part is telemetry. Look up the Webster definition of telemetry: it's taking signals and observations from a remote object and sending them somewhere you can analyze them. So OpenTelemetry is just generating that data and sending it somewhere. It's not analyzing the data. Analyzing the data is sold separately.

And that is actually a new thing. In the past, someone would make a tool, whether it's a closed source thing like AppDynamics or New Relic, or an open source thing like Prometheus. You make your analysis tool, and then you need to offer people instrumentation packages so they can go out and generate metrics and logs and things. So it's a unified stack: you generate the data to send it to the thing that analyzes the data. And that creates a lot of vendor lock-in in a way that's really pernicious. Even with open source stuff, it's not just a vendor-and-money issue. Instrumenting your system is what we call a cross-cutting concern. In other words, you take all of those little log API calls and you sprinkle them everywhere, so you end up with approximately a hajillion log and metric calls all over your system. And then if you want to use a different tool to analyze that data, you have to go re-instrument all of that stuff. So that's one of the core problems we were looking at.
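To make that cross-cutting concern concrete, here is a minimal Go sketch of the alternative OpenTelemetry offers. The module paths and API calls are the real otel-go ones, but the service, span, and attribute names are invented for illustration. The point: application code talks only to the vendor-neutral API, and the choice of backend lives in a few lines of setup, so switching analysis tools doesn't mean re-instrumenting your whole system.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// handleCheckout is instrumented against the OpenTelemetry API only.
// It has no idea which vendor or backend will receive this data.
// (The span and attribute names here are made up for the example.)
func handleCheckout(ctx context.Context) {
	_, span := otel.Tracer("shop").Start(ctx, "handleCheckout")
	defer span.End()
	span.SetAttributes(attribute.String("cart.id", "abc-123"))
}

func main() {
	// The backend choice is isolated here: swap stdouttrace for an
	// OTLP exporter pointed at any vendor, and none of the
	// instrumentation sprinkled through the codebase has to change.
	exp, err := stdouttrace.New()
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)

	handleCheckout(context.Background())
}
```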
And then the other issue is the sort of siloed approach we've had, because every tool analyzed one type of data. Maybe it's a metrics tool that makes metrics dashboards. Maybe it's a logging tool that stores your logs and lets you search through them. Maybe it's some kind of profiling tool. Everyone makes their tool and then the instrumentation for it. And a side effect is that all of this data is siloed, which is not good, because the reality is we don't use these tools separately.

When you're trying to observe a system, you have this cycle you go through, starting with an alert on some metric. Here, can we share my screen? I've actually got some slides about this. So go full screen on me. There we go.

The way we really tend to do this is: some metric goes squiggly, and you get an alert. You want to know why it's squiggly. So you look at your dashboard full of metrics, you kind of squint at them, and you try to figure out which other metrics went squiggly at the same time. In the past, I have literally taken a ruler or a piece of paper, lined it up on the dashboard, and asked: what other metrics went squiggly at the same time? That's why I keep phrasing it this way; let's not be highfalutin about this. We're just going: this went squiggly, this other thing went squiggly, those seem to correlate. What might that mean?

And you start thinking, well, it might mean this or that. So you start going through your logs to see what the transactions were, the chain of events that may have caused this problem. You might start looking at your configuration files: is something misconfigured somewhere? Are these Kafka nodes configured differently from those other ones? So you're trying to take all these different data sources, configuration data, resource data about all the different machines you're running, logging data, aggregate data like metrics, and you're trying to find correlations between all these different kinds of data. Once you've started to find some correlations, you start to build a guess about what the problem might be. At that point, you can try to verify whether your guess is correct. Hopefully it is, and then you go roll out a fix. What's difficult about this process is that people tend to spend a lot of time trying to find those correlations.

Yeah. So I mean, that's a really labor intensive process.

It's really labor intensive. Just finding the logs, for example. Literally: where are they? Where are the logs? When you've got a hundred machines and they're all handling thousands of requests at the same time, your logs are just this blizzard of stuff. And even if you have them in a system that can index them, what index are you actually going to use to find just the logs in that one transaction?

Yeah, that's not really helpful for a human.

We do it, but we end up spending a lot of time when we're observing these systems just trying to find the data and collect it. It's one of those pain points that you get used to, and you don't realize it's unnecessary. This is sort of like when good code formatting tools started to show up. I feel like the Go programming language really kicked this into high gear. You just get used to the idea of having your code formatted for you.
You don't think about it. And then you go back to some setup where you have to do it yourself, and suddenly you're like, why am I pressing Tab all the time? This is terrible; I don't want to do this.

So what you're getting out of OpenTelemetry is those kinds of correlations and indexes across these different types of signals, so that you can feed it all into one tool that can cross-index all this stuff. And if you have one tool that can do that, and the data is actually structured into a graph, properly structured data with a lot of semantic meaning, so you can see this is an HTTP client request and this is a Kafka queue and all of that, then you can start applying machine analysis to that data, which means the machines can start finding these correlations for you. So you can look at some metrics and say: okay, I see the spike here in my metric. What are example traces associated with this spike? Rather than trying to guess and go figure it out, just show me the actual transactions that were generating this metric. Having all these different data types connected into a proper graph lets you do that kind of automated analysis.

People are going to try to sell this as AIOps, like it's just going to think for you, and that's not true. It's not going to do that very well. But it will be able to automate a lot of this digging around that currently has to go through a human brain. And once you get that off of your plate, it's really liberating, because you can start testing your hypotheses very quickly. You don't have to think: well, if I want to check that, I've got to go dig around a bunch, and it's going to take me 15 minutes to get all that data together and then grep through it. So you get a little cautious about where you want to place your bets. If you can just click through really quickly, then you're spending much more of your time actually analyzing the data, making guesses, and verifying them. So that's one of the big value propositions I think OpenTelemetry is bringing.

And by doing that in an open source way, we're essentially trying to create a standard by getting all of the big players on board to agree that they're all going to generate and consume this data. And we're doing it in a way that's stable and neutral enough, with the right kind of dependency chain stuff, which I can dig into, that we've made it potentially consumable for open source libraries as well. So if you have a library that's going to get shared across a bunch of different systems, like a web framework or a database client, you can actually instrument that library yourself with OpenTelemetry. And then when it plugs into an application alongside other libraries using OpenTelemetry, with the application itself using OpenTelemetry, they all automatically start talking to each other.

That's actually really rad, because I'm deeply, incredibly lazy. I'm the flavor of super lazy engineer where I will spend a bunch of extra time at the beginning of a project to wire up things that enable me to do nothing later on. Or at least do less busy work, fewer boring repetitive things. So this is appealing to the lazy part of my brain in a pretty big way.

Yeah, I think that aspect is going to be really helpful.
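To picture that library-level instrumentation, here is a sketch using a made-up key-value client. The package, type, and function names are invented; the otel calls are the real otel-go API. Because the library depends only on the OpenTelemetry API, these calls are near-zero-cost no-ops if the host application never installs an SDK, and if it does, the library's spans automatically join the application's traces.

```go
// Package kvclient is a hypothetical shared library that instruments
// itself with the OpenTelemetry API (no SDK dependency).
package kvclient

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// The tracer comes from whatever provider the host application has
// registered; with no SDK installed, it's a no-op.
var tracer = otel.Tracer("example.com/kvclient")

type Client struct{ addr string }

func New(addr string) *Client { return &Client{addr: addr} }

// Get wraps the real lookup in a span. When the application and its
// other libraries also use OpenTelemetry, this span automatically
// becomes a child of whatever operation called it, via ctx.
func (c *Client) Get(ctx context.Context, key string) (string, error) {
	ctx, span := tracer.Start(ctx, "kvclient.Get")
	defer span.End()
	span.SetAttributes(
		attribute.String("db.system", "kv"),
		attribute.String("db.operation", "get"),
	)

	val, err := c.lookup(ctx, key)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	}
	return val, err
}

// lookup stands in for the actual network call.
func (c *Client) lookup(ctx context.Context, key string) (string, error) {
	return "value-for-" + key, nil
}
```

Since the API module promises backward compatibility, a library can take that dependency without dragging an SDK or a vendor into its users' builds.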
And the whole thing is built on top of what's called distributed tracing, which used to be seen as a niche tool. But basically, distributed tracing is a context that follows your code as it's executing. I feel like this is CNCF land, so there are probably a number of Go developers here. In Go, there's this explicit context that you just have to hand around like a jerk. I'm not a huge fan of the fact that you have to hand it around by hand.

I don't disagree with you. I don't write Go personally. Like, I know it, but it's not my language.

I actually have a YouTube video that's a rant about context, which really digs into that particular issue. It is a bummer that you have to pass it around by hand, but it is great that there is a canonical context object, and everyone has agreed: this is where you put your stuff. That allows OpenTelemetry constructs that have to follow your code to just go into the context object. And those objects, which we call spans, are what generate this graph. They have what's called a trace ID. I could probably just draw this. Let's see here. Let's do some drawing.

While Ted's pulling up the drawing app, just a reminder to click the follow button so that you can use the chat and ask us questions. Please do ask us questions. Even if they're not related to OpenTelemetry, if you have questions about Ted's virtual background or my hair, that's fine too.

This is a real background. This is my living room.

Oh my God, really?

See, look, it's real.

Your living room is incredible. Oh, thanks. Wow. It's so perfect, I assumed that it was a virtual background. That's wild. Good job.

Yeah, I just camp out in here all the time.

You've designed it so well.

I had to make it pretty.

Yeah, very pretty. Anyway, follow us on Twitch so you can talk to us and also so that you know when we're live next.

Let's do some drawing. Yeah, okay. So distributed tracing is basically: let's say you've got two services here, and there are some operations occurring. You have Operation A, which calls Operation B, which calls Operation C. And then that makes a little network request to this other service, which then has some more operations, and maybe it makes some more requests to other things, and so on and so forth. So you've got this chain of services, and you've got the user coming in here, clicking buy or whatever it is that's kicking all of this off.

In all of these operations, you have events, which are basically like logs. So you have all these little events happening. Boop, boop, boop. Request started, request finished, all of that kind of stuff. What you have with OpenTelemetry is this concept called a span, which says all of these events happened within one operation. So let's call that a span, and a span has an ID. And all of these spans are connected to each other, where each span has a parent, so you have a parent ID. Sorry, my handwriting is terrible.

No, it's fine.

So that's the basis of your graph: each thing's got an ID, and it's got a parent. Basic graph. And then this whole overall graph has an ID for the entire transaction, which is called your trace ID. Okay.

Somebody is asking what you're using to draw, by the way.

Oh, I'm drawing in Photoshop right now. I like to draw in Procreate on the iPad; that's my favorite thing. But if I'm on desktop, they pay for the Adobe thing, so I use it.

Right. Yeah, of course.
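Here is roughly what that drawing looks like in code, as a sketch with invented operation names (the otel-go calls themselves are real). Each Start creates a span whose parent is whatever span is already in the context, and the whole chain shares one trace ID. For the network hop between services, those IDs get serialized into HTTP headers (the W3C traceparent format) so the next service can keep building the same graph.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("example.com/demo")

// operationA starts the trace: its span has no parent, and a new
// trace ID is generated for the whole transaction.
func operationA(ctx context.Context) {
	ctx, span := tracer.Start(ctx, "operationA")
	defer span.End()
	operationB(ctx) // the context carries span A's IDs along
}

// operationB's span automatically gets A as its parent, because A's
// span is in the context it was handed.
func operationB(ctx context.Context) {
	ctx, span := tracer.Start(ctx, "operationB")
	defer span.End()
	operationC(ctx)
}

// operationC is where the network request would happen; the trace ID
// and span ID get written into the outgoing request's headers so the
// next service's spans join the same trace.
func operationC(ctx context.Context) {
	_, span := tracer.Start(ctx, "operationC")
	defer span.End()
}

func main() {
	operationA(context.Background())
}
```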
So yeah: you've got your span ID, your parent ID, and then your trace ID.

Got it.

And this blob of data can then be associated with every log, every metric, and anything else you might end up generating. On top of that, you have what we call attributes. Span attributes are things like the operation, the duration of the operation, the start time, and then a bunch of indexes that you might want to have on all of the different events. For example, if you have an HTTP request, there are things like: what's the method? What's the URL? What's the status code that was returned? Did this operation succeed or did it fail? Those kinds of attributes are collectively applied to all of the events that occur within that particular operation. And then the logs themselves also have attributes, so it's attributes all the way down. Likewise, the metrics have attributes all the way down.

The point is, once you get tracing set up, anytime you emit a log or a metric, it automatically gets these IDs stapled onto it. And that allows you, if you find one of these things, for example the log that says this thing blew up, like a kaboom here, oh no, to ask: okay, I've got this event, but what happened here? What kicked this off? I may not have all the data I need here; I may want to know something that occurred somewhere else. For example, there might be some correlation happening: this blew up, and you notice that every time it blew up, the event has project ID five. Being able to quickly see that all the errors are coming from one project would tell you a lot.

Immediately, yeah. That there is a problem specifically here.

Yeah. Or latency has gone through the roof, your Kafka queue is backing up, and you notice all of that delay is coming from Kafka node six. That kind of stuff is going to really rapidly help.

And I should mention, in addition to these spans, which are the transaction context, what we call trace context, we also have all of this stuff over here, which are called resources. Resources are things like service name, Kubernetes info, cloud info. All of your resources, and also config: you can put all of your configuration information into this stuff. So you're able to cross-index not just on the transactions, which are kind of "every time this runs, what happens," but also on what services and what resources each transaction was associated with. Having all of that data together lets you move around a lot.

And this includes metrics, right? If you generate a metric somewhere in here, that metric is automatically associated with the machine it was generated from. And every time you count that metric, the count gets associated with the transaction that caused it. Those links are what are called exemplars.
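Here are those pieces together in one hedged Go sketch. The service, pod, and metric names are made up, and whether exemplars actually get recorded depends on the SDK's exemplar support; the API calls are the real otel-go ones. Attributes index the operation, events are the log-like records that inherit the trace context, resources describe the thing producing the data, and counting a metric with the active context is what links the count to the transaction.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

var (
	tracer = otel.Tracer("example.com/checkout")
	meter  = otel.Meter("example.com/checkout")

	// Instruments are created once and reused on every call.
	orderCounter, _ = meter.Int64Counter("orders.processed")
)

// setup shows where resources live: attached once to the SDK, they
// describe the entity producing the data (service, pod, cloud, config).
// Exporter wiring is omitted here; it works as in the earlier sketch.
func setup(ctx context.Context) *sdktrace.TracerProvider {
	res, _ := resource.New(ctx, resource.WithAttributes(
		attribute.String("service.name", "checkout"),
		attribute.String("k8s.pod.name", "checkout-5d9f"), // made-up pod name
	))
	tp := sdktrace.NewTracerProvider(sdktrace.WithResource(res))
	otel.SetTracerProvider(tp)
	return tp
}

func processOrder(ctx context.Context, projectID int) {
	// Span attributes: indexes collectively applied to everything
	// that happens inside this operation.
	ctx, span := tracer.Start(ctx, "processOrder")
	defer span.End()
	span.SetAttributes(
		attribute.Int("project.id", projectID),
		attribute.String("http.method", "POST"),
	)

	// Events are the log-like records; they inherit the span's trace
	// ID, span ID, and attributes, so "show me the logs for this one
	// transaction" becomes a single trace-ID lookup.
	span.AddEvent("request started")
	span.AddEvent("kaboom", trace.WithAttributes(
		attribute.String("error.kind", "timeout")))

	// Counting a metric with the active context is what can attach
	// the current trace and span IDs to the count as an exemplar,
	// when the SDK you're running supports exemplars.
	orderCounter.Add(ctx, 1,
		metric.WithAttributes(attribute.Int("project.id", projectID)))
}

func main() {
	ctx := context.Background()
	tp := setup(ctx)
	defer tp.Shutdown(ctx)
	processOrder(ctx, 5)
}
```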
And exemplars allow you, if you have a tool that supports this, to bounce back and forth between looking at your metrics and looking at the transactions that caused them.

Wow, that's cool.

So yeah, that's OpenTelemetry in a nutshell. That's the value prop. The other value prop, like I said, comes from doing this as an open source standard. And by standard, I mean we're convincing everyone to use it, and we're developing it in a manner that's really focused on long-term stability. We're never going to ship a 2.0 of any of our stable interfaces once they become stable. Microsoft is talking about baking OpenTelemetry into Word and Windows and things like that, so you're talking about software that has a shelf life of something like 20 years. That's the time scale of stability and support we're thinking about.

And that's what's going to allow open source software to say: well, you know what, I could instrument myself, rather than having instrumentation come as a plugin that hooks in, which is how it currently works. You can say: I'm going to instrument my database client or my web framework myself, and then I'm going to ship a playbook to my users, letting them know I've provided all these configuration options to tune and all this observability data to watch. The playbook says: when you see these kinds of squigglies, you should tune these knobs. Right now, playbooks are something SREs just make for themselves. But my hope is that in the future, the people who write the software will be able to hand you the playbooks. So yeah, those are the big goals for the OpenTelemetry project.

Those are big goals. But this is something that anybody can just take a crack at themselves, right? If somebody wants to go wire up their personal project with this, they can just let it rip.

Yeah, tracing is stable, so it's totally fine. Look at any client that comes out: once it's hit 1.0, that means tracing is stable. We're working on the metrics API right now, so that'll be stable by the end of the year. And we're also working on logs, but you can essentially log using the tracing system today.

Sick. Yeah. This has been really, really great. This has been really helpful, especially for me. I hope it was helpful for the thirty-ish people watching us, but for me it was useful, because I didn't really understand what OpenTelemetry did before this, which is the whole point of this show. I genuinely do not understand any of the projects I'm inviting on here. That way it's more authentic, and I don't have to pretend to ask a stupid question; I just authentically ask a stupid question. But we are running out of time. So before we go, is there anything you would like to shill for?

Yeah. I would like to shill in general for the OpenTelemetry community. It's a very open community. We hang out on the CNCF Slack, in any channel that starts with otel-. There's a general OpenTelemetry channel where you can say hi, but we work in SIGs, just like Kubernetes, so all the SIGs have a channel, and the SIGs often meet every week on Zoom, so there's a calendar. All of that information is in our GitHub org: there's a repo called community, and that has all the info.

For KubeCon coming up, just a sneak preview:
We're going to try to do a live OpenTelemetry community day. It won't be part of KubeCon, because we don't want to require a KubeCon ticket, and we want to keep the cost low for attendees. We still have to work the details out, but it will be very cheap to attend, and it'll be a one-day unconference. Basically a big community get-together, because we haven't seen each other because of the pandemic. So look out for that; we should hopefully be announcing it soonish. If you're thinking about going to KubeCon, or are just in LA for some reason, you should come by and say hi.

Well, Ted, thank you so much for joining me. I really, really appreciate it. Everybody on Twitch who's still watching: the next show on Cloud Native TV is tomorrow at 1pm Pacific time. It's Field Tested with Kaslin Field. She's walking everybody through deploying a personal blog on Kubernetes, because what do we love? Overengineering simple things. I do, at least. So go see Kaslin tomorrow. I will be back week after next. And Ted, we'll see you on Twitter. Thanks so much. Bye.