Hello and welcome. My name is Raghavan Srinivas, and welcome to this observability panel on cloud native and Kubernetes. We'll talk about some of the observability trends, both from a tactical perspective, how as a developer you can start using them right now, and also from a more strategic perspective of how this is going to look long-term. Obviously, as all of us know, observability is probably as popular as Kubernetes these days, so there's a lot of interest in observability, a lot of activity, and sometimes a lot of overlapping activity as well. But I have a great panel with me who can help clear those things up, so let me go around and introduce the panel. Starting with myself: I work for DataStax, but I also represent InfoQ. I have a bunch of interests, but primarily in the inner loop of development, how you can do this over and over again, and observability is very key to the inner loop, because the more signals you have, the better and faster it is, and that helps developer productivity.

Hi, I'm Liz Fong-Jones. I'm a Principal Developer Advocate at honeycomb.io, which is an observability vendor, and I really enjoy helping site reliability engineers and software engineers become more productive in their daily jobs.

Perfect.

Hi, my name is Bartek Płotka, and I'm a Principal Software Engineer at Red Hat. My role at Red Hat is to really bind all those observability silos that we used to have, logging, metrics, and tracing, into a combined and uniform experience. I'm also interested in programming, in particular in Go; I'm writing a book with O'Reilly about that. And I'm also active in the CNCF space as the CNCF TAG Observability tech lead and a Prometheus and Thanos maintainer.

Perfect. Josh?

Hi, I'm Josh Suereth. I'm working at Google currently, Google Cloud, and I'm responsible for telemetry collection within Google.
I am, at heart, a programming language nerd, so I love Rust, I love Scala, I love esoteric languages and really interesting things, and I read lots of theory. But I really care about developer productivity, and that's why I'm into observability. I'm an OpenTelemetry technical committee member and maintain a lot of observability components, so it's pretty exciting.

Frederic?

Hey, I'm Frederic. I'm the founder of Polar Signals; we do continuous profiling, and maybe we'll talk about that a little bit later. But my interests are anything in observability, really. For the past five or six years I've worked on everything at the intersection of Kubernetes and observability, so a lot of the things that touch both the Prometheus and Kubernetes ecosystems I've probably had my hands on. I'm a Prometheus maintainer, and I'm also the tech lead for Kubernetes SIG Instrumentation. In a past life I was also a security researcher, but these days I usually just observe that from a distance. And anything distributed systems, I'm interested in as well.

Essentially, what we are talking about is forensics in terms of observability. With respect to that, what exactly does forensics mean in the context of observability, and how do you prepare for the unknown unknowns? If you can go around the table and talk about maybe an example, from your previous life or now, that helps clarify what this means, because a lot of people are honestly wondering what this exactly means, observability for forensics. Liz, do you want to go first?
Sure. I think the thing that really blew my mind and opened my eyes to the possibility of what observability could do was when I was a site reliability engineer for one of the Google storage systems. I got introduced to this idea that you could click on an explorer from a metric-type graph and look all the way down into: what is the trace that exemplifies this behavior? How can we understand what happened under the hood? And this idea of not needing to write additional code or additional instrumentation to have insight into something that already happened in the past, that really blew my mind and got me started down this journey of thinking about unknown unknowns, thinking about how we answer questions that we didn't anticipate when we originally wrote the code.

Anybody else want to take a stab? Josh?

I can take a stab. For me it's also a Google example. I was maintaining a project on Google Assistant, actually, and we were having issues where every time we demoed to our VP, it was very slow. And we needed to go figure out why specifically that device was very slow. We actually had observability signals in place to identify the subcomponent, in this giant complex system of thousands and thousands of different subsystems, where the particular problem could lie. And then we did some further instrumentation to investigate it. But for me, the interesting thing about observability, and I think a hard balance, is actionable observability. Observability is always tied to you doing something that makes something better, and that's been my focus with observability. And it's really interesting to see that take place. We were able to drop our 99th percentile latency by about half via this exploration, because we found a particular subsystem that had one really bad behavior for a particular class of user, but it was hidden in the statistics.

That's right. Maybe you have some thoughts on this?
Yeah, I agree with a lot of what's already been said. And there's a saying in the Prometheus ecosystem that I think doesn't just apply to metrics: instrument first, ask questions later. All of the examples that we've heard so far apply to this. This data didn't come out of nowhere; someone did go ahead and instrument this. But the really interesting thing we could then do is, in various situations, pull up data that we didn't even anticipate could be important for that situation. And I think this is the powerful thing about observability.

And maybe a plus-one from my side on what Josh and Frederic said around actionable observability. You need to ask yourself, if you want advanced forensics: do you really need that? Do you need to know about everything that is happening in your application? I believe that you can define a set of patterns and good methodologies, like the golden signals or the RED method, where every component has that basic instrumentation, either instrumented in application code or using tools like eBPF or some auto-instrumentation from outside. But at the end, you have consistent statistics from each component that you can then reuse and rely on. It doesn't mean you need to know about every single function call, all of the time, in the past. So I would say there's always a place for balance here.

Perfect. Let me go to the next question. I know that there are a lot of technical efforts happening, but it's not all about technology; we know that. Sometimes it's about culture as well. A couple of questions here again. One is: Kubernetes, with its adoption, is it helping the OpenTelemetry efforts, or is it, not necessarily hurting, but making things more complex?
And the second part is: how does the technology fit into the culture? If you can talk about both of those. Let's start with Josh.

Sure. My answer is going to be that I think Kubernetes is actually doing both. What Kubernetes does is make complex systems easier to create. It simplifies them so that you can actually build a huge distributed platform and more easily manage it. And what that does is put more pressure on observability to answer the complexity of these systems that are now easier to create. So observability is that natural solution of: okay, I have a system that's now more complex, how do I observe it? And Kubernetes is also trying to answer that challenge via various technologies. If you look at the CNCF portfolio of observability solutions, I think you're seeing all the key ones in there. So I would say it's a little bit of both: we're doing more complex things, and we need more complex observability to answer that call.

I think there are a couple of ways that Kubernetes makes it easier. In particular, the adoption of service mesh technologies alongside Kubernetes has really accelerated making distributed tracing available out of the box. I also feel like the easy ability to run agents as sidecars in Kubernetes makes, again, adoption and standardization a lot easier.

Okay. Yeah, I would agree, especially with the standardization aspect that Liz was just mentioning. I think Kubernetes has essentially given us a common language that we can talk in, right? And this is not even just about observability; this is the effect of Kubernetes in general. Kubernetes is the same wherever you implement it. Namespaces are the same, pods are the same. And when you talk about this to people, everyone speaks the same language. So that naturally carries over into observability as well.
And just having that common language allows us to standardize so many things, right? I think this is extremely powerful as well.

Yeah, I don't have very much to add to those words. It's essentially a huge opportunity, from the Kubernetes side, to create abstractions that allow us to standardize metadata and the ways we observe stuff. So that's incredible. But the question of whether Kubernetes is helping standardize observability efforts can also be answered the other way around: it makes things more complex, so it demands more observability, and people have to figure something out. So I think it helps with innovation and community effort in this space, too. So that's great as well.

And I completely get the fact that standards definitely help here. Obviously, OpenTelemetry is the second most popular project in CNCF behind Kubernetes, right? We all know that. But really, OpenTelemetry is so broad, and yet application development is so narrow, right? In other words, suppose I'm a Perl programmer. As strange as it may seem, there was some noise about creating a Perl language working group, right? So how do you address these edge cases? Standardization is great, but unfortunately we live in a world where not everything is standard, right? And what about the non-Kubernetes, language-specific cases: Rust, Scala? Josh, I know that you are a Scala fan, right? How do you personally support those different development platforms? Let's start with Bartek this time, if you have any thoughts.

Sure. Yeah, I think it's a complex problem, and that's why we see a lot of effort from OpenTelemetry to really strive to support so many different languages and build communities around that. And I think this is really, really important. One way is to just maybe reduce the number of languages to be used, right? That would be nice.
But the other part is that, fortunately, we are standardizing the data format. We are standardizing what backends look like, what the query languages look like, what alerting looks like. So we have already abstracted away from instrumentation; instrumentation is the only thing that has to happen in a specific language, usually. So that's already a smaller piece of work. And also, there are lots of tools that allow us to be very implementation-agnostic, like service meshes and eBPF, which is emerging. All of those are opportunities for abstracting that away, making sure there's less work there. So I think we are innovating in this space, too.

Okay. Liz, do you want to go next?

Yeah, sure. I think that we need to support developers in adding instrumentation to their code, right? No matter how good the automatic instrumentation is, there are always going to be properties you may want to filter or break down by that are specific to your application. So it's kind of this balance between easy and out-of-the-box, which service meshes can do, but also, yes, your most common language frameworks should support OpenTelemetry or other open standards out of the box. And you should be able to annotate things. It should be unthinkable to check in code without instrumentation, in the same way that it would be unthinkable to check in code without comments or tests. But that requires all of those languages and frameworks to support the standards.

Josh?

Yeah, to double down on that discussion: there's an aspect of this that needs to target application developers with observability. Instrumentation, really rich instrumentation, is expensive. And we're seeing solutions like eBPF that can make it simpler to get things like traces for somebody who doesn't necessarily own an application, or this out-of-the-box, agent-based approach.
But where I think OpenTelemetry wants to be in five years is targeting the underlying applications, the underlying open source frameworks that people use to develop applications, with built-in signals, with built-in standards of: here's the bare minimum for observability, let's make sure everybody's at that bare minimum. And that really takes targeting developers. The developers are providing the observability for operations, as opposed to starting from the ops side. So I think that's the shift that we're seeing right now. And that's done through stability, stability in the protocols, right?

So Josh, to summarize, are you saying that from an ops perspective, observability is further along than from a developer perspective?

Yeah, I think from a developer standpoint, there's a lot less unification, a lot fewer common standards. If you look at the Prometheus ecosystem, for example, there are a lot of things targeted at ops and at adapting existing applications into Prometheus, as opposed to those applications directly providing, say, a Prometheus component or an OpenTelemetry tracing exporter or something like that, right? I think we're going to start to see applications providing observability as part of their own profile, instead of having to look to the observability solutions for adapters. That's the shift that I'd like to see going forward.

Yeah, this is coupled a lot to this DevOps idea of shifting effort left, right? Of doing this earlier, getting quicker feedback cycles. I'm really curious, though, to hear from Bartek about the Prometheus angle: how has Prometheus been perceived in the past with regard to ops and dev audiences?

Yeah, I was going to ask the same question, but to Frederic. Let me give him a chance here. I know that he has also done Prometheus maintenance, right? And is still maintaining it. So I think Josh basically said that you didn't take care of application developers.
So how do you address that criticism?

I actually think it's kind of the other way around, right? I think Prometheus adoption happened because so many people built exporters, and the in-process instrumentation followed. And I think OpenTelemetry is on a similar path, but starting immediately with the application instrumentation. We've talked about service meshes a couple of times today; I think that's a parallel here, where service meshes do the automatic thing, the parallel to the exporter in Prometheus, I would say, where someone else has done the work for you. It's maybe not perfect, but it gets you pretty far, and it gets adoption going. And then the in-process instrumentation is the thing that actually gives you really powerful insight. So I think this is not necessarily clashing; there are parallels that we can see here.

But I did want to pick up on the earlier point about standardizing the protocols. I think this is actually incredibly powerful for all of those edge cases that you were talking about. For example, there was a really great case from PingCAP, where they noticed that the standard Rust OpenTelemetry library was just not performant enough for them. And they could cut away a bunch of use cases from that library to make it much more performant for their very special case, but they were still speaking the same protocol. And so they were able to tailor very closely to their use case. It's maybe not a general-purpose library, but it's good enough for them. And that's why I'm such a big believer in open standards and the wire protocols being standardized.

Maybe to add some small words on top of that: I think standardization helps, because then developers can build their own client support. But I think the one advantage that Prometheus used to have, or still has, is that the client is so much simpler, because it's just an HTTP endpoint, versus OpenTelemetry.
You need to have much more complexity built into the client, so there's much more work, and that's why it's harder to do. Obviously, there are benefits of that push model, right? But one of the benefits of the pull model is that it's just extremely easy to implement on the client side. So it was easier for the community to build and adopt as well. So there might be some differences here.

Great. The next question that I'm going to ask, which I think we kind of alluded to, is: what do you say to folks who think that emitting more data will help in observability forensics, so to say? Really, absorb more than what you need; it seems to make sense. But is there a point where it's too much data? And I think some of you alluded to this: observability might get very expensive, right? So where do developers draw the line? When is it really too much?

I think there are a couple of ways to solve this. One of which is that sampling can be really, really effective for ensuring that you are, for instance, keeping every error, but keeping only one out of every thousand successes, to get a representative sample of the varying categories of data that you might care about. I think I worry a lot more instead about the amount of duplication: people trying to emit data to a logging platform and a tracing platform and a metrics platform, right? At what point do we say enough is enough? More data is not necessarily helpful; it's better to have fewer, higher-quality signals than to try to get one of everything. And I think the developer experience that we're aiming for is making it so that developers don't have to worry too much about adding instrumentation. It should be a consideration of how we filter it down the line, how we keep it cost-effective.

Okay, I agree. Do you want to go next?
I want to add a plus-one to what Liz said around the tooling side. The developer should not really have to decide about that. They should build the instrumentation and then, through standard tooling, negotiate with the sysadmins how much data we want to get, at a particular moment, from particular services. Ideally, everything is dynamic. I wish I could just enable more sampling on the tracing side, just for one hour, when I want to debug right now, right? So you have lots of flexibility using this kind of tooling. That's the answer on the developer part. And I really love the idea of bringing together the logging, metrics, and tracing signals. All of this is a little bit duplicated data, so if we could reduce it, and maybe decide to use either logging or just tracing, then instead of three signals you have only two. That's already significant, and some mitigation of this overall problem of too much data.

I'd like to echo that with what we see in OpenTelemetry instrumentation right now: there's a goal to have an instrumentation API that will extract all of traces, logs, and metrics with a single hook, if you will. But the other problem that we have is how many attributes, and how much data, to attach to the telemetry that you're generating, and how configurable that needs to be, right? To some extent, because instrumentation is expensive to write initially, it should have everything instrumented and everything possible to extract. But you don't want to pay that cost all the time; you only want to turn it on when you need it, because if we emitted everything, it would be way too expensive. So I think what you'll see going forward is exactly what Bartek's talking about: there'll be a baseline that's good enough for a lot of your standard questions, and you'll be able to go in and richly configure more and more data coming out going forward.
I think the synergy of feature flagging and observability is really, really powerful, right? To be able to configure your debug levels, almost, whether it be your logging debug levels, or on the metrics side adjusting the frequency of collection, or on the tracing side saying: do you really need every span from your HTTP client saying, hey, I issued a DNS request, right? We're not talking about configuring on and off things like developer-supplied application key-value pairs; those are going to be important regardless. It's the finer-grained things: do you really need a span for everything?

Okay, Frederic, any supporting thoughts?

I don't think I have much more to add, but I do think this is a really difficult thing, and I don't think there is a clear answer, because even if we had tools like we just described, where we can switch things on and off, we don't necessarily have that for the past, right? And that's what we talked about observability being valuable for earlier. So it's almost a paradox, right? I think it's a really difficult thing to make the right decision on. I would say there's also definitely a progression of people and their observability data. I maintain a project called kube-prometheus, where out of the box there's a very large amount of data in your Prometheus, collected from Kubernetes components, and frankly, most users that are just starting out don't understand much of this data, right? As they use it more, they start to figure out which data is useful to them. So sometimes you need to be a little bit pragmatic: do I even understand all of this data? If not, maybe it's not all that useful for you right now. And then, as you learn more, you may add more data. I think this can be useful as well.

Perfect.
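The runtime knob the panel describes, turning collection up for an hour during an incident without a redeploy, can be sketched with nothing more than an atomic rate that a feature-flagging system flips. All names here are illustrative, and a real system would hang this off a flag provider rather than a global variable:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// traceRate is the kind of knob a feature-flagging system could flip at
// runtime: 0 drops everything, 1 keeps everything, 1000 keeps one in a
// thousand requests. No redeploy needed to turn debugging up or down.
var traceRate atomic.Int64

// shouldTrace is deterministic per request ID, so every span belonging
// to the same request makes the same keep-or-drop decision.
func shouldTrace(requestID uint64) bool {
	rate := traceRate.Load()
	if rate <= 0 {
		return false
	}
	return requestID%uint64(rate) == 0
}

func main() {
	traceRate.Store(1000) // normal operation: keep 0.1% of requests
	fmt.Println(shouldTrace(42), shouldTrace(2000))

	traceRate.Store(1) // incident: flip the flag and capture everything
	fmt.Println(shouldTrace(42), shouldTrace(2000))
}
```

The same pattern applies to logging verbosity or metric collection frequency: the instrumentation is always present, and the flag only decides how much of it is paid for at any given moment.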
I had a couple more questions, but I think we are hitting the end, so I will combine those two, although it's probably not a good idea to combine questions; I'm going to do it anyway, right? For somebody who's starting new on observability, how do you get them to crawl, walk, run? And the second part is: what's the most difficult unsolved problem in observability that needs to be solved in the next five years? Or if you don't want to look five years down the road, maybe two years down the road, right? I will go around, starting with Frederic, if you don't mind.

Yeah. So I think the biggest challenge I see for us going forward is mostly a cultural one. First of all, observability is definitely still growing, and a lot of companies still aren't there; even when I talk to people in the community, some companies still don't do any monitoring, right? It's pretty crazy. We're all in a bubble where we think everybody is an expert in all of these things, and I think there's still a lot more education of the market that needs to be done. And I think there's also this culture change that needs to happen in the entire community, where, yes, logs, metrics, and traces are really useful signals, but, and this is partly why I founded Polar Signals, I think there's so much more data out there that we may be missing out on, data that can shed light on different nuances of our systems, right? Logs, metrics, and traces have certainly been extremely valuable, but I think it's primarily because humans like the number three that we settled on those, right? There's so much more useful data out there that can help us understand running systems.

Josh, do you want to go next?

Yeah, sure. I think that you need to start where you are, and I think there's a lot of good adaptability. The space is pretty complex right now; there are a lot of decisions you have to make. So, am I going to do metrics?
Am I going to do logs? Am I going to do traces? Maybe pick one. My recommendation is actually logs and metrics, or logs and traces, specifically because you can monitor SLOs, understand when systems are going slow, and then dive into root cause; that's a decent start, right? And then from there, expand. If you talk about crawl, walk, run: crawling is just, can I deal with a system that goes down? And running is, can I evaluate whether or not a feature I just released improved my business, right? To go from A to B is a journey, and you shouldn't expect to be able to answer B just because you have observability. You need to actually go through all the steps, from doing ops-based, reactionary actions to getting to the business-based ones: I can actually understand the features that I released, and how they impact my bottom line, with the same kind of instrumentation.

Great. Bartek?

Yeah, from my side, I definitely agree with Josh and Frederic, but around the unsolved problems, I see big data being a problem, and the distribution of this data. We are at the KubeCon conference now, talking about clusters, but already we see so much variety in use cases, where clusters are small and running in IoT devices around the globe, under unreliable networks, or even in robots or microwaves and such. And I wonder if one of the unsolved problems is to make sure that for those things, for things outside of the cloud bubble, I would say, we have observability tools that also work. We're talking about large amounts of data, different collection pipelines, and, who knows, maybe different signals, but being able to use similar systems and similar methodologies, that everyone knows and that work, across that variety of cases. With the amount of data that we collect every day, this requires lots of innovation and work.

Liz, I'll give you the last word.
Yeah, I think actually the thing with regard to crawling that many people are not doing yet is shortening their release cycles. It's really challenging to get good observability if you're not able to get the instrumentation that you need deployed into production alongside your code. If it's taking months for you to release new software, you need to focus on that first, and not necessarily on observability, would be my advice. With regard to the future challenges, I think the main thing we're grappling with right now is this proliferation of definitions of observability, kind of observability-washing. And I really worry that the word observability is going to be confused in the market in the same way that the word DevOps gets applied to everything. I think that will really hold people back from achieving these best practices around being able to debug unknown unknowns. So I think that's what we need to focus on as an industry over the next couple of years: really standardizing. What does it mean? How do we do it?

Thank you, Liz. I think that's a wrap. I really want to thank you all for listening to this panel, and I really want to thank all the panelists for their esteemed opinions. I know this is a space that's going to grow quite a bit. Again, thanks for listening, and have a good rest of the conference.