Okay. Well, hey, Martin, thanks so much for joining me on this fireside chat. I'd love a quick intro for all the folks joining us at FluentCon: who you are, what Chronosphere is, just a little bit about yourself.

Yeah, for sure. Thanks for having me, Anurag, really excited to be here today chatting with you. A little bit about myself: my name is Martin, and I'm currently the CEO and co-founder of Chronosphere. We provide a hosted monitoring solution to companies adopting cloud native. We help these companies monitor their infrastructure, which is primarily Kubernetes these days; monitor their applications, which are generally microservices-oriented these days; and monitor their business as well, in real time. The core technology of our product came out of Uber, along with a lot of the open source observability projects that came out of Uber, and that's where I spent four years of my career before founding Chronosphere. I led a core part of the observability team there, where we created projects such as M3, a distributed and scalable metric storage engine that is compatible with Prometheus as long-term storage for metrics. We created Jaeger, the CNCF-graduated distributed tracing project. We actually completed the trifecta and created a logging platform internally as well, but unfortunately we never open sourced it. So I've spent a lot of time in my career in the observability space, both solving problems for Uber directly with these solutions and for the broader community via the open source channels. That's a little bit about myself, and again, really excited to be here today and looking forward to our chat.

Awesome. And Jaeger sits alongside Fluentd as a graduated CNCF project. So, you know, awesome to hear about that whole observability journey, and Uber, what a small app, right? As probably one of the leading folks in the observability space, with Jaeger and all these other projects, I'm curious: what should companies, or really what should users, actually be thinking about when solving for observability?

Yeah, it's a great question. And I love the fact that you put users in there, because that probably is what the focus should be: what should the users be thinking about? Even who the users of observability are has been changing fairly rapidly, I'd say. Historically, the users or practitioners of observability and monitoring were isolated to the SRE department, or perhaps a core infrastructure team. However, if you think about modern development, an application developer not only has to write and develop their application, they also have to test it, they have to deploy it, they have to monitor it in production, and they have to remediate issues when it goes wrong. So for us, observability is more than just a practice; it's a cultural mindset, much like DevOps is. It really is something we're seeing all developers embrace and adopt, more and more. So the end users of observability are really all the developers out there. And if you look at it from that perspective, they're trying to optimize for one outcome, which is to know when something is wrong and to remediate that issue as quickly as possible.
And ideally before end customers find out, whether they're external customers or perhaps other engineering teams. If you're optimizing for that outcome of remediating issues in your applications, there are really three questions we're trying to answer here. The first is: can I get notified, and how quickly, when something goes wrong? Because if you don't even know when something goes wrong, or your customers find out before you do, that's really not a great place to be. The second is triage: once you do get notified that something is wrong, figuring out what the impact is. Is it impacting all of my customers or just a subset? Is it one cluster or another? If you get woken up in the middle of the night, is this something I have to deal with now, or can it wait until the morning? So triaging the issue and knowing the impact is a fairly important question to answer. And the third step, or phase, or question is: can I root cause this issue? Can I find the underlying root cause and provide a fix for it? Those are probably the three steps or phases developers go through in achieving their outcome, which is to remediate the issue as quickly as possible.

For myself, and for the team here at Chronosphere, that's what we think about when we think about observability, and that's what we think end users should focus on. We do hear a lot of definitions out there that are concentrated around the data types, the metrics, traces, and logs, the three pillars, so to speak. Those data types are definitely important, but they are just the types of data we need to answer the questions and arrive at our outcome. The data types by themselves don't give you observability, or better observability. Just ticking the three of them off and saying, hey, I have logs, I have metrics, I have traces, doesn't necessarily mean you have observability, let alone great observability, and producing more of each of those data types doesn't lead to greater observability either. So we think an outcome-based approach for the end user, the developer, is a better way to think about observability as a whole than a data-based approach.

Yeah, that makes a ton of sense. I think everyone's getting hammered with these three pillars: logs, metrics, traces, you've got to checkbox all of them. And sometimes we just forget about the user in those cases. You mentioned the three steps or phases. Are there things you'd recommend to users going about meeting those, or maybe you can clarify those pieces a little bit?

Yeah, I'd love to double-click on that for sure. When I mention the three phases, they are phases or steps in the sense that they are generally sequential; you have to do one before the other. You do have to get notified that something is wrong before you can go and fix the issue. But the best way to think about these three steps is that you're still optimizing for the outcome, which is remediation, and it's not necessary to go through all three phases to remediate an issue. If you're mid-deploy on your service and you get notified that something is wrong, the first course of action is probably to roll back that deploy instantly, right?
And that could be your way of remediating the issue. You don't know the root cause, but you've remediated the issue and avoided customer impact, and that's really what you're trying to optimize for. So perhaps at the notification phase you can remediate instantly, and there's no need to triage or root cause the issue right then; you can do that after the fact. And I think that's important as well: you don't really want to be doing root cause analysis live during the incident, under the pressure of knowing that the business is down or impacted. I'd also say that issues are generally introduced when we change a system. When we leave a system alone, there generally aren't as many issues; the highest percentage of issues are caused by introducing change to a particular system. So in those cases, when you get notified, the step from notification to remediation is fairly quick.

The second phase is triage. There are definitely a bunch of situations where just being notified isn't enough, because you're not actively introducing a change to the system. So you do want to triage the issue, in the sense of knowing what is impacted: is the impact to a subset of my customer base, or all of it? Can I isolate the issue down to one cluster, or one availability zone, or one region? That tells you how bad the issue is and how much urgency you need to put into resolving it. But often we find that at that step you can also remediate fairly easily, without root cause analysis. Most of our modern architectures are spread across multiple clusters, multiple availability zones, multiple regions, so a quick path to remediation is to route your requests around the impacted zones or clusters. If you know the issue is isolated to cluster A or zone A, route requests away from it, such that again you've remediated the issue and your customers are not impacted, and you have time to really figure things out, not under that time pressure.

And then there are occasionally those issues where you can't do either of the first two and you really have to dig in at what the root cause is in production. This is probably not the preferable one; I'm sure developers would prefer not to have to debug things live in production under the time pressure of actual impact to the business. But that does need to happen sometimes, and when it does, you've really got to dig in, figure out what the root cause is, roll out a fix, and remediate the issue that way. So the phases are sequential and dependent on each other, but again, the outcome is remediation as quickly as possible, and from any of the steps you can achieve that outcome. I think that's the best way to think about those three phases.

Yeah, that helps a lot. And going back a little bit to the three pillars and the three data types: are there certain data types that lend themselves to making it easier to remediate? Like metrics or traces, is that going to help you move faster?
Or, you know, how do the data types relate back to this?

Yeah, it's a great question, and I think the short answer is yes, though there's obviously a lot of detail in that. As I mentioned earlier, just having all three checked off doesn't lead to better observability in any way, and just having more of each data type doesn't achieve that either. But if you look at the outcome, which is remediation, and at the phases, there are particular data types that are better suited to particular phases. If you think about notification, it's generally done over an aggregate view of all of your data, or all of your requests: how many requests am I getting per second? How many errors am I getting per second? What's the aggregate latency of those requests? So to help with the notification phase, metric data is generally the more optimal data type. That doesn't mean it's the only one, but if you think about what you're trying to measure there, it's really an aggregate view of numerical data: you're counting things or measuring latency. And metrics are exactly that, numerical values over time. If you think about notification and alerting, you're generally checking those numerical values against a particular threshold. So metrics are perhaps best suited for the notification phase.
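To make that threshold-checking idea concrete, here is a minimal sketch of a Prometheus alerting rule of the kind Martin describes; the metric name, threshold, and labels are illustrative, not from the conversation.

```yaml
groups:
  - name: service-alerts
    rules:
      # Fire when more than 5% of requests have errored over the last
      # five minutes, sustained for five minutes.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```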
Metrics perhaps help with triage a little bit as well. If you think about triage, you're trying to dig one level deeper into the error count or the latency, and having labels or tags on your metric data so you can slice and dice by a particular cluster, AZ, or region can definitely help you there. The actual individual request itself is perhaps not as required for those phases. But as you shift later into the process, into deeper triage and root cause analysis, that does lend itself more to data types like logs and traces, because both carry more information by default. So as that transition happens, logs and traces are perhaps more efficient at solving those particular phases of the problem. All three types exist for a reason and are definitely useful throughout the whole process, but in my mind it's about optimizing for the particular phase.

Now, this doesn't mean that you need to have all three instrumented, right? And I think this is what is great about the announcement earlier today about Fluent Bit extracting metric data from logs: quite often you don't need to go and instrument for all three types of data, but converting from one type to another to optimize for the use case is a great advantage. Again, I'm really happy to see Fluent Bit and Fluentd go down that path and enable those use cases.

Yeah, let's talk a little bit about that. We have a couple of sessions at FluentCon that will talk more about the tracing side of Fluent Bit as well as the metrics side. I'd love to get your thoughts on metrics, logs, how these things all fit together, and Fluent Bit's new announcement there.

Yeah, for sure. So there will be a session later today from Mike at Neiman Marcus, and I don't want to ruin his session by any means, but he was already generating metrics off of logs via a custom plugin. So this was already happening before there was first-class support in Fluent Bit, which I think is great: users were already doing this out of necessity. And if you look at Mike's use case, it really is to extract metric data off of the logs so that he can alert off of it and get faster notifications, because, again, metrics are perhaps a more optimal data type for that phase than logs are. So this is something end users are already leveraging; the need is already there. I think what's great about the announcement, and correct me if I'm wrong here, Anurag, is that the capability of extracting metrics from logs has existed in Fluent Bit for quite a while, but the big announcement today is that it's now a first-class capability and the extracted metrics are exposed in the Prometheus format, and I think that is really great for the industry as a whole.

If you take a step back and look at the monitoring and observability industry, there has been a huge shift towards open source standards. Fluentd is the graduated CNCF project for logging, and the same goes for Prometheus for metrics. And the best part of that isn't just that there is a standardized solution; it's the standardization around the format of the data. If you look at the projects in the CNCF ecosystem, most of them, if not all of them, are already emitting metric data in the Prometheus format. That is hugely advantageous for the industry as a whole, because it means that from an end user perspective, from a developer perspective, you're not locked into one storage solution, whether that's a storage solution you host yourself or a vendor-hosted one. You can instrument in one way, with one protocol, and every solution out there supports it. From the metrics perspective, there are a lot of different backends out there in addition to Prometheus: I mentioned M3, which we open sourced out of Uber, but Cortex and Thanos are there and available as well. And the power of this movement to standards is also seen in all the vendors providing monitoring and metrics solutions, in the sense that they all have to support Prometheus as a protocol now too. All of this is great for the end user and the industry at large, because you're not locked into one technology or one vendor as a backend. So it's great to see that this capability is supported first class in Fluent Bit, and also that the exposition format is the industry standard, which is Prometheus. That's great to see, I would say.
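As a concrete illustration of that logs-to-metrics path, the sketch below assumes a recent Fluent Bit (2.x or later, where the log_to_metrics filter provides this as a built-in feature); the file path, field name, and metric name are illustrative.

```
# Tail application logs, count error-level lines as a Prometheus
# counter, and expose it on a /metrics endpoint.
[INPUT]
    name  tail
    path  /var/log/app/*.log
    tag   app.logs

[FILTER]
    name                log_to_metrics
    match               app.*
    tag                 app.metrics
    metric_mode         counter
    metric_name         error_lines
    metric_description  Count of error-level log lines
    # Only count records whose "level" field matches "error".
    regex               level .*error.*

[OUTPUT]
    name   prometheus_exporter
    match  app.metrics
    host   0.0.0.0
    port   2021
```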
Awesome, awesome. Yeah, I think that's how we were thinking about it from the Fluentd and Fluent Bit side: how do we just conform with the standards? And a big upcoming project in this space is definitely OpenTelemetry. We announced some earlier work today where we said, hey, we're going to have some integrations with the protocols they're building. I'd love to get your take on the approach of that project, on these projects together, and just your take on OpenTelemetry.

Yeah, I'd love to. I assume most folks watching are somewhat familiar with OpenTelemetry, but if not, it is a collection of APIs and SDKs with the goal of standardizing the protocols and the clients that generate all of this observability data. Overall, as a project, I love it, because it's pushing the industry towards more open source standards. The project really started off around having standard client libraries just for distributed trace data, and it expanded over time to include metric data; I believe the natural progression is to expand it even further over time to include log data as well. Down that path, they have similar APIs and SDKs across all of the major programming languages, which standardizes the instrumentation and the production of the data, and I think that's great.
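For readers who haven't seen those SDKs, here is a minimal sketch of instrumenting a trace span with the OpenTelemetry Python SDK; the service name, span name, and attribute are illustrative, and a real deployment would export via OTLP rather than to the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that batches spans and, for this sketch,
# simply prints them.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# Each unit of work becomes a span, with attributes for later triage.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "12345")
```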
As a project, OpenTelemetry has support for tracing right now, metrics support is being actively added, and I think log support may be coming soon, and that's going to be great for the industry moving forward. Perhaps in a year or two, or even sooner, you will see a lot of new applications instrumented with OpenTelemetry from the beginning. And in fact it's not just the protocols: it's a single client for all three types of data, or at least two types right now, with the third perhaps down the line. That's great for new applications moving forward.

But if you look at it from the practical lens of the things we need to monitor today, there is so much existing instrumentation that it's pretty impractical, from a company's perspective, to go back and re-instrument existing custom applications. Sometimes it's impossible, because you're pulling in a dependent library or an upstream project and you don't have control over how those things are instrumented. So while I do think OpenTelemetry is hopefully the future and, eventually, the standard, from a practical perspective I think projects like Fluent Bit and Fluentd are great here, because there is a lot of support for the existing protocols and existing instrumentation that's out there today, and they tackle the problem not from a client perspective but from a processing perspective, outside of the application itself. That is a very different way of solving the problem, and one that's going to be required to handle a transition that will take multiple years. And this design of Fluent Bit and Fluentd, where you're processing outside of the application itself, lends itself to other advantages as well.

One of the companies we work with is called Tecton, and when we talked about their use of Fluent Bit, what they were using it for was to augment the stream of log data coming out of the application with additional metadata about the environment it's running in. They were augmenting it with the cluster and the namespace of the Kubernetes cluster they were running in, and I think that's a hugely powerful thing to be able to do. I believe the exact feature they were using is called the rewrite_tag feature, or something like that, in Fluent Bit. That adds a bunch of fairly powerful additional value, in the sense that you can now standardize the additional metadata you add to the streams, which is always a hard problem to solve. You can imagine that if you ask every developer to emit the environment or the cluster name, who knows which way they're going to do it; there's going to be weird camel casing and all sorts of other inconsistencies in there. So being able to do it in one centralized location is important. And sometimes, from the end user perspective, it's actually really hard for an application developer, while writing and instrumenting their application, to even know, hey, which cluster am I going to be running in, and how do I get that data? It may not even be possible from the application developer's perspective. So this Fluent Bit approach of processing all of the existing streams of data coming out adds additional value there and unlocks a bunch of use cases for sure.
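Here is a minimal sketch of this kind of centralized enrichment, using Fluent Bit's kubernetes and record_modifier filters rather than Tecton's exact rewrite_tag setup, which isn't detailed in the conversation; the cluster name and paths are illustrative.

```
[INPUT]
    name  tail
    path  /var/log/containers/*.log
    tag   kube.*

[FILTER]
    # Enrich each record with Kubernetes metadata: namespace,
    # pod name, labels, and so on.
    name   kubernetes
    match  kube.*

[FILTER]
    # Stamp every record with a cluster name that the application
    # itself may not know at instrumentation time.
    name    record_modifier
    match   kube.*
    record  cluster prod-us-east-1
```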
And as you mentioned, I believe there will be support for all the protocols that OpenTelemetry is crafting and standardizing as well. So it's not an either-or thing; it's a different pattern, and I think both will really help the industry as a whole moving forward.

Awesome. And I think both of us would probably agree that observability has changed significantly in the last three years: new projects, new protocols, new standards. As someone who's at the forefront of this, what do you think the next three years hold? What is the future of observability in Martin Mao's mind?

Yeah, the future is always hard to predict, for sure, but let's definitely have a crack at it. I believe the trend I talked about at the beginning, where every developer adopts this observability mindset, will continue. There will be a huge transfer of both knowledge and skill set from that core SRE team, from the experts in these practices today, to all developers everywhere, and I really hope that transition continues to happen. And hopefully, as that transition happens, there is a focus on the outcome as opposed to the inputs, the data types. So I do see that happening over the next three years: developers optimizing for the outcome, which is remediating as quickly as possible. And if you assume that's the direction things are moving in, there are a few implications, or a few outputs, of that.

One is that you're going to see a lot more of what we're seeing already, where there is conversion between the three data types to optimize for the various phases, because developers are going to be optimizing for those phases. So I think you'll see a lot more of what we've already seen today: transforming between data types to solve a particular phase and to remediate as quickly as possible. But not just that. As part of this shift, I think there is also going to be better context passed between each of the phases. Going from notification to triage to root cause analysis, I think there will be focus, and innovation, on passing more context through each of those. One example I can give: my co-founder Rob Skillington gave a talk a couple of years ago at KubeCon where we showed how you could jump from a metric data point on a dashboard, which you would use for notification and triage, straight into the underlying request in the distributed tracing system, which you would use for root cause analysis. So really trying not to begin your search again as you go through the phases, but to take the effort you put into each phase and use it more effectively in the next one. I do think we'll see more things there.

I also think we'll see better integration between the tiers, and by tiers I mean the infrastructure tier and the application tier. I think you may have alluded to this earlier today: there are some plans on the Fluent Bit side to also collect infrastructure stats, or infrastructure metrics, from the hardware itself, in addition to the streams of, let's say, log messages you're getting from applications. Having those in one place perhaps unlocks a bunch of potential as well. You can imagine that if you know there is a spike in infrastructure CPU or disk usage around a particular time, and you know which applications are emitting data from that particular instance because you're processing it all in a single place, there could be better correlations there, and that could unlock a lot of value as well.
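A minimal sketch of the kind of infrastructure-metrics collection described here, assuming the node_exporter_metrics input that Fluent Bit added in the 1.8 line; the tag, port, and interval are illustrative.

```
[INPUT]
    # Collect host-level metrics (CPU, memory, disk, network) in the
    # same pipeline that already processes application logs.
    name             node_exporter_metrics
    tag              node_metrics
    scrape_interval  2

[OUTPUT]
    name   prometheus_exporter
    match  node_metrics
    port   2021
```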
So I think all of that will be the output of, hopefully, this focus on outcomes and driving better outcomes for the end user developers. I do think there are going to be other implications of this as well, and one of them is that the amount of observability data produced is going to continue to grow up and to the right, and it will probably outpace the growth of infrastructure overall. Because as more developers adopt this mindset, there's going to be more instrumentation and more data produced. That's going to be great, because we need it to get to better remediation, but I think the unintended consequence is a lot more data being produced. And it's not something limited to the monitoring and observability space; we see this in big data and other industries as well. That has implications for the central observability team, or the SRE team, that manages and runs all of the infrastructure and observability tooling that the rest of the developers use and depend on, and there are probably two large implications there.

The first is that as observability tooling becomes a more important tool in the developers' toolset, the reliability of that system becomes more important. This comes from my experience at Uber, where we built a hugely powerful metrics backend, yet we couldn't prevent a single developer from writing a single line of code that inadvertently emitted high-cardinality metrics, or, you can imagine, a flood of log messages, and took down the backend and impacted every other developer in the company. So I think there's going to be a lot more focus on the reliability of those tools, simply because there's going to be a larger dependence on them.

The second is that monitoring data, as I mentioned earlier, is going to grow at a much faster rate than our spend on, or use of, infrastructure. At a certain point, the central observability team or SRE team will have to focus on implementing best practices for developers: helping them understand the implications of their instrumentation, and optimizing the data being produced so that it still solves the problem and optimizes the outcome, but not in a way that says, hey, just produce as much data as you can and hope for the best. So I think there will be a lot of focus on how to deal with that side of the problem as well. But yeah, that's probably my best guess at what we're going to see in the next three years.

Awesome. Yeah, and I think everyone's going to be a part of it, right? If you're watching this, you are probably at the forefront of observability, so I really appreciate your answer and your honesty; it's maybe an expensive future. With that, I think we can go ahead and close up. Thank you so much again, Martin, for your time and your insights, and we'll chat again soon.

Yeah, thanks for having me today, and hopefully next time we chat we can do it in person, fingers crossed.