Hi, this is your host Swapnil Bhartiya, and welcome to a brand new series, T3M, the topic of the month. This month, July, is all about observability, and who better to kick-start the series than Rob Hirschfeld, CEO and co-founder of RackN. Rob, it's great to have you on the show again.

Swap, it's a pleasure to be here.

We have been talking about observability for a very long time; I remember the early-day discussions as well. But it has come a long, long way, and if you look at the whole ecosystem, there are a lot of pieces: monitoring, tracing, logging. So I want to hear from you. First of all, what do we mean by observability in the cloud, cloud-native Kubernetes, or traditional IT space? And then we'll talk about how you have seen it evolve over time.

It's really important to understand how much observability has been transformed in the last couple of years, because we started from some very simple log monitoring and tracing. But when containers were introduced, we ended up with production systems in which the actual running instances where those logs were generated were so ephemeral that it became incredibly difficult for operator and developer teams to support the infrastructure: the machine they went to check on was no longer in existence, right? It had just rotated out of Kubernetes or Swarm, whichever system you were using. This high-churn, ephemeral environment created real challenges from a monitoring, logging, and figuring-out-what-was-broken perspective. And we did end up needing a new word besides monitoring and logging, which is, I think, where observability came in.

But observability takes it a step further. It's not just the idea that, oh, I am doing a good job capturing my logs. The intersecting trend is having development teams own their code in production. This faster desk-to-production pipeline also means that developers are injecting observability points into their code: logging, tracing, the different data they want to collect. So this isn't just, can I look at my logs and find errors? It's actually, can I instrument my code in a way that allows me to produce better results? This is the intersection of several different trends all coming together in a very favorable way, combined with the idea that I now also need a system that is pulling data live from my code; I can't rely on log exhaust like I used to.

Earlier this year, we also talked about the cloud-native or Kubernetes-specific complexity. So when we look at observability, does it add another layer of complexity for DevOps teams? Of course it does. And if it does, why is observability still important enough for them to deal with it? As you said last time, the complexity is not going to go away; we have to learn how to deal with it. So talk about that aspect.

It does add complexity, but it also removes complexity. One of the things I was mentioning is this idea of trying to track down where something ran, and that has always been a challenge when you run a distributed system. So while it adds the complexity of having another piece of infrastructure to collect this data and do the real-time trending and analysis, what it definitely eliminates is this whole idea that I actually have to figure out where my code is running. Removing that burden from the team has been a really powerful component in making this work.
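To make the idea of developer-injected observability points concrete, here is a minimal, hypothetical sketch, not drawn from the interview or from RackN's code, of wrapping a function in an OpenTelemetry span using the Go client. The service name, function names, and attributes are all assumptions for the example.

```go
// Hypothetical sketch: a developer-owned instrumentation point using the
// OpenTelemetry Go API. With no exporter configured, the global tracer is a
// no-op; in a real setup an OTLP or Jaeger exporter would be wired up at startup.
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// tracer comes from the globally registered provider; provider setup is
// assumed to happen elsewhere.
var tracer = otel.Tracer("checkout-service")

// processOrder shows where the observability points go: a span around the
// work, the data the developer chose to collect, and the error recorded on
// the span instead of only in log exhaust.
func processOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "processOrder")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	if err := chargeCard(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "charge failed")
		return err
	}
	return nil
}

// chargeCard is a placeholder for the real downstream call.
func chargeCard(ctx context.Context, orderID string) error {
	_ = ctx
	_ = orderID
	return errors.New("payment backend not implemented")
}

func main() {
	_ = processOrder(context.Background(), "order-42")
}
```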
Because I remember many times, when I've been troubleshooting an application in the past, where you have to spend a significant amount of time not just figuring out what the bug is, but where that trace is. Can you find the log? Does the log still exist? What server was it on? Did that move? We've eliminated a lot of that part of the troubleshooting, and overall that's a tremendous win for DevOps teams, developer teams, operational teams, security teams, across the board. There's a huge benefit to being able to aggregate information across a distributed system.

And if you look at DevOps teams specifically, how hard is it? What are the things that make it harder for them, if it does make things harder?

One of the challenges with observability is that it definitely requires an additional platform. It's not as simple as, I just want to read logs and check things out. You need to use and learn a platform, and you also have to have developers, and access to code, that can get instrumented. A lot of companies now, RackN included, are doing things like building Prometheus metrics and specs into our products, so it's actually pretty easy to attach into observability points in modern products. But if you're dealing with a product that doesn't have that built in, or older code that doesn't have it, then what you'll find is that these platforms don't have the hooks they need to give you the quality results you're looking for. So there can be a bit of a cliff between, you know, older, legacy applications and newer, modern applications. I really despise "legacy" and "modern" in these cases; call them observability-enabled and pre-observability application bases.

Perfect, thank you. Now let's also talk about how DevOps teams are trying to solve some of these problems and how they can improve observability.

DevOps is a challenge from an observability perspective because it doesn't have the entry points to, say, put Prometheus metrics into systems. We get this request from customers who are looking at the base Digital Rebar platform, seeing the Prometheus statistics, and getting excited. But then they ask, well, how do I know if my scripts are working? Can I get observability statistics from the scripts? Most DevOps tools are not really designed to have that type of observability data output where you can just monitor it. Part of that is because they're really job or task specific, and it's very hard to tie in observability data on a job-by-job basis versus a running-process basis. So development teams really need to think about observability differently. It's critically important, and we need to do a much, much better job with transparency and producing data. The types of things that we do for RackN customers: we provide a lot more information about job starts and stops, we provide really detailed logging and information out of the gate, and we provide hooks where different triggers, events, or alerts can be fired. So what you really need to think about is that the things that make a great observability system for an application stack are good injection points: places where you emit data, places where you emit events, places where you can hook in and track what is going on. That's what makes a great observability system.
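As a hedged illustration of the kind of job-level data described above, starts, completions, and durations, here is a small sketch using the standard Go Prometheus client. The metric names and the task runner are invented for the example; they are not Digital Rebar's actual metrics.

```go
// Hypothetical sketch: exposing job-level counters and a duration histogram
// over the standard Prometheus client so a scrape target exists even for
// batch-style automation work.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Job starts and stops, the basic events mentioned in the conversation.
	jobsStarted = promauto.NewCounter(prometheus.CounterOpts{
		Name: "automation_jobs_started_total",
		Help: "Jobs that have begun running.",
	})
	jobsCompleted = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "automation_jobs_completed_total",
		Help: "Jobs finished, labeled by result.",
	}, []string{"result"})

	// Duration histogram: variance here is the kind of early warning described.
	jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "automation_job_duration_seconds",
		Help:    "How long each job took to run.",
		Buckets: prometheus.DefBuckets,
	})
)

// runJob wraps an arbitrary task with the observability points.
func runJob(task func() error) {
	jobsStarted.Inc()
	start := time.Now()
	err := task()
	jobDuration.Observe(time.Since(start).Seconds())
	if err != nil {
		jobsCompleted.WithLabelValues("failure").Inc()
		return
	}
	jobsCompleted.WithLabelValues("success").Inc()
}

func main() {
	// Simulated recurring task standing in for a real provisioning step.
	go func() {
		for {
			runJob(func() error {
				time.Sleep(200 * time.Millisecond)
				return nil
			})
		}
	}()

	// Prometheus scrapes this endpoint; the port and path are arbitrary here.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```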
So even if you don't have access to Prometheus-type metrics or real-time logs out of the system, you can still do a tremendous amount to improve the transparency and observability of a DevOps infrastructure platform. And those are the hallmarks: looking at the events, being able to feed event activity, the duration of events, how long things take. Get a scatter chart; we have a scatter chart for tasks so that you can start monitoring system performance in multiple ways and then expose that from an observability perspective. It's incredibly effective at improving operational effectiveness.

When we look at observability, one of its aspects is that it also helps with improving security. But does observability itself introduce any security risks for DevOps?

It can. You have to be careful, as you're emitting things into logs, that you aren't emitting secure or sensitive information. We have this conversation with customers all the time. There are definitely things that come through in DevOps logs that you have to be sensitive to: passwords, IP addresses, system names. Ideally, you code very carefully to avoid having any password, credential, or certificate leakage; you really, really don't want those things in your logs. So it's vitally important that you have a due diligence process where you look at those things and make sure they're not being included in logs. And if you do need them in logs, and there are places where we occasionally need them, you do it in a way that they are not on unless you need them on: a debug flag, some type of trace protection, so that you're not automatically emitting this type of information out of your system. That's absolutely critical. One of the things that we do, and we feel it is really important, is that because we work to share and reuse automation as much as possible, and it's RackN's mission to create a shared base of automation, we can invest the time to do those types of protections and cleanups. A lot of times with one-off scripts, when people are just writing something, they don't think through the ramifications of an inopportune echo, an output, or leaving a file on a system. By having shared code, we can invest the time to do the proper due diligence on security, on cleanup, and on resilience. I can't tell you enough, Swap, how important it is: shared code is going to be more secure code, just because more people evaluate it, review it, and then invest the time to make it better.
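To illustrate the "not on unless you need it" approach to sensitive data in logs, here is a small hypothetical sketch, not RackN's implementation: credential-looking values are masked before log lines are emitted, and raw output only appears when an explicit debug flag is set. The environment variable name and the pattern are assumptions for the example.

```go
// Hypothetical sketch: redact credential-looking values from log output by
// default, and only pass them through when an explicit debug flag is set.
package main

import (
	"log"
	"os"
	"regexp"
)

// debugSecrets must be switched on deliberately; the default is redaction.
var debugSecrets = os.Getenv("DEBUG_SECRETS") == "true"

// credentialPattern is a rough illustration; a real system would match its own
// token, password, and certificate formats.
var credentialPattern = regexp.MustCompile(`(?i)(password|token|secret)=\S+`)

// redact masks anything that looks like a credential unless debugging is on.
func redact(line string) string {
	if debugSecrets {
		return line
	}
	return credentialPattern.ReplaceAllString(line, "${1}=[REDACTED]")
}

func main() {
	// An inopportune echo from a provisioning script, run through redaction
	// before it ever reaches the aggregated logs.
	log.Println(redact("connecting to 10.0.0.5 with password=hunter2"))
}
```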
You've already talked about how observability has evolved over time, so what kind of evolution are you seeing now? We talk about Kubernetes in production and cloud native a lot these days. Observability's early days were more about awareness; now we're once again talking about things in production, and as things move into production they continue to evolve. So how do you see it evolving further?

I do think, and AI is touching a lot of what we're doing, that the impact of artificial intelligence on these systems will mean we're producing logs, and doing log and metrics work, that is designed to be fed into artificial intelligences for grooming even more than for people. One of the things we do today is look at how to make a system more observable for humans; one of the things I would expect in the future is that we'll start looking at how to make systems more observable for an AI to help diagnose, troubleshoot, and find problems. My expectation is that we're going to have more voluminous logs and a lot more signals to sort through, on the hope that a machine learning model might pick up a trend line faster than a human can. Humans have the opposite problem: if we give them too much information, it all becomes noise. So I do think we're going to cross into a range where observability is driving a machine learning algorithm as a first pass instead of a human as a first pass. That will be a ground shift in how things work, and it will place a lot more strain on the platforms supporting observability, logging, and monitoring.

How does RackN help its customers with observability?

We think about this quite a bit; it's core to how RackN designs our product, Digital Rebar, to make sure that operators of the platform can quickly figure out and see what's going on, what we call a no-black-box mentality. If the system has magic in it, it's very hard for operators to figure out what's going on even on a regular basis, let alone when they're troubleshooting something. So we spend a lot of time making sure that the system produces a lot of very good, usable information. That means there are logs: you can see what's going on and track it. There are normally generated statistics so that you can track even variance: jobs might succeed, but if they're operating more slowly, that can be an indication of a problem. We've added eventing so that actions can throw events or raise alerts. That additional layer of interaction with the system makes a tremendous difference in observability, because we see this all the time: a customer will be running a task, and if that task starts to vary in how it performs, it can mean there's a problem within their system that they need to address. Timeouts are happening, or retries are happening, and that becomes an early signal of a much larger problem. Another example would be adding event emissions into routine tasks. You might have a task that is operating normally, but you want to be able to say, before, after, or during this action, I want you to emit an event so I can detect it, catch it, and put additional logging around it. Those types of actions have huge paybacks for customers. A lot of times, it's funny, they don't realize how important that type of work is. When they're just looking at the code, they see the actions we've embedded because of our hard-earned experience operating in the field, and they see that as complexity. The reality is that it's built-in resilience. So I would encourage people to think this way: when you're looking at a system that has complexity you don't understand, a lot of times that complexity is our battle scars, our learned resilience, where the system has added behaviors that make it more supportable and more observable, and you just don't yet have the experience to see why they were added.
Fundamentally, the product work that we do reflects the fact that many of us have over 20 years of experience running and operating infrastructure, so the things you see in our product are a direct reflection of that hard-earned experience, not just at RackN but from our previous lives and roles.

Rob, thank you so much for taking time out today to talk about observability. As usual, I would love to chat with you again. Thank you.

I'm looking forward to it. Thank you.