The role of the SRE is to define for the engineering team how an application and a service should be monitored, how it should be observed, what parameters should be laid out, what alerts should be laid out, and what the alert routing should be. Their job is not to receive the alerts. Their job is to help the engineering team take ownership of their service.

Hi, this is your host, Swapnil Bhartiya, and welcome to T3M, our Topic of the Month. The topic of this month is SRE. And today we have with us once again Asaf Yigal, co-founder and CTO of Logz.io. Asaf, it's great to have you on the show again.

Hi, nice to be here again.

Before we deep dive into this topic, let's quickly remind our viewers what Logz.io is all about.

So Logz.io is an observability solution. We are based on open source observability, but we complete it to be more in line with how modern applications are being developed: cloud-based, Kubernetes supported, and all that good stuff. We're focusing on the user experience. We're focusing on the ability to offer observability at a reasonable cost, with a lot of technology to ensure optimization and to ensure that you only store the data that you need.

When you look at reliability and observability, do they complement each other, or are they competing disciplines where teams are fighting, "We are doing observability and you are doing reliability"? Or is it just that a lot of things overlap? Does the question make sense?

Yeah, it does make sense, and I think there's definitely some level of overlap, and I think this is where we come in. I think a lot of the SREs, the site reliability engineers, see the world in what we call the horizontal way. They see the capacity being used. They see the Kubernetes cluster. They see the cloud utilization, but they don't see the impact of what they do on the applications.
As opposed to that, observability, which is looked at more from the developer side and the business side, sees the application. Like you rightfully said in the beginning, the whole goal of observability is this: I want to know if my application is serving my customers at the service level that I agreed to. I don't care how many pods are being run, how many nodes are being deployed, or what the cloud infrastructure underneath is. I don't care about that, because all I need to know is that I can deliver the service. I think this is where there is a little bit of a disconnect between the two of them, and there has been a slight merge coming over the previous year, with more to come. So this is how we see it.

You mentioned SLAs. We have started talking about SLOs as well, service level objectives too. Talk a bit about the slight difference between the two. And when we look at reliability or observability, once again looking at businesses, what makes more sense: just SLAs, because, hey, you know what, we have fulfilled that, or SLOs?

I think for a business, what makes more sense is the SLOs. So I have my objectives. And a lot of companies, we see it, say that they do observability, but what they do is just monitoring. And this is coming from the SRE, or this is coming from the DevOps engineering. It's not coming from the developer or the business. The way we see it, in order for a company to make the transition to observability, they need to define SLOs. They need to build their observability system around their SLOs, and they need to ensure that what they do meets their service level objectives. This is the only thing that they should care about. Unfortunately, a lot of companies, especially in our domain, come from DevOps and SRE, and they monitor everything.
They monitor every single CPU and every memory utilization of every node, every pod, and every machine that they're running, and environments are getting very complex. At the end of the day, this results in huge alert fatigue, and they're not achieving the results. So if my CPU is spiking or my memory consumption is low or my pod is restarting, how would I know whether I'm hurting the business or not, which is the only thing I should care about? What we're doing with companies is helping them transition from an SRE or DevOps organization to an observability and SLO organization. By doing that, we reduce their alert fatigue. We're getting them to be more stable, and hopefully also making them understand the trade-offs in their environments between cost, security, and availability, which is kind of a triangle that lives together.

What are some other challenges that teams face, depending on how far along they are in their observability or reliability journey? Because cloud native is complicated and complex, with so many moving parts. There is also tool and vendor sprawl, which is actually good. Diversification of technology is good, but it can become intimidating for users as well. So talk about the challenges that you see customers often face.

So obviously, the main challenge is the alert fatigue and the mean time to resolution, what's being called MTTR, the mean time to resolve an issue. We've seen in our survey that over the years the mean time to resolution is just growing, and you would expect it to go down, because there are more tools today that can address this and the industry is progressing. I think the two go hand in hand, because companies define so many alerts on so many things that they shouldn't care about. I mean, the Kubernetes cluster is already addressing restarts, already addressing scalability, already addressing a lot of the things that you should just let it do.
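As a minimal sketch of the SLO-first alerting described here, the Python below decides whether to page from how fast a service is spending its error budget rather than from CPU, memory, or pod restarts. The `SLO` class, the function names, the `checkout` service, and the paging threshold of 10 are illustrative assumptions, not anything named in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A service level objective (hypothetical shape, for illustration)."""
    service: str
    target: float  # fraction of requests that must succeed, e.g. 0.999

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget to burn.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the error budget is being spent exactly on schedule;
    higher values mean it is being spent faster.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing was burned
    allowed_error_rate = 1.0 - slo.target
    observed_error_rate = 1.0 - good / total
    return observed_error_rate / allowed_error_rate


def should_page(slo: SLO, good: int, total: int, threshold: float = 10.0) -> bool:
    """Page only when customers are affected well beyond what the SLO allows.

    The default threshold is an assumed value; each team would tune its own.
    """
    return burn_rate(slo, good, total) >= threshold


if __name__ == "__main__":
    checkout = SLO(service="checkout", target=0.999)
    # 10 failures in 10,000 requests: on budget, nobody is paged.
    print(should_page(checkout, good=9_990, total=10_000))
    # 200 failures in 10,000 requests: roughly 20x the allowed rate, page.
    print(should_page(checkout, good=9_800, total=10_000))
```

A CPU spike or a pod restart never reaches this check on its own; it only matters if it pushes the success ratio down far enough to threaten the objective.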
You get to a point where you get alert fatigue. When you get alert fatigue, you have a higher probability of missing the alerts that you actually care about, as opposed to just looking at those alerts and making sure that you address them in a timely manner. So I think this is where it starts. When you mention the tool sprawl, it is an issue, but it's a solvable issue. There are plenty of companies that we work with that use several tools for observability. It's all okay as long as they have unified data collection, as long as they're using OpenTelemetry or something else that makes sure the data is unified across all the platforms. Because if I'm seeing an alert on my monitoring system for a specific environment, I need to be able to find that service in my logs without trying to guess what that service is called there and what it does there. So just the ability to do that is really important. It all starts from the data collection. Without a proper strategy for data collection, the probability of successfully setting up observability is really limited, and it's a problem.

What is the SCOVEN? Once again, when we look at the whole system, where does security come into play? And also, KubeCon is coming up, so I want to talk about it from the Kubernetes perspective as well.

So obviously, the adoption of Kubernetes is astonishing. It's faster and higher than the adoption of cloud technology. So Kubernetes is being adopted, and not only Kubernetes but all the different flavors, whether it's Kubernetes running on a managed service like EKS or AKS, or Kubernetes running on a managed service with Fargate, a serverless technology that's running Kubernetes. There are a lot of flavors. It does create a lot of flexibility, but every time you create flexibility, you also create complexity. And the complexity shows itself when it comes to troubleshooting.
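The point above about unified data collection can be sketched in a few lines of Python: if alerts and logs both carry the same service identifier, going from an alert to the matching logs is a lookup rather than a guess. The `service.name` key follows the OpenTelemetry resource attribute convention; the dictionary shapes, the function name, and the sample records are hypothetical and only stand in for whatever the real monitoring and logging tools emit.

```python
from typing import Any

# OpenTelemetry's conventional resource attribute for the service identity.
SERVICE_KEY = "service.name"

Record = dict[str, Any]


def logs_for_alert(alert: Record, logs: list[Record]) -> list[Record]:
    """Return the log records emitted by the service that raised the alert.

    This only works because both signals were collected with the same
    attribute key and the same value for the same service.
    """
    service = alert["resource"].get(SERVICE_KEY)
    if service is None:
        return []  # the alert was collected without a service identity
    return [r for r in logs if r["resource"].get(SERVICE_KEY) == service]


if __name__ == "__main__":
    # Hypothetical alert from a metrics tool.
    alert = {
        "name": "checkout SLO burn",
        "resource": {SERVICE_KEY: "checkout", "k8s.cluster.name": "prod-eu"},
    }
    # Hypothetical log records from a separate logging tool.
    logs = [
        {"body": "payment timeout", "resource": {SERVICE_KEY: "checkout"}},
        {"body": "cache warmed", "resource": {SERVICE_KEY: "catalog"}},
        {"body": "legacy line", "resource": {"app": "checkout-svc"}},
    ]
    for record in logs_for_alert(alert, logs):
        print(record["body"])
```

The third record illustrates the failure mode: it comes from the same service but was collected under a different key and name, so it cannot be matched without guessing.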
When there is a problem and there is complexity, you actually increase the time it takes you to resolve it. And I'll repeat what I said at the beginning of the call. There are two ways to look at the environment. One of them is to look at it horizontally. I want to see my infrastructure. I want to see how my application is laid out on my infrastructure. If I want to see the Kubernetes cluster, I want to see all the pods that are laid out on it, how the nodes are being balanced, how much it costs, and how it operates. The second way is to look at it vertically. So I'm an application owner. I have a hundred different pods that are running on five different clusters, and I want to see the relationships between all of them, because I want to see how they are performing. The way we see it, security plays into both of these teams. From an infrastructure perspective, I want to make sure that what I'm deploying into my production environment is secure. And also as an application owner, though a little bit less, because as an application owner my first and foremost responsibility is availability and my service level objective. But if I am introducing security vulnerabilities into the organization, I want to know about it, so I can make a decision about how important it is. And I am the only one who has the capability of fixing it, if this is what I do.

Let's go back to SREs within teams: whose responsibility is monitoring, given that alert fatigue is already there? Because it's easy for us to say, hey, SRE teams are doing that. But a lot of organizations don't have these kinds of labels, that you are the SRE team, you are this team, you are that team.
So talk a bit, from a realistic perspective, about whose responsibility it is and what kind of cultural changes are needed within organizations, so that we are looking at things, whether it's security, reliability, or observability, from a holistic perspective.

I think some organizations that we see are having a hard time transitioning from the way they used to work to the new way. Someone reads an article from Google about what an SRE is, adopts whatever he adopts from the article, and that's it. They don't really go the full length, which is really important. The role of the SRE is to define for the engineering team how an application and a service should be monitored, how it should be observed, what parameters should be laid out, what alerts should be laid out, and what the alert routing should be. Their job is not to receive the alerts. Their job is to help the engineering team take ownership of their service. Now, some organizations say they want to do it, but they're not there. What happens is that you have the good old development team and monitoring team, and they just call it a different name, but that's the reality. The monitoring team receives the alerts and is supposed to address them, and you use the development team as a second line of troubleshooting, if you will. I think the reality is that if you look at Netflix and you look at Google and all the other companies, the ownership of the quality of the service lies within the business, lies within engineering. And the SRE team is there to define what needs to be done. What are the best practices? What are the tools that are being used? How do you do data collection? That way you have some level of consistency throughout your organization. And I think we see it clearly in organizations that haven't transitioned: they just tell the story.
But basically, their SREs are good old monitoring people. They monitor every single metric that you have. They have alerts for some of them and they have a playbook for some of them. But it's just not the way it's supposed to work. And it's not scalable as you go to hundreds and thousands of developers and different teams that are all sharing the same environment.

How do you folks make it easier for teams to navigate through some of these challenges, so that they can continue to focus on adding value to their businesses versus getting overwhelmed with all these complexities and challenges?

Yeah, I think the way we do it is we offer them two ways of looking at their environment. One of them is what the SRE team is looking at: we have Kubernetes 360, in which you can look at the clusters and see all the information. You can see how applications are laid out on them. The second view is a service-level overview. So you can see each one of the services, how it's reacting, and how it's communicating with the others. Is it secure? Are you meeting your service level objective for that service? So providing these two views to two different sets of people is really helpful. That's one way. The other way is education. We've developed our own view of what the well-architected way of doing observability is. It has a lot of questions and a lot of guidance on things organizations haven't thought about in how to do it. And we guide them through the path, because going from monitoring to observability is not a simple thing. It's almost like a quantum leap. You have to understand that it's a different ballgame, a different culture, a different structure of organization, and a different responsibility, and we help organizations through that transition. I think a lot of organizations start their observability journey with open source.
The challenge with open source is that it's limited. We're talking about three different leaders in the open source world. Each one of them covers a segment, but they don't actually work well together. So you look at logging, which is dominated by OpenSearch. You look at metrics, which is dominated by Prometheus. You look at tracing, which is dominated by what used to be Jaeger and is now OpenTelemetry. But each one of them is a project. It's not an observability product. It doesn't tie all the information together, like we said in the beginning. What we do at Logz.io is help organizations that started their journey with open source transition to observability, both through education and through a product where we overlay the observability capabilities on top of the open source. And that's the unique thing about what we do. You're not losing anything you've done so far. You still have the same capabilities, but you have observability capabilities laid on top of the open source monitoring that you used to have.

Asaf, thank you so much for taking time out today to talk about reliability and observability. And thanks for all those insights. I look forward to chatting with you again. Thank you.

Thank you very much. Thanks for having me.