Hi, this is your host Swapnil Bhartiya, and welcome to a special edition of TFiR Let's Talk for SLOconf. Today we have two guests from Lightstep: Adriana Villela, senior technical leader, and Ana Margarita Medina, staff developer advocate. Adriana, Ana, it's great to have you both on the show.
Super nice to be here. Thanks for having us.
I've been a listener of these shows, so I'm very excited to be here too and to get to meet you.
It's my honor to have you both on the show. Today we are going to talk about translating failures into SLOs. Before we get into the specific topic, I would like to hear from you folks: how would you define SLOs in today's world?
I think the simplest answer to me is that an SLO is a metric your organization uses as its reliability goal. We're finally able to put numbers on the path to reliability, make sure they're closely tied to user impact, and all work cohesively as an organization to treat reliability as a foundational KPI.
Yeah, I definitely agree with what Ana said. Just to add to that, SLOs are a super useful tool in the reliability tool belt, and I think they're a key part of really upping your observability game as well. They should drive what you alert on.
Thanks for explaining that. Now, if you look at a term like chaos engineering, it's actually not that chaotic. It's still very strategic and very well planned, and it involves the whole organization. Can you talk about how comfortable and how aware modern companies are about this, and how companies tend to get comfortable with failure and with practices like SLOs and chaos engineering?
I would say that many companies are still not comfortable with that, unfortunately.
I think a lot of organizations are still trying to wrap their heads around SRE, let alone trying to figure out the whole SLO game. Wrapping their heads around failure is also very challenging for many organizations, especially the larger ones. For the smaller, startup-type organizations it becomes a lot easier. But for large organizations there's a cultural aspect to it, where it's almost taboo to say anything like "failure." So it becomes extremely difficult to get into the mindset that it's okay to fail, it's okay to admit that there's a problem, and to iterate on that.
Yeah, I definitely see what Adriana was mentioning too. I've been working in SRE for the last seven years, and the first few years were specifically in chaos engineering. It was interesting: I was one of the early adopters, helping companies be early adopters, and adoption didn't necessarily move fast, but we did start seeing some industries, such as the financial industry, start saying, "Aw snap, outages cost a lot of money. We need to actually embrace failure." So that was really neat to see. But like Adriana was saying, a lot of organizations are having a really hard time just adopting SRE. And we need to remember that SRE is a lot of culture. You're coming in and revamping people's mindsets and the way you do software development. So all of a sudden you're trying to do too much at the same time.
And yeah, sometimes when we look at chaos engineering, it looks like you're trying to replace the steering wheel of a car while you're on the highway, but that is not actually how it is. If I'm not wrong, it's very well orchestrated and very well planned. It seems that some of the challenges in the path of adoption of these practices, if I'm not wrong, are awareness and education, because the tools are there. But once again, you can bring a horse to the lake, but you cannot make it drink. So talk a bit about what kind of challenges you see when it comes to adoption of practices like SLOs and chaos engineering.
I can for sure speak to the fact that education plays a big part. I went to visit a lot of companies and they were like, "We're getting started. We don't know what we're doing. Can you actually teach us what SRE is, what SLOs are? How do we think about monitoring and observability so that everyone is working on the same page?" Every organization has different issues, but everyone's going through the same chaos. So with an education-first approach, it's "let me teach you," and then we all collectively get to move forward to the next step on the SRE path, or transition to something like chaos engineering. A lot of what I am preaching is that reliability is a team sport. Everyone in the organization needs to come together. Leadership needs to say, "Reliability is one of our OKRs. These are the metrics that we're looking at. This is going to be part of our quarterly review." Until leadership makes that statement, the rest of the leaders underneath and the engineers are not going to take it as seriously.
And that also comes with making sure that the folks doing this heavy work are getting promoted, and that you're also looking at the outages that are happening and at burnout, so that the load is equally distributed and folks are carrying this reliability goal together.
As you were saying earlier, even when we look at practices like SRE, a lot of organizations still haven't figured out what DevOps actually is, forget about SRE. We like to talk about these terms internally, but a lot of organizations struggle. So once again, the point is that cultural change is needed. You talked about a kind of top-down approach, but for the organizations that are embracing things like chaos engineering or SLOs, what kind of cultural changes have they embraced, so that others have a model to follow? "Hey, this is how it's done." Can you talk about that?
I personally have really enjoyed seeing a lot of organizations come out and say, "We're going to start just with postmortems. We're going to have blameless postmortems. And you know what? Leadership, you're going to be part of those meetings, and you're going to see how we talk about the contributing factors across these failures without pinpointing and saying that so-and-so took this action that brought down the system." I think that's one simple, inclusive way to start going about it. Another one, of course, is doing tabletop chaos engineering exercises, where it's like, "We're now embracing failure. What happens if we lose access to Amazon S3, or if we don't have access to our database?" and you get those mental models rolling. All of a sudden you're showing leadership, "Look, we're doing exercises because we care about this." And that also comes full circle to learning from failure and learning from incidents. Let's look at what has happened in the last quarter.
How can we take those learnings and make sure those failures don't happen again: set up SLOs, write documentation and runbooks, automate processes, or even just realize that you have zero observability into your application, you don't know what the hell's going on, and you actually need to start instrumenting it and move forward from there.
I would also add that it's really important to have the support of executives, right? But I also feel like there's a convergence of top-down and bottom-up, so you also need the buy-in of engineers. A lot of advocacy goes a long way. Internal advocacy and external advocacy are both helpful, because I think external advocacy inspires the engineers at an organization to be like, "Oh my God, I totally wanna try this," and then they're the ones who will advocate internally. I ran a DevOps transformation at one of the Canadian banks several years ago, and one of the things we spent a lot of time doing was advocacy: talking to people, getting people to understand what we're doing, why we're doing it, why this is awesome, and why they should get on board. By doing that, you'll get a lot of organic adoption of these types of things as well. So it makes a huge difference.
Now we have seen both sides: the companies that are hesitant and the companies that are embracing it. If you look at the overall picture, are you happy with the adoption of these practices? Or do you feel that we still have a lot of ground to cover, or that it's just the natural progression of any such technology? Where do you think we are?
I think we need to go a little bit further. I don't think organizations are quite where they need to be.
I think there are a lot of positive things coming about from various organizations, because there's this awareness, right? Observability and reliability are becoming part of the dialogue within organizations. But I think we're still at the point where things are being woefully misrepresented, right? We're just rebranding operations teams as SREs, just like we did with DevOps, where we inserted an entire team in the middle for this so-called DevOps transformation. So I think we're having the right conversations, but there's still a long way to go. Having the right people in place at an organization to make that happen will really help, but I won't lie, it's an uphill battle, right? It's not gonna be roses and ponies all the way there.
Totally. Every organization is gonna have its own struggles in adoption, which is definitely what we're seeing. And we're also gonna see that some are maybe just gonna stay behind. Their engineering teams don't really wanna embrace the heavy lifting that is needed to actually adopt this, or the culture of the organization or the leadership is pushing back. But they're also gonna run into a lot of issues, whether it's scaling or reliability, things they're not able to do because they didn't do the heavy lifting of transformation, whether that's to cloud native or to reliability-first principles. I do think there's a lot of awareness in all different industries that reliability should be talked about a lot more, so that makes me really excited. And I think those conversations keep shifting left, and that's starting to get people to think about how we actually put reliability into our software development lifecycle. That's some of the other stuff I've been getting a chance to work on.
Earlier we were talking about culture, but we did not talk much about tools. I don't wanna talk about a specific tool or a specific company, but how mature do you see the ecosystem as being? Whether we look at it top-down or at dev teams embracing it, do they have the tools they need available in their arsenal? How mature is the market or the ecosystem for SRE, SLOs and chaos engineering?
I think it depends. There are some tools out there that I think get it. And then you've got tools out there that were doing one thing and have rebranded themselves to say they're now doing this thing. So when you're trying to decide what tool to adopt, you really need to have a good understanding of the space. Do your research, do comparisons, talk to people who are using the tool, and talk to people in the community, because I think that's gonna help. I wouldn't say they're all equivalent and all do the same things. Some do it better than others.
There are definitely a lot of tools coming about, which is exciting, but we also have to keep in mind that there's marketing jargon coming along with them. So you do have to do your due diligence: is this a tool that actually follows SRE fundamentals and that matches my organization? I see a lot of folks who are like, "Oh, so-and-so recommended this SaaS SRE tool. We're gonna do it." And it actually doesn't do anything for them because, one, the tool was not what the organization needed, so it didn't solve the critical problem they were having. And two, they didn't get any education on the tool, and their engineers didn't want the tool. So once again, everyone needs to be involved in reliability for this to work.
ICs should be comfortable with wanting to use the tool, and leadership should have done the due diligence of "there are other companies in the same vertical that are using this, and these are the wins they're having." And like Adriana said, talk to other users. That's also one of the things that is coming about: these tools, and SRE in general, keep getting more communities built around them. So we're getting more spaces to compare and take notes: what are you using? How's it going? Could there be something better? I know so many SREs who have started companies in the last two or three years because it's like, "We've built the exact same damn tool at three companies. This is getting tiring. And all these companies don't let us open source it."
If you look at the overall evolution of this SRE space, what is one thing that you would like to see this year? It could be cultural, it could be tools, it could be open source. What would that be?
For me, I would like to see observability be more a part of the SRE conversation. I feel like right now observability is still a separate thing that's done apart from SRE, but honestly, I don't think you can be successful with SRE these days without observability. So I don't want to hear about observability separately anymore.
And for me, I want to continue pushing the envelope and bringing reliability and observability into the software development life cycle. What that means is putting quality gates in place before you deploy to production that allow you to ask, "How does this actually do when we deploy to production? Does this lower the service level objective? Does this even have documentation set up?"
Make sure there is a lot more oversight of the things that are going out to production and to customers, so that we continue putting customers first and think about the impact every single code change we push might have on them. And that also comes with getting a chance to observe how all of our deployments are going.
One more question I want to ask: the top leadership is mostly about business, and developers are more about running the project and the codebase successfully. What advice or tips do you have for both executives and developers, so that they see observability and chaos engineering as something that has a lot of business value to them? What would that be, if I let you choose as an advocate for observability?
I think advocating for observability to a developer is pretty easy, because developers have all been in the position at some point in their careers where things go wrong and they get called in the middle of the night. So I would say the compelling use case for observability is, "Hey, wouldn't it be nice to have the information at your fingertips to be able to resolve this weird thing that's happening in the middle of the night in a relatively timely manner?" And then to tug at the heartstrings of the executives, I'd say follow the money, right? If you show them, "Hey, if we have this observability thing in place, we're gonna save you money because our outages aren't gonna take as long," that's a super compelling argument, right?
I think that definitely goes straight to the point of why these things have to happen.
And I would also add that developers and SREs can leverage service level objectives as the metrics and North Stars that guide them: this is the software we need to keep reliable, this is what we instrument, this is the critical path of our application, and how do we protect it? Then make sure that's actually tied to the OKRs that leadership is putting in place and to the key performance indicators that show the business things like "we're getting more orders in," or "users are coming back to continue the user experience," or referrals, or whatever metric the business uses to show "we're growing the business year over year." How can those service level objectives tie into that? It makes those conversations easier, because all of a sudden you're talking in a similar language. You're now comparing percentages to how much money the business is bringing in by keeping this critical path happy.
The last question I would like to ask is about open source, because you folks also talked about it and, of course, open source is kind of my bread and butter in one way or another. As you were saying, we don't see much open source in this space. We all know the value of contributing to open source, but since you understand this ecosystem better than anyone else, what tips do you have for vendors, organizations and stakeholders in the observability space on becoming good open source citizens? Why should they do that, and what value do they get out of it?
I think for me, a compelling reason for organizations to open source is that we often find ourselves in positions where we keep reinventing the wheel. When you develop something cool that solves a business problem for you, putting it out into the world gives that cool thing an opportunity to blossom further, right? You have certain ideas of how something works, but someone else might have other ideas on how to improve it. So you can turn this thing that did a certain thing relatively well into a thing that does it really, really well, because you bring new ideas to the table. I think it also raises the profile of the organization as one that is open to collaboration and open to sharing. And I think a lot of developers would love to be more involved in open source, and depending on the organizations they work in, they're restricted from that, which really sucks. So you end up with happier developers as well. I think it's a win-win for everyone.
I think what Adriana said covers a lot of my thoughts. I would also add that by putting something in open source, you're sharing your great ideas and allowing for contributions, in the sense that you might realize it ties in better with another tool and you can build a better platform. If we take this and apply some other open source machine learning project, and get more contributors, all of a sudden we're building something even better that we didn't know we could build. So we are moving towards making technology a better place.
There's also the inclusion part: technology needs to be built by users worldwide, not just in one country or one city, which is a lot of what we've seen with Silicon Valley. Open source goes against that. You just need a computer and you have to be willing and hungry to learn, and there are people out there who are gonna wanna help you learn how to code, teach you systems, teach you Kubernetes, and teach you how to be a community member and write docs, or whatever the case might be. So when we position it that way, you're getting talent that you weren't able to reach in the past, and this makes you a better citizen. This makes your organization show up better for all of your customers.
Adriana, Ana, thank you so much for taking time out today to talk about this topic. I would love to have you folks back on the show.
Thank you.
Thank you. This was great. Thank you.