Hi, this is your host Swapnil Bhartiya, and welcome to our special series of Let's Talk interviews for SLOconf. Today we have with us Stefan Lips, staff software engineer at Procore. Stefan, it's great to have you on the show.

Thanks for having me. Excited to be here.

Reliability is a key attribute, or feature, that users of modern systems expect. Yet reliability is not a traditional metric such as response time, for example, which can be measured using observability tools. So if you look at the whole modern cloud native stack, what kind of options are there for SREs to measure system reliability?

Yeah, that is a really great question, because it touches on a key aspect of how SREs can be most effective. The key question here is really: what is reliability? The answer to that can be different depending on whom we ask, but in the end the key answer is always one, and that's the one from the user, the customer. And that is where, over the last 10 years or so, we've seen a shift from using traditional observability metrics, like the response time you mentioned, to using higher-level concepts like service level indicators, SLIs, and service level objectives, or targets. The SLIs and the associated SLOs and error budgets provide a tool for SREs to look at system reliability from the perspective of the user, while also gaining visibility into future degradation by considering the current reliability trends. The key to this approach really is to develop meaningful SLIs. In other words, not just relabel the mentioned response time as an SLI, but develop SLIs that cover the user experience from the user's perspective, where the term user experience really is a collective term for the different user journeys that a system supports. Fortunately, by now we have a lot of really awesome material available that covers the concept and the implementation of SLOs and SLIs. So anybody who's just getting
started on the SLO journey has a comprehensive body of knowledge and experience that they can draw from.

If you look at things from the perspective of an engineer, are there any potential pitfalls when it comes to measuring or quantifying system reliability? Is there any Achilles' heel that engineers should be aware of?

Oh, yes, absolutely. Before, we mentioned the concept of user journeys. But as engineers, we rarely experience those user journeys, at least not like our real-world users do. We are so close to the systems that we build and support, day in, day out, that we often tend to view a system and the associated performance and reliability in lower-level technical terms and details. For example: is our system available? Does it respond within n seconds? Do we get errors? Now, those are all of course valid metrics, and they impact the overall user journey or experience, but they don't paint the whole picture. For example, consider a system that has automatic retries: just because an error happens does not mean that the user journey or the experience is compromised. You could call it, if you want, the "not seeing the forest for the trees" syndrome. So to effectively use SLIs and SLOs to measure reliability, it is really important to take a bird's-eye view of the system, like a real user, because in the end it doesn't matter if the engineer thinks the system is reliable if the user thinks it is not.

Let's continue with this question. Let's say we have a system that has well-defined service level indicators and targets, SLOs. Does that allow us to comprehensively measure system reliability? Or is it also possible to have all service level indicators green and yet have the system fail?
Well, that is absolutely possible. Again, it comes down to what we choose for our SLIs and how we implement them. Another term that has been used is meaningful SLIs, and a great example here is from Alex Hidalgo. He did a talk called "Developing Meaningful SLIs", and I'm going to borrow an example from it, paraphrasing. Consider, as an example, a web service for real-time stock quotes, and let's assume that system has SLOs for response time and response format. So we go in, we request a stock quote, and we get a response with valid JSON, and the response time is within SLO. At this point our SLIs for response time and data format are green. But what if the response JSON data is outdated, or if it's for a completely different stock symbol than the one we requested? At that point it doesn't really matter if the response was timely and syntactically correct, because the system failed. It did not provide the response the user expected, and the original SLIs failed to identify the system failure. But in this example, if we now use a data correctness SLI instead of the response format SLI, then we inherently verify the availability and syntactic correctness. We have the same number of SLIs, but better coverage of the actual high-level user experience.

Earlier you were also talking about how users typically experience a system from a holistic point of view. You would want service level indicators to represent or measure the actual user experience. Is there a way to implement a perspective that approximates or duplicates this holistic view of a system for service level indicators?

That concept you mentioned, of a holistic view of a system, has actually been around for a while. In systems theory we have the concept of a black box.
It's a bit academic originally, and it describes system behavior in terms of a given input resulting in a particular output, without any insight into what happens inside the box and no consideration of the processing steps. In our field, software engineering, this concept has been embraced by quality engineering, where we commonly categorize tests as black box or white box tests. Now, if you take a step back, conceptually the user journey aligns with the black box paradigm: for a given input, a particular output is expected. We can adopt this concept for SLIs by aggregating granular metrics into higher-level SLIs that focus on the user journey or experience, using that as an indicator of system reliability.

Can you give an example of a black box SLI?

Yeah, sure. Let's consider a system that creates a user account by queuing requests. After a request for a new account from the queue has been processed, the user is notified via email. Standard. I actually just did that this morning for a new website that I hadn't ordered from before. In a system like that, the state of the queue can of course directly impact the user experience. For example, if the queue gets beyond a certain length, the user wait time increases. And what if the queue gets stuck? The user doesn't get their response. And so forth. So as engineers we can be tempted to think: well, let's fix that by putting an SLI on the queue. But does it really allow us to model the user experience?
Because if the queue gets stuck, we're notified after the fact, and the queue length is one, but far from the only, consideration that affects processing time. So in fact, if we look at it from the black box/white box perspective, an SLI for the queue length is actually a white box SLI. Now, a black box SLI, and that's the example you asked for, would be this: a request for a new account, that's the input, results in sending an email with an activation link, that's the output, within one minute. That's our success criterion. In case of failure, if the user does not receive the activation link, this could be due to the queue getting stuck. It could also be due to a whole bunch of different reasons, say backend database issues, network problems, and more, and a queue-specific SLI would not catch those. The high-level black box SLI captures all those conditions and at the same time lets us understand the impact on system reliability through the SLO's error budget and burn rate.

Stefan, thank you so much for taking time out today to talk about this topic. As usual, I would love to have you back on the show.

Thank you. Thank you very much for having me. It's been a pleasure.
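
[Editor's note: The black box SLI described in the interview (new-account request in, activation email out within one minute) can be sketched roughly as follows. This is only an illustration and not code from the interview. The SignupJourney record, the function names, and the 99% SLO target are assumptions made up for the example; the one-minute deadline is the only value taken from the conversation. Burn rate is computed here in the common sense of observed error rate divided by the error rate the SLO allows.]

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SignupJourney:
    """One observed user journey (hypothetical record, for illustration only)."""
    requested_at: float              # seconds: when the new-account request came in
    email_sent_at: Optional[float]   # seconds: when the activation email went out, None if never


def is_good(journey: SignupJourney, deadline_s: float = 60.0) -> bool:
    """Black box check: only input (request) and output (email in time) matter.

    A stuck queue, a database issue, or a network problem all look the same
    here: the email is late or missing, so the journey counts as bad.
    """
    if journey.email_sent_at is None:
        return False
    return journey.email_sent_at - journey.requested_at <= deadline_s


def black_box_sli(journeys: Sequence[SignupJourney], deadline_s: float = 60.0) -> float:
    """Fraction of journeys that met the success criterion."""
    if not journeys:
        raise ValueError("need at least one journey to compute an SLI")
    good = sum(1 for j in journeys if is_good(j, deadline_s))
    return good / len(journeys)


def burn_rate(sli: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the error budget is being spent exactly as fast as the SLO
    permits; above 1.0 means it is being spent faster.
    """
    allowed_error = 1.0 - slo_target
    if allowed_error <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - sli) / allowed_error


# Example data: 98 timely emails, one late email, one that never arrived.
journeys = [SignupJourney(0.0, 12.0) for _ in range(98)]
journeys.append(SignupJourney(0.0, 90.0))   # slow path, e.g. a long queue
journeys.append(SignupJourney(0.0, None))   # e.g. stuck queue or backend outage

sli = black_box_sli(journeys)
rate = burn_rate(sli, slo_target=0.99)      # 0.99 is an assumed target
print(f"SLI: {sli:.2%}, burn rate: {rate:.1f}")
```

[Note that nothing in the sketch looks at the queue itself. Queue length could still be tracked as a white box metric for debugging, but the SLI only records whether the user got what they asked for in time.]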