Hello, everyone. Welcome to Cloud Native Live, where we dive into the code behind Cloud Native. I'm Annie Talvasto, I'm a CNCF ambassador, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with Cloud Native technologies and the stories behind them. They will build things, they will break things, and they will answer all of your questions, so you can join us every Wednesday, or, like this, sometimes on other days as well. This week we have an amazing set of presenters here to talk with us about designing and operating reliable Cloud services, a view from the trenches. So, a very exciting special program today. As always, this is an official live stream of the CNCF, and as such it is subject to the CNCF Code of Conduct. So please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all of your fellow participants as well as presenters. With that done, I'll hand it over to our amazing presenters to introduce themselves. Who wants to go first?

I'll go first. So my name is Anurag Gupta. I'm the founder and CEO of shoreline.io and one of the founders of reliability.org. Both relate to my lifelong interest in building highly reliable systems.

My name is Niall Murphy. I am CEO and founder of a teeny tiny startup in the SRE ML space called Stanza Systems. If you know my name, though, it is probably because of the site reliability engineering book, the SRE book, and/or possibly the ML Ops book, both of which are explorations of what it means to build reliable systems. Stephen?

I am Stephen Townsend. I am currently a developer advocate for a company called SquaredUp, who do unified dashboarding and visibility. I did performance engineering for many, many years, moved into SRE more recently, and I have a podcast called Slight Reliability where I share my learning journey in the reliability space.

Perfect. An amazing group of people here with us today. So today we have a great panel format where I will ask questions and then we're going to get amazing answers. But as always, you in the audience can still ask your questions as we go along, and we'll probably have some time for Q&A at the end as well, but do ask them as we go. So, first question to everyone here: what does good or great look like when discussing reliability?

So we only have an hour, right? Yes. Yeah. So just off the top of my head, I don't think there's a context-insensitive answer to this. Like, I don't think there is one answer which just fits everything, and I expect most people in the industry wouldn't be surprised to hear that. I will say there's a couple of fundamental questions you have to be addressing, or addressing to a sufficient level, in order for this question to kind of mean anything to you really. The first one of course being: how much does the org care about reliability? Like, if it doesn't care at all, for whatever reason, and there might be arbitrary reasons, then good or great doesn't really matter. I would say, though, that making sure the basics are done well is always going to be a foundational thing you have to have before figuring out what good or great looks like for you even makes sense. And that's stuff like: are you measuring what it is that you're doing? Have you decided organizationally how reliable you should be?
Like, if you haven't decided that, and you're just getting, I suppose, whatever the native platforms will deliver to you, then you have a huge decision to make about precisely what to improve where. And are you resourcing it sufficiently? Do you have the right number of people, the right kind of people, the right resources, et cetera, et cetera. There's a lot of folks who are even struggling with the basics before we come to the question of good or great. I will say, though, that if you do have the right measurement infrastructure in place, if you have defined what your levels of reliability should be, and if you are sufficiently resourced to meet those levels, well, that looks pretty great from my point of view.

Great. So I'd say great to me means that your system is running at a level basically indistinguishable from a hundred percent. So it's a receding horizon. You can always get better. Now here I'm talking about reliability, not availability. Sometimes people confuse the two. They talk about their fleet-wide availability, but that ignores things like: am I getting errors back from the system? Is the performance out of whack? And I often find that people take a fleet-wide availability goal and then claim they did really well even if, like, one region was down for an hour. That happened to me with some of the services I depended on when I worked at AWS. And your customers just don't care about your fleet-wide availability. That's an internal goal. What they care about is their particular experience. Were they able to check out? Were they able to perform their task? And in the expected time, without undue drama, and ideally without having to code against a bunch of error cases due to deficiencies in your software.

I think the ultimate goal of reliability is to support an organization to meet its objectives, whatever they might be. Because reliability just for the sake of being reliable, I don't think, is necessarily enough. It needs to have a purpose behind it. So I think maybe good reliability is when the org is achieving its business goals, the customers are able to achieve the outcomes they set out to achieve, the ops team hopefully enjoy their work, or are at least not totally stressed out by constant incidents and fires burning all the time, the technology is easy to operate, incidents are manageable, and of course the technology is also reliable and performs well enough to do what it needs to do. So it's maybe a bit fluffy, but that's my take.

Yeah, it's a good fluffy take, no worries there. Perfect. So next up we have the next question, which is: tell us a story about a particular or unusual reliability failure that you have seen. Was it a failure of design, implementation, or operations?

Oh, I'm gonna flow on. So I have a story, this is actually from my performance testing days. There was a system which was sort of on the side, that no one was thinking about. It was a system of record, used to be a mainframe. Over the years it got upgraded, and it got put into a mainframe emulator running on a Windows server. And at one point we needed to take some of the data that system used and put it in a big cloud repository. So this mainframe system, which was emulated, would call out to the cloud to retrieve customers or something like that. It turned out, and I don't think anyone knew this, that the mainframe emulator treated anything outside itself, including its own database, which was a SQL Server, as an external component.
It would make calls externally to the cloud, and there was a performance issue: it would take like 15 seconds to retrieve a customer. The thing is that those external calls were single-threaded, so that when it was making a call to the cloud to retrieve a customer, the entire system would freeze for every single user for 15 seconds. It couldn't even reach its own database to do anything. It was the worst thing I've ever seen. And when I think back on it, yes, it was a failure of design. Probably more a failure of too much technical debt: a system which hadn't been cared for for maybe a decade or more, and just hadn't been thought about, so no one knew about it anymore. So when we're introducing new cloud stuff to the old world and not thinking about and understanding how these things are going to interact, right.

So I spent about eight years at AWS. I saw a lot of failures there. And I'd say the biggest ones were almost a failure of imagination, where the system had an error path to deal with a resource becoming problematic, but it wasn't able to deal with the at-scale case where the failure cascaded. So for example, the week before I joined AWS there was a huge outage in Dublin due to a lightning strike. Because so many different machines went out at exactly the same time, we had this re-mirroring storm, where every machine was trying to re-mirror to some other machine that was trying to re-mirror to it. And there just wasn't enough capacity, and nothing was going through. The reason I call it a failure of imagination is because we can all see and code against the everyday things that happen all the time. The question is really thinking about: what's the largest-scale thing that you're going to go and deal with? Is it a region? Is it an availability zone? Is it a hundred hosts? I don't know, but designing for that is, I think, incredibly important when you're designing reliable systems.

Well, you must have seen a bunch. Again, the margin of this call is too small to contain all of the things. Actually, in a rare example of own-trumpet-blowing, I run a podcast called Getting There with Nora Jones, the CEO and founder of jeli.io, which goes into very much detail about incidents and tries to examine them under the learning-from-incidents (LFI) framework, or a socio-technical systems framework, or a Safety-II lens. There's an emerging movement about how folks actually analyze and respond to incidents, which is making quite a lot of inroads into the SRE community in particular, but also in hospitals and chemical engineering and aviation and a whole bunch of other fields where the standard reaction to incidents can possibly be a little bit more blameful than we might like, and the organization doesn't learn as much as it otherwise would. But in terms of kind of bite-sized stories, like, I've loads. I suppose there's the time that somebody de-referenced a pointer incorrectly and everyone saw the data for customer zero. That was pretty cool. That was not so good, because obviously customer zero did not want to show their data to customers one through N million, but it was relatively simple from a kind of contributing-factor point of view: the code got changed and pushed.
The difficult thing was figuring out what the correct privacy and legal response was, now that you have this data which shouldn't have been accessed by other folks. Then there's the ones where the technical thing is maybe clear, but you're working internally or externally with folks who might have some trust issues with the framework of authority that you're dealing with, and so you have to kind of establish trust in the middle of an active incident. I have one with the public sector in Ireland, where the technical cause was actually relatively clear from day zero and was something to do with traffic routing, but folks on the call weren't in a position to believe that a large organization could make guarantees about how its systems were performing, in a way the smaller organization was really ready to absorb. So it was primarily a social conversation rather than a technical conversation. And then, a little bit like Anurag's story, early in the days of S3 the AWS system used a gossip protocol to tell other machines where the data lived, and so if, for example, a data center lost power and all of the machines came back up at one time, they would all go: hi, I'm machine X, I have chunks one through 70 million; and hi, I'm machine X plus one; and they would all tell each other and completely flood the network. And they would flood the network even worse because there was an assumption that this network, which was actually split between two data centers, had exactly the same bandwidth characteristics on every point-to-point link, which of course is not the case if you're going between two data centers. So that was fun.

Perfect, a good variety of fun challenges and failures there, great to hear. But with these experiences in mind, what design recommendations can you share for delivering services with maximum availability?

I'll start. So my first take is you should automate everything that you can. We've done a decent job over the last 20 to 30 years in improving quality, and I think it's now time to apply that same sort of engineering rigor to reliability as we did for quality: have things go through pipelines, inject failures, see how they're handled, simulate large-scale events to see if the system can heal. So I feel like the more of this that is software, the more it scales, and the more you can manage it like software, as opposed to people and processes, which are intrinsically hard. People change, processes change, and sometimes they're followed and sometimes they're not.

I think maybe the biggest challenge right now is just how distributed our systems are and how many components there are that need to talk to each other. We talk about zero trust for security; I think in a way we need to start thinking about reliability in a similar way. So maybe being careful about the vendors that we choose to connect to and making sure that they're reliable, because at the end of the day you are only as reliable as the dependencies that you depend on. That's a whole other way of thinking, I think. And also realising that there are going to be maybe dozens of components out of your control that you depend on, and building sort of degraded service around that. So if this thing goes down, have a backup plan or provide degraded service. I think that's one thing.
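(A quick illustration of that backup-plan idea: a minimal sketch in Python. The recommendations client and its get_recommendations call are hypothetical names for a non-critical dependency, not anything from the discussion.)

import logging

def fetch_recommendations(product_id, client, timeout_s=0.2):
    # Hypothetical non-critical dependency call. The timeout caps how long
    # this feature is allowed to hold up the page it is embedded in.
    try:
        return client.get_recommendations(product_id, timeout=timeout_s)
    except Exception:
        # Degraded service: the page still renders, just without this
        # widget, so an outage here never blocks the critical path.
        logging.warning("recommendations unavailable; serving degraded page")
        return []

The shape is the point: every non-critical dependency call gets a timeout and a fallback value, so a side feature going down does not take the checkout path down with it.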
And I guess the other thing is of course having really effective observability. It's easy to say the word observability, but being able to pinpoint, to understand and see clearly, what's happening in a really complex distributed system is hard, it's hard to get right. But when you get it right, I think it makes a big difference to improving reliability.

Yeah, again, there's an absolutely gigantic amount of stuff we could say here. It's even a struggle to keep it to 56 minutes. So I'd start off saying there's a bunch of things you could do at different levels, and some of the levels are maybe relatively tactical or code-focused or whatever. One thing that, believe it or not, contributes a lot to reliability problems is: have you checked the return code from your system call or your HTTP request or whatever? What happens if that does not complete, and the information that you were hoping to get does not in fact arrive in the buffer, and you just pretend the buffer is there and has something valid in it, and you proceed to crash merrily? Like, there's all kinds of stuff at that level. Then the level up, where you're going, okay, boxes and lines, how do they communicate with each other, is there some kind of resiliency, you can think about on a design level. There's a, it's not quite a programming language, more a kind of modeling language, that I'm quite fond of called TLA+; Hillel Wayne and a bunch of other folks do popularizations of it. Of course it's a Leslie Lamport production, from the same person who gave you Paxos and other leader-election kind of primitives. But TLA+ allows you to kind of model state transitions in a distributed system and say, okay, if this goes to here and we message this thing back with this other information, can we prove that this is correct, or can we fuzz it enough so that we can look at some cases where potentially reliability might be under threat. And actually Amazon used TLA+ on a bunch of stuff and discovered some 39-step-deep sequence that actually ends up being problematic. A little bit like, I think somebody proved BGP doesn't actually ever converge, a couple of years ago or something like that. Anyway, lots of things you can do.

Perfect. How are we feeling? Do we want to take an audience question here in the mix? Oh, perfect, enthusiasm, that's nice. So we have Lauren George asking: how do small or medium-sized organizations protect themselves from outages from big public cloud providers without breaking the bank? Is it even possible today?

So, it depends, is the silly answer I'll initially give. It really depends on what your app is trying to do, really depends on what your dependencies are. There's actually a fascinating piece of work from, I think, Walmart, who are in no sense a small or medium-sized company, but a fascinating piece of work from Walmart, who do kind of multi-cloud spot pricing for instances and stuff. And that's almost the definition of not breaking the bank, I suppose, or breaking other banks. But the basic deal is: figure out what you're depending on. Figure out if in your application you can use specific cached results or algorithmic fallbacks or hard-coded fallbacks of various kinds. Note all of these may potentially have reliability outcomes that you were not expecting at some future point in time.
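(To sketch that cached-results fallback in Python: the cache layout and fetch callback here are assumptions for illustration. The caveat just raised applies, because the fallback changes behavior: in a real outage you are knowingly serving stale data.)

import time

_cache = {}  # key -> (fetched_at, value); a simple in-process cache

def get_with_fallback(key, fetch, max_stale_s=3600):
    # Try the live dependency first, refreshing the cache on success.
    try:
        value = fetch(key)
        _cache[key] = (time.time(), value)
        return value
    except Exception:
        # Limp along with a cached copy, but only within a staleness
        # bound you have decided in advance is acceptable.
        if key in _cache:
            fetched_at, value = _cache[key]
            if time.time() - fetched_at <= max_stale_s:
                return value
        raise  # no usable fallback; surface the failure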
But if us-east-1 disappears and you are using some subset of stuff, but you can limp along with cached results for a while, like, actually that's kind of a win. A lot of people are pushing multi-cloud at the moment, which there's been a lot of discussion about in the industry. And basically the feeling is it doesn't make a whole lot of sense to go copy-paste your infrastructure from one provider to another. They work hard at making sure that you can't actually duplicate that kind of thing, a little bit like mobile telco billing models, I suppose. But the real deal is: if you are using a specialized thing that's only available in one provider, make sure it's something you can do without for a couple of hours if you need to, like a BigQuery equivalent. But I'll stop talking now.

Yeah, I think there's a lot to unpack in that answer. Let me add on to it a little bit. So I think the question really becomes to think about reliability the same way that we used to think about security, in terms of threats, except now the threats are when you lose availability of your dependencies. And so it's: for how long can you tolerate that? How do you degrade when that happens, rather than fail entirely? Like, if you're using Lambda and Lambda goes down entirely, you're kind of out of luck. If you are using VMs and a VM goes down, you can probably spin up another VM, particularly if you've got a warm pool already out there. So it's really a question of tolerating failures, and you can't just duplicate everything, right? Stephen, what's your view?

Yeah, I don't know, I guess my two cents is that if you're a small to medium organization and you're thinking about going multi-cloud, have a serious think about the risk versus the cost of what you're trying to implement, and the complexity of it. I think it would be a hard case to say that it's actually worth the effort and the cost, and there's other things you can do, like Niall was saying.

Great, and then we had a few more questions, but I think we're gonna go through some of the pre-decided questions first and get back to a few of those. So Andrew, I saw your question about chaos engineering, we're gonna get to it eventually. And Lauren asked about books or articles, and I think that's a great question to maybe wrap up with towards the end, if any of our panelists have great resources for everyone to jump into next. And then also I saw the podcast question, and I'm gonna send the links to everyone in the chat, so no worries there. And then to the next question, which is: how do you identify and resolve potential reliability issues before they become your customer's concern?

You know, at AWS one thing that we all did, even Andy Jassy did, was monitor Twitter, because at least at the time people would go onto Twitter and ask, is S3 down, or something like that. And your whole goal was to make sure that by the time someone was asking that question, you already knew, you already had the event started, you were already working on it. Maybe you hadn't identified the root cause, but you were working on it. And so I think it's a really important question and a really important goal we should all have, because what you're doing as a cloud provider is, you know, taking on the responsibility that your customer would otherwise take on themselves. And so they need to trust you, and they need to trust you to care about them in a way beyond what they would do on their own. So, you know, you just need to be really good at this stuff.
Well, obviously I think great observability is important. And I think one of the keys there is to focus in the beginning. Let's say you're starting from scratch: start with making sure the customer can use the service, rather than getting lost in the myriad of technical metrics and events that you could be tracking and looking at. Because if you can't answer that question, can customers consume the service, then nothing else really matters, in my opinion. I think that SLOs are a good way to potentially do that, but I don't think you have to do SLOs either. That's contentious, but that's just my particular opinion. And the other thing, coming from a performance testing background, is to be testing reliability during delivery, especially for a new product, which you maybe can't just go live with immediately and have load on it, because there are no customers at first. So testing is a great way to shake out, not everything, because real customers in the real world do unexpected and wonderful things, but you can shake out a lot of issues and understand your solutions and systems and services better. So those are, I think, the ways you can identify issues before customers do, but beyond that there's all the other things we can do to mitigate the impact. The way that we deploy, you know, doing things like blue-green deployments or canarying or rapid rollback, things that can reduce risk there. Learning from incidents, not just having incidents occur and not gaining something from them, because incidents are fantastic in terms of learning and growing as an organization. And I guess the last thing is really prioritizing the services that matter the most to your organization and your customers, rather than trying to treat everything equally when you might have some administration API that no one uses or that's not important, you know.

Okay, so I'm not saying you should do this, but there is an improvement loop available to you, a little bit like Anurag was saying, where the answer to what to do for monitoring is: you do nothing, and you wait for the phone to ring. Hello, site's down. Oh, site's down? Okay, thank you very much. And then you figure out what went wrong, and then you turn that into an observability rule, and you just do that a billion times, and eventually everything is covered. Hey, except of course everything is never covered, and now you have a billion things to monitor, and it's not necessarily clear which seven of the billion things actually matter. So that's a question that you can also resolve with this other magic trick, which is: we're pretty kind of big backend people, right? So we think about things in terms of microservices communicating via RPCs, and setting SLOs, and so on and so forth. That is backend language. It's not frontend language. And the interesting thing about the 2023 Catchpoint SRE report is that it says only 35.8% of respondents said that they had client-side monitoring that fed into their observability that they could make decisions about. And I think that's the huge gap, and plugging it can help a lot with observability problems, particularly if you haven't done what's called a CUJ analysis, a critical user journey analysis, for your site or your service, to figure out what actual people actually want to do, and kind of instrument that. So lots to do there.

Great, good suggestion there.
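(Since SLOs and customer-focused monitoring both came up, here is a small worked sketch of the error-budget arithmetic in Python, assuming a hypothetical 99.9% success-rate SLO over 30 days; the numbers are illustrative, not from the discussion.)

SLO = 0.999
ERROR_BUDGET = 1 - SLO  # fraction of requests allowed to fail: 0.1%

def burn_rate(bad, total):
    # How fast the observed error rate spends the budget: 1.0 means
    # exactly on pace to exhaust it over the full 30-day window,
    # anything above 1.0 means burning faster than the SLO allows.
    if total == 0:
        return 0.0
    return (bad / total) / ERROR_BUDGET

# 1% of requests failing burns a 0.1% budget at 10x the sustainable pace:
print(burn_rate(bad=100, total=10_000))  # -> 10.0

One common pattern is to page only when the burn rate is high over both a long and a short window, which keeps alerting tied to the customer-visible SLO instead of a billion individual metrics.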
So then getting to the next topic, which is reliability.org. Why do you see the need for a new reliability.org community?

So the reason I started reliability.org, which is a nonprofit, nothing to do with Shoreline or selling any of our own goods, is that building highly reliable systems is something of a black art, and it's mostly informed by just bitter experience. And that's okay at a hyperscaler, because they've got lots of people with lots of bitter experience and they get better. But it's a problem for the rest of us, right? And there's just no good place that you can go to offer your thoughts or to get advice. Twitter kind of used to be that, but it's less so now. So you kind of want a safe place you can go, without a lot of noise and without a lot of vendors, where you can do this. So I asked other people like Stephen to join, and yeah, it's early days, but I'm actually enjoying the conversation in there.

That's great. Anyone else want to add their points? Yes, I do. There's lots of nuance. I don't know. Okay. All right, come on. I actually wasn't aware of any other sort of reliability communities out there that were not based around a particular technology, maybe, or an open source project. So I hadn't found one before; for me, it was a new thing. Maybe that's just because I wasn't aware of what else is going on. I also think that there's generally been a split between open source communities, who are very active from what I see, and these sort of commercial communities of people built around a technology like AWS or whatever. Right, and so bringing those together, I think, is quite exciting, and getting those different perspectives.

Yeah, I think cross-company collaboration is important as well. And I think we could talk more about that, but I know someone in New Zealand who's an SRE in quite a large, important organization, and he's the only one. He's a lone SRE. So he has no internal community whatsoever, and the only chance for him to get ideas and to bounce ideas around is to go out to the industry. On the other hand, he's living in New Zealand. I mean, it could be worse.

So I hang around a lot in either role-focused or conference-focused or sometimes technology-focused Slacks, which is what it tends to be these days rather than mailing lists or whatever. So I like the idea that there would be a kind of cross-role, cross-company, cross-industry conversation about this that isn't tied to any one particular thing. And that seems to me to be a win.

Great. So now that this great new community has started, how do you see the reliability.org community growing and evolving in the future?

Let me start. Communities are hard. They take nurturing. They take constantly adding useful or thought-provoking content. And it takes creating a safe place where you can offer your opinion, even if there's an expert like Niall hanging out there who might just say, well, in my experience the answer is 12. And then what do you do? Niall's not like that, I shouldn't worry, but you do want people to offer a response.

The answer is 13. Let's just get that right. Yeah, I mean, I don't know. Communities are hard, things are hard. Maybe it'll be wonderful, maybe it won't be, I don't know. But I will say that I'm increasingly anxious about the future of the SRE profession in a world which is, I suppose, growing increasingly unreliable. Is that a good word for it? I suppose. And in a sense, I have this whole thing.
I've talked at SREcon a number of times about weaknesses in the intellectual foundation for justifying the value of reliability, and so on and so forth, right? So I think those questions are unanswered. I think that part of what's happening right now is this idea that actually things can be worse and it's totally fine, or maybe it's not totally fine, but we'll just do it and move on. And I think that's, like, terrible user experience and a terrible abandonment of contracts with the user, right? Like, if nothing else, emotional contracts, even if they're not kind of SLAs. And so I think that kind of stuff needs attention, and generally one of the ways it makes progress is via cross-community conversations.

I also see the potential in the community for mentor-mentee relationships, potentially, something we could build upon. And, as has already been said, just the idea of having a place where you can put out an idea, bounce an idea around with people with a whole wide range of experiences, it's just fantastic, and I can't overstate how important that is.

Great. And then Stephen, as you were one of the first people to join the reliability.org community, what made you jump right in?

Yeah, so I work for a smallish company, around 100 staff, and we're building a new product. We haven't hit that inflection point yet where reliability is the key thing; it's more about building the right product at this particular point in time. So I was really excited to have a place where I could go and sort of keep my finger on the pulse and hear what's happening in the reliability world, because I'm not getting the chance to do the work every single day. So that's a great thing. And I just really like the vendor-neutral nature of the community. As I said before, most of the communities that I've been part of or have heard about have been built around a project or a technology. And this is great. I don't know what's gonna happen, but it's, you know, it's great.

That's amazing. So to everyone here, how can people get involved with the reliability.org community?

You go to reliability.org, the website, you join the Slack, you introduce yourself, and then you start contributing. I mean, it's pretty simple. And the more people we get, the more activity there is. I mean, finally, you know, communities are all about participation.

Great. Good answer, since no one wants to, I guess, add anything. Yeah, perfect. And then now I think we're gonna grab one audience question before we go on to the other questions here. So Andrew asked before: how trendy is chaos engineering, the practice Netflix pioneered a few years back? And he adds, of course, when it was new it wasn't as popular as it is today, but what are your takes on chaos engineering?

So, if you'll forgive me for putting on my kind of copy editor hat, or like parser hat, and going: how trendy is chaos engineering? Like, do you want a scalar answer, like 13? Or are you saying, how important is it that I should know about chaos engineering, or how relevant is it in the industry? I have only my opinion here, like I don't have strong survey data or anything like that. I think one credible group of people who are doing this are verica.io, if you've come across those. They also run the VOID; Courtney Nash runs the VOID database of kind of incident data out of Verica as well. But Verica's main thing is kind of chaos engineering. Chaos engineering is really useful.
Like, it's a really fundamental technique: instead of just waiting for things to arbitrarily break in your production and tidying up after it, you go, okay, I'll break a tiny bit of it all of the time. And if I break the right tiny amount of it in the right place, I'll hopefully learn something that I can make progress on with respect to improving reliability in my production, without actually having a complete wipeout event. So if you think of outages or whatever as kind of a tree of possibilities flowing from some kind of single node, then chaos engineering helps you to kind of depth-first explore a bunch of potential failure modes that you might otherwise only really encounter after they've set off something seriously serious. It has one particular downside, which is, as I understand it, and I've no direct evidence for this, people go: okay, so you're gonna break my production? And they don't like that bit. They go, I would much rather just wait for it to break completely rather than break it a little bit all of the time, because then the moral failure is somehow not directly connected with my actions. Which is not true at all, of course. But there is an issue with having the case for it resonate with certain kinds of audiences. That's all I can tell you.

Would it be fair to say that if your engineers are already fighting fires constantly and there's a lot of technical debt, chaos engineering is probably not a good action to take at that time?

I suppose you could make that argument. It depends, because often chaos engineering is quite complex to set up. Like, you can set up a simple bot that goes around zapping arbitrary VMs every so often, but if the subset of VMs that it zaps aren't performing different functions, you only ever learn the same thing that you were gonna learn anyway, so it's not adding to your stock of knowledge. So in order to be really useful for the organization, the chaos engineering has to be doing something nasty and new to you every time. But if you're getting nasty and new things happening to you all the time anyway, like, that's not much additional value. So why don't we just do the thing that we're doing right now, which is nasty and new, until it stops being so new, at which point we can introduce chaos engineering again.

So my quick thought is that chaos engineering can be done chaotically, right? You know, where you're just doing random things, in the Chaos Monkey kind of case. And I don't personally find that terribly useful. It's kind of fuzzing, compared to thinking very carefully about your testing framework or security framework. But what I do think is incredibly useful is fault injection, where you really think deliberately about what percentage of things you treat as a fault as you call a subsystem, and how well you deal with those things, in terms of retries, in terms of redirects, whatever it is. And that, I think, can be done in a careful, methodical way that you can actually get some use out of, because it's very hard to get useful knowledge out of randomness.

Great. If no additional comments, hopping on to the next topic, which is: what are the top causes of major site outages, and how can people avoid them?

Yeah, so there's some old data from the SRE book suggesting that around 70% of outages flow from change of some kind, like config change or binary change or whatever.
So stop changing and everything will be fine. Oh, hang on. Actually, I'm very sorry, it turns out you can't stop changing. Okay, so what we have to do instead is to change in a particular disciplined fashion, so we don't trigger the unexpected interactions between attribute sets A, B, and C on this thing and attribute sets C, D, and E on this other thing. And that looks like a bunch of relatively well-known best practices which are still not universally done today: CI/CD, testing in production, canaries, fast rollback. I once worked in an organization that wasn't able to roll back, and sometimes wasn't able to roll forward, which was also interesting. So organizing your production such that those things can be done is more or less the first step towards tackling that root cause, for want of a better term, of most of those outages.

This is purely speculation, but I think an increasing number of outages are gonna come from the dependencies that we have, because we've got so many dependencies and they're growing all the time. So I feel like that's gonna be an increasing source of incidents and outages.

You can feel, Stephen, you're a human being. You're allowed to feel. I used to work with a guy and he used to tell me that SRE is DevOps without empathy. That's amazing. Back to bitter experience, I see.

Well, at AWS we used to have a two-hour meeting every week where we'd go through the prior week's outages in important services. And if I think about the collection of, shall we say, greatest hits that were on replay across those weeks, there were always things like database issues, bad deployments, expired certificates, misconfigured network settings. And it's very similar to what Niall was saying. What's common across that set is that there's widespread, severe impact, because they have a large blast radius, and they take time to resolve. So how do you deal with that? Well, one thing is that you plan a deployment rollout so that you control the blast radius. You automate the rollback of changes so that you can minimize the time of failure, because impact is basically an integral: the severity, the duration, and the breadth. If you can reduce any one of those dimensions, I think you make progress. Databases, I mean, we pretty much stopped using databases internally, because, as Charlie Bell used to say, it's like putting a switchblade in your baby's crib, just don't do that. It's complex software that's easy to misuse, and you should stop using it. Just use DynamoDB.

Opinions may vary. I've written a lot of databases over the years, so, something about that. Yeah. Stop using databases would definitely be a message I would not expect to issue from one Anurag Gupta, Esquire. Yeah. I use SQLite at Shoreline, after all of the bitter experience I've had. SQLite, like, SQLite is awesome, actually. The unit test framework for SQLite is particularly awesome. But, like, type safety is for wusses, I think, is the general idea. Yes, sorry, I'll stop there. No, no, no, it's good. It's great. It's good to have some discussion.

So how do the best people out there manage their cloud environments?

When I meet them, I'll let you know. Like, the best people isn't necessarily the right framing for this, right? Because the environment and the resources also matter a lot. And if, quote, the best people don't have the right environment and resources to do the work, they'll go somewhere that does.
So there's a lot of nuance behind that question. But I would say that a lot of the things we actually talked about earlier in this session, with respect to understanding your dependencies, figuring out observability, figuring out critical user journeys, making sure you can roll back, all of those best practices are things that, quote, the best people, unquote, are either doing, or they've got a good excuse for why not. And sometimes it's a question of picking your excuse.

I'll add in the idea of designing for reliability. So, for example, injecting faults in production, which can sound scary. But it's really a question of resilient architectures which can handle it. So for example, I built Amazon Aurora, and it effectively injects a large-scale event every week, because it does a deployment which takes one of the six elements of the quorum out while it does the deployment. But it handles it without any drama, and that's just because it's designed to deal with that failure. And dealing with that failure means you can also deal with AZ failures or disk failures or network failures and blah, blah, blah. So I think that sort of basic design methodology, and I'm not saying you should all go do six-way quorums, but I think that kind of designing for the fragility of the environments in which we operate is important.

Great. So now let's grab another question from the audience. We had Luther asking: my company embeds an SRE team in rotation with different teams, in hopes of working closely with them to improve reliability and monitoring, with occasional in-house workshops. Any downsides to this approach? Any opinions?

I'll take a crack at it, and then Niall can correct me, because at least there'll be something to respond to. So I think the key question in here is about ownership. Who owns the issues, the failures? How do they escalate? I've seen orgs that have put in SRE functions, but everything still flows to engineering to fix, because it's just a bump in the wire. And that's useless, right? And, you know, having someone look over your shoulder and just tell you, okay, you're not doing processes well enough, you're slouching, you're not typing correctly, that's not helping me make things better. What does help me is that they're actually there shoulder to shoulder fixing things, which I think Luther's point kind of touches upon: the notion of embedding together rather than treating it as a cascade, or as a separate retrospective function. Niall, you know all about this.

Yeah, I mean, there's a huge amount of nuance to this, depending on the definition of team, rotation, different team, et cetera, et cetera. All of those could have a huge impact on what the end result ends up being. I will say I am most familiar with the single-individual embedded model rather than team-to-team embedding; like, team-to-team embedding seems weird. That's a partnership model, not an embedding model. I think the weakness of the embedding, the single-individual kind of model, where you would have an SRE go and sit with a team that has a particular reliability challenge or some kind of knowledge deficit or whatever for six months, say. That's a pattern that's very common in Facebook production engineering, last I was aware.
The difficulty with that model is, like, the thing it's good at is responding quickly to particular emergencies or gaps. Yes, these 17 teams have some problem; we will send staff out there and they will fix stuff, and so on. And often it does help. But what tends to happen is, if you do a lot of these kinds of rotations, you've no real team identity, at least not one that lasts longer than the period of the rotation. And that, even though you might not think it's that important, actually turns out to be really important with respect to giving people the idea that they can build a career and have a kind of long-lived contribution to long-running services, et cetera, all of which are kind of necessary subcomponents of promotion, amongst other things. So that's kind of the upsides and downsides of embedding. The other question, I suppose, that kind of hides behind some of that is: why does the team in question feel they can't do this themselves? Like, are they looking for warm bodies? Because if they're looking for warm bodies, that's, generally speaking, not a good sign. It's like, I don't care who they are, just get them cranking out the code now. Well, actually, that's maybe not good. Or are they looking for specific guided expertise on something, in which case that can sometimes be a bit better? Yeah, it depends, he said, jokingly.

And in my last role, we had a different kind of embedding. I guess we were an enablement team, and so we would spend time with one team at a time, helping them uplift, because they asked us to, or because there was a particular need. And I guess it worked quite well, but the challenge was that if the team had competing priorities from an organizational perspective, like, you must deliver this bunch of features by this deadline, that would just completely blow away anything we were trying to do, because they just couldn't listen. They were too busy worrying about all this other stuff they had to do. So if the priority isn't there, then embedding and that sort of enablement is pretty hard.

I found it worked really well at AWS to inject a highly skilled person right when a service was launching, because typically the service team members didn't really have any understanding of how to operate in production or deal with all of the various processes. And so having someone sort of help them train those muscles helped.

Right. So now we can enter the audience Q&A. Not that we haven't taken audience Q&A already here, but if you have any questions in mind, now is the perfect time to ask them, because we have a bit less than 10 minutes left. So type them away. And while we wait for people to type their questions, if they are in fact typing away, let's ask the last question from my side. So, tell us about the best... Oh, we immediately have a question, so let's jump into that and leave my question for a bit later. Have you found that those with certs, such as Azure, AWS, and so forth, are better at thinking through designing and operating reliable services? What's meant by certs here? I think it's cloud certification programs. Oh, certifications. Yep. What's your view?

So I have a definite view of this, which may not be shared by other people. He wanted to warn everyone. Yeah, like you say, Lauren, I have no certs. In 11 years at Google, for example, I never met a single person that had a vendor-related cert. I think that's true. And in general, I'm not saying you should be suspicious of them.
Like, cert-having people are people who said: this thing is important, and I should work towards getting knowledge about this thing. Which is a positive signal, like a hugely positive signal. But that's an abstract way of representing the situation with respect to certs. The concrete way of representing the situation with respect to certs is often that there's a very fixed set of knowledge you have to have, and that fixed set of knowledge might not map onto your problem domain, and it kind of changes quite quickly as well. Like, I know back in the very old days, with the networking certifications that Cisco used to do, CCNA and all that kind of stuff, you would spend 36% of your life learning about the difference between type two and type three LSAs in OSPF. And how relevant is that to you, really? Probably not that relevant. I suppose it helps to distinguish you, formally in some sense, from the larger mass of people who have no experience with these ideas, but it isn't in any way a guarantee that they will successfully wrestle with your particular problems.

Yeah, so I kind of agree with you. I feel like you can kind of have three tiers of people. There's the tier who actually kind of don't know what they're doing, and then there's a tier where they've shown through some sort of certification that they kind of know what they're doing, and then you've got this tier that you really want, where they really know what they're doing because they're working at it, and they're way too busy to get certifications. And so the problem is, how do you distinguish between the top tier and the bottom tier? And, you know, assuming you can do that, I don't think certs matter. If you can't do it, it matters a lot, right? I mean, suddenly everyone's been an AI engineer for the past 30 years, and I kind of doubt it, but, you know, that's how they describe themselves.

Okay, anything to add, Stephen, anything? Okay. And then we have one question from the audience, and I also wanna address: Shubhan, you're asking about your profile as a community member. I think there are good resources online about how to change your profile around, and I think we don't have the time to handle that question here. But I do wanna get back to Lauren's earlier question about any great books or articles that can help guide monitoring and observability of automated services, so that alerts are meaningful and actionable for teams.

Yes, so there is another book, yet another book, called the service level objectives book, the SLO book, primarily written by Alex Hidalgo. It has a chapter in it about SLO monitoring, which happens to be written by somebody called Niall Murphy, who I've never met. And this Niall Murphy wrote about how to do pretty concrete steps with respect to SLO-based monitoring and observability. I've been told it's a really good introduction, so that might be a place to start. There are a lot of other monitoring resources out there. Honeycomb has an observability book, which I think is very good, actually. Yes, excellent, Stephen's holding it up now, and it's written by many of my favorite people. And there's also some sections in the SRE books about monitoring as well. And something from James Turnbull, I think a couple of years ago, called The Art of Monitoring, which I also think is freely available online. Yeah, loads of places. There's an annual conference called o11yfest. You can just go to o11yfest.org, o11y with the ones.
And I think you can still watch them all for free. There's tons of really good content there if you wanna watch those. Monitorama. Monitorama, yeah. The other thing I'd say is that pretty much everyone who's a luminary in the SRE community is sitting on Twitter, and you could just reach out to them with your questions, and they're pretty nice. I follow a lot of people there and they're pretty generous with their time. So yeah, you can do better than reading some dusty book.

Great, so much is out there. I was gonna say that also ChatGPT might help you to write your code. And it might be more available than the time of some of the people on Twitter. But yes, folks are generally available.

Yeah, perfect. I think that's all that we have time for today. But great to see so many amazing questions and answers from everyone here. And thank you, as always, everyone, for joining the latest episode of Cloud Native Live. It was great to have a session about reliability here today, and I also loved the interaction and the questions from the audience. You can always tune in in the coming weeks; we have more great sessions coming up. Thanks for joining us today, and see you all in the coming weeks.