Welcome to IstioCon Virtual. Today we're going to talk about how, starting with application archetypes, you can build reliable systems. Don't worry, we'll get into all of that shortly. My name is Ameer Abbas. I'm a product manager at Google. I work on service mesh, and I'm also part of the Istio Steering Committee. Glad to be talking at IstioCon again. Steve?

Steve McGhee. I'm a reliability advocate at Google. I was an SRE for about a decade inside of Google, and then I left and became a customer. I learned how to build things on clouds, and I came back to Google to help more people do that, focusing on reliability. Thank you.

So let's start off with a question. Try to answer this in your head: is it possible to build a four-nines service, something that has 99.99% availability, on top of three-nines infrastructure? Can you build something that is more reliable on top of things that are less reliable? I'll put it out there that this is one of my favorite questions, because whenever I ask it, the room basically splits in half: yeses and noes. So it's not a straightforward question, and it's not an obvious answer. I'll tell you that the answer actually is yes, and I hope by the end of this talk you'll agree with me.

So let's get into it. The way I think about this, and the reason people have different opinions about it, is this thing we call the pyramids of reliability. And here they are. The traditional model of building reliable systems is just like houses and buildings and architecture, in this case a pyramid: the thing on top has to depend on the things below it. So your application has to depend on the infrastructure below it, and that means the infrastructure needs to have, essentially, more nines than the application on top.

In modern systems on cloud, it actually flips around. That's because we're taking advantage of something called distributed systems, which allows the system to degrade gracefully. It can have partial failure and still present a completely accessible face to its users; it appears to be fully intact when it might be partially degraded. The important thing here is that if you're building your app as if it's on the left, but you're building it on cloud, you're already on the right side, and you're going to have trouble matching the old app to the new cloud system. We see that time and again: someone takes the top of the little pyramid, tries to stick it on top of the upside-down pyramid, and the whole thing falls over. People don't really get what they hoped for out of the system.

Here's another kind of x-ray into that same thing. In the traditional model, you have your data center, which never goes down. You have infrastructure, network, storage, storage arrays, all these things, and you know the availability of all of them, because you're trying to hit some application availability at the top. In the cloud, you write your own application on top of a platform that you manage (we'll talk about that in a minute), and you're building all of that on top of IaaS. A way to think about this: if you're building a platform and an application that run on top of VMs, VMs at most clouds have two and a half nines. So if you want more than two and a half nines in your system, you have to be able to make the availability go up as you go up the pyramid.
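Before we go on, here's the arithmetic behind that "yes," as a quick preview of math we'll come back to later in the talk. This is a minimal sketch with illustrative numbers: if either of two independent three-nines replicas can serve, then both have to be down at once for a user to notice.

```python
# Two independent 99.9% ("three nines") replicas, either of which can serve:
both_down = (1 - 0.999) ** 2   # ~1e-06: both fail at the same time
print(1 - both_down)           # 0.999999 -> six nines, built from three-nines parts
```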
So the way to think about this is: in the traditional model, you inherit the availability, the reliability, of your base. In the cloud model on the right, you need to do work to improve on the availability of your base, and you can do that. It is possible. The way we do it is through something that we call archetypes, and this is going to help you think it through.

First off, what is an archetype? What do we mean by that word? It's really just an abstract model, specifically around reliability, through a reliability lens. We want to think about: what types of replication are we going to use, at what rates, and in what places? What level of redundancy do we care about in our storage? What are our RTO and RPO? What does our DR scenario look like? How are we going to handle failure? And of course, cost as well. These are all important aspects of, and ways to think about, our system. The important thing is that we're not talking about products. Not yet.

The next step, architecture, is when products come in. If you think about the archetype first, you've actually whittled your requirements down so that only a few products will fit your needs, which makes it a little easier to choose your products in this next step. This is when we choose things like Kubernetes and mesh, which CI/CD we're going to use, and what types of storage and backup we're going to use. And they all have to match the archetype we already chose.

Next up, we write some code. Now we actually build our app and our services, and then we deploy that code to different parts of the world; we might be in different regions for different types of customers. The important part here is that this service is always being changed, because we're always innovating and pushing our code, potentially several times a day. Even the footprint changes, as we break into new countries and regions and add new data centers, things like that. So this is always in flux.

The last one is how we're measuring our system: how is our system actually doing, every minute of every day? And this changes all the time. This is not just when we're pushing changes to our system; I like to think of it as data center weather. Stuff happens in production, so you have to be prepared to understand it, or at least be able to measure it, and that's where SLOs come in. Think of this as the tightest loop: it's changing every minute. The app and service, maybe we're making deployments a few times a day. The architecture we may change every once in a while, and the archetype itself is pretty stable. Because of this, you also have to remember: if we change the archetype, it has a trickle-down effect on the entire system. So we want to make sure the archetype we choose is going to be pretty stable, because if we have to change it, it's going to take a lot more work. We want to get these right up front.
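One way to keep those dimensions straight is to imagine writing an archetype down as a small record. This is only a hedged sketch of the reliability lens described above; the field names and values are invented for illustration, not taken from the archetypes paper.

```python
from dataclasses import dataclass

# Hypothetical way to write down an archetype's reliability attributes.
@dataclass
class Archetype:
    name: str                       # e.g. "active-passive zone"
    survives_zone_failure: bool
    survives_region_failure: bool
    rto_minutes: int                # recovery time objective
    rpo_minutes: int                # recovery point objective
    fail_ops: list[str]             # manual steps required on failure
    relative_cost: str              # e.g. "$", "$$", "$$$"

# Illustrative values only:
ACTIVE_PASSIVE_ZONE = Archetype(
    name="active-passive zone",
    survives_zone_failure=True,
    survives_region_failure=False,
    rto_minutes=30, rpo_minutes=5,
    fail_ops=["repoint load balancer", "promote read replica"],
    relative_cost="$",
)
```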
So let's look at this more visually. Imagine that you're building a product. I'm a product manager, so I start with the product. Your company can have multiple products; that's the outside box. Let's say in this case we're selling some sort of fashion lifestyle brand. A product can have multiple applications. In this case, maybe we have an online store where you can buy these clothes, a marketing site, web analytics, or other applications within that.

Within an application, you may have multiple services. This is maybe starting to look a little more familiar; these are kind of like Kubernetes services. For example, the online store application may have a front end, where users go to buy stuff, and a product catalog, and a checkout, email, and so on and so forth, right? So these are your services. And within your services are your workloads: the actual functions and tasks that are running to make that service happen.

These products and applications and services are running on top of a platform. The simplest definition of a platform is a set of capabilities that you abstract away from your developers, so they can just do their job: build products and applications and services. Notice that we didn't use any product names here. These are capabilities that your platform provides, for example container orchestration, or virtualization, or serverless, or CI/CD. Or you may not even have CI/CD, right? You may have developers open a JIRA ticket, and then you go ahead and provision Kubernetes clusters or VMs and networks and storage, and then you provide access to those, which I think is a perfectly good... Yeah, people start with that all the time. We call that "paperwork as a platform." Anyway, as a platform you abstract all this complexity away from the developers so they can focus on their job. You can also think of your platform as a product, right? Internally at Google, our platform is called Borg. You can name your platform too; in this case we're calling it Cylon, for any Battlestar Galactica fans out there.

So this archetype idea is kind of a new concept. Where does it fit in? The way to think about it is that archetypes are different capabilities that your platform provides to your services. For example, let's just assume that this platform supports these three application archetypes: global, active-passive region, and active-passive zone. Don't worry about it; we'll get into those later. But now you can map your services to these archetypes. For example, the front end of your online store needs to run globally, needs to run everywhere, because you want the store to be accessible worldwide; you can run that as the global archetype. But your product catalog might be geographically significant: region A may have different products than region B. So you may want to run it regionally, and you want some reliability, so you run it as active-passive region.

So now you have these two products. You have your platform, your internal product; in some cases people even call it an internal developer platform. And then you have your actual external-facing product; we'll just call it the application, so we have two different names here.

So where does reliability come in? You can think of this dotted red line as your reliability line. Anything above it is healthy; anything below it is not. We measure that through SLOs, and we'll get into what SLOs are a little later, but we have collectively decided that this is the line we want to stay above. You can almost think of it as the surface of the water, and these are containers floating on it. If they're above it, they're healthy.
If they're below it, they're not. Containers, I get it. Yeah, we should have put sails on them, at the SLO line. Notice that we did not draw this line right below the top. We don't want this to be 100%; there's always going to be something broken, and that's just the nature of distributed systems. We're okay with that. As a matter of fact, we need that; this is where innovation happens. We sometimes call this the error budget, and so on.

There's one other concept I want to talk about: risks. There are two types of risks, known and unknown, and there will always be those two types; the more you know, the more you know you don't know. Think of risks as forces that are constantly trying to sink your products. Both your platform and your applications are breaking. And then you put resilience, these countermeasures, in place to mitigate those risks, to defend against them; think of these as forces trying to keep things afloat. And that's okay. As long as the SLOs are there, you can monitor it. In some cases the risks push your platform and your applications further down and you're breaking your SLOs; in other cases your resilience is doing well and the countermeasures you've put in place are holding.

You're also measuring SLOs and reliability separately for each product. In this case, let's say your platform has its own SLO and your application has its own. Your platform might actually be more underwater, maybe because you're migrating from VMs to Kubernetes; as long as your applications are healthy, you're good to go. So again, this is the contract. Yeah. The application might be the thing that makes your company money, and the platform doesn't have to be up quite as much, but you still don't want it to be down too much. We'll get into SLOs later. But first, Steve, why don't you explain some archetypes?

Sure. So this all comes from a paper written by two Googlers, Anna Berenberg and Brad Calder; you can see the link up on the top left. This is one example of one of the archetypes, and we'll explain it in just a minute. But let's start with a simpler one first. This is one you've definitely seen before: it's just the hot-cold architecture everyone knows. The idea is that if we have some thing and we want to be more reliable, let's just have two of those things, and when one breaks, use the other one. That's as complicated as it gets. This is a service-agnostic, product-agnostic description of that way of working.

In this case, we have a bunch of descriptors on the left. The ones I want to point out are under survivability: here we're saying we're going to survive zone failure, but we're not going to survive region failure. When I say zone failure, it might be that the entire zone somehow disappears or is unavailable, but it can also mean that just one service required in our stack fails in that zone. For any reason the zone is unavailable for the service, we may choose to fail over to the other zone. Also, we want to point out that we're not talking about two instances of a service within a single zone; we're always spreading our services into different failure domains. In Google Cloud, for this specific example, these are called zones; you'll also see them called availability zones or other terms like that.
The other term I want to point out on the left is something we call fail ops, failover operations. This is kind of a new term we came up with: when we have one of these failures, at the zonal or the regional level, what is it that we have to do? Do we have to take some action in order to perform the failover? How do we recover from this particular type of failure? In this case, we're going to do two things: we're going to change the load balancer to point to the failover zone, and, depending on the technology, we may have to promote a read replica to become the primary database, something like that.

The cost is worth considering here as well. The complexity of this one is pretty low; we see it all the time. Refactoring is sometimes none: often you can take an existing application and just do this to it. That's why it's actually great for things like off-the-shelf software, or any application with a licensing issue where you really only want to run one copy at a time, something like that.

All right. The next one we call 3.1, and these numbers come from the paper, by the way; there are more archetypes than just the ones we're showing you here today. This one is like hot-hot-hot: in one region, we have three zones, and each of the three zones is usable. They're all hot. In this case, we're still only surviving zone failure, and we're still not surviving regional failure. The fail ops are a little simpler. You may have to do a database failover, again depending on the database you're using, but if you lose a zone, it may be that you just keep working; you don't really have to do anything. The cost is actually a bit better here. The complexity has risen: you can imagine that if you had an in-app-server cache, that might not be possible anymore, because you're hot-hot-hot, so you may have to push that down into the database, for example. So there might be some refactoring needed. We see this a lot with web services, so we recommend this model for basic web services.

For the next one, we're basically taking those last two models and squishing them together. We take that hot-hot-hot model within a region, we make two of those, and we allow the ability to fail over from one region to the other. So what's going on here? What's different? In this case, we're surviving zone failures and now also regional failures; we're expanding our resilience. The fail operations, the fail ops, are a little different here: we're using DNS to point between the two regions, and we might still have to do some sort of database failover process, especially since we're crossing regions; there might be more work involved. The cost is obviously different, and this is more complex; there are more things to tinker with. This is what we recommend for a high-availability web application. We see this all the time.
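To make fail ops a bit more concrete, here's a hedged sketch of what an active-passive failover might look like if you scripted it as a runbook. Every helper here (check_health, promote_read_replica, repoint_load_balancer) is a made-up stand-in, not a real API; the actual steps depend entirely on your load balancer and database products.

```python
# Hypothetical fail-ops runbook for an active-passive archetype (sketch only).
HEALTHY_ZONES = {"zone-a": False, "zone-b": True}  # pretend zone-a just failed

def check_health(zone: str) -> bool:
    # In reality: probe your serving stack, or consult your monitoring.
    return HEALTHY_ZONES[zone]

def promote_read_replica(zone: str) -> None:
    # In reality: a product-specific promotion of the standby database.
    print(f"promoting replica in {zone} to primary")

def repoint_load_balancer(zone: str) -> None:
    # In reality: update LB backends, or a DNS record for regional failover.
    print(f"pointing load balancer at {zone}")

def fail_over(active: str, passive: str) -> str:
    if check_health(active):
        return active                  # nothing to do
    promote_read_replica(passive)      # give writes somewhere to go first
    repoint_load_balancer(passive)     # then move the traffic
    return passive

print("serving from:", fail_over("zone-a", "zone-b"))
```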
The next one is actually very similar to this one, so keep an eye on the diagram and see what changes. All we're doing here is changing the structure at the top, from one single entry point to two. What we're saying is that there are two copies of the application, they're both live, and the client gets to choose where to go. You might have to do this in a regulated industry where, for example, one client is on the left side and another client is on the right side. The reason this is a good model is that you want to make sure that if one region is affected, the other region is not. That's why these are called isolated regions: one is isolated from the other. You can set this up so that customers or clients can be moved from an affected region to the surviving one as well, but that's up to you. The cost, the complexity, and all that doesn't change too much beyond the previous model. So this is for HA services, but generally for more regulated HA services that have this requirement in place.

The last one is the global model. I call it the Google model, because many services inside of Google are built this way. In this case, you're deploying your application to all three regions and all the zones inside those regions, and we're surviving both zone and region failure. The important part, I think, is that the fail ops are now nothing: it just keeps working. If a zone goes down, no big deal; if a region goes down, also no big deal. You do have to make sure you have the capacity in place. We have something called N+M capacity modeling to make sure we can survive a certain number of zone failures and a certain number of region failures without running out of capacity. The complexity of this is higher. It's pretty unlikely that you're going to refactor something from the very first 2.1 model into this directly; these tend to be the more cloud-native type of systems. Mostly this is for global consumer services: if you have something that's going to be used by billions of people, this is the model you should choose.

So how are we going to use these different models? Remember, there are services and applications; I'll get to that in just a minute. A service that you build, that you've composed into something like a microservice, can be deployed to one single archetype. It can't live in two different archetypes at the same time; you make a choice. But the application you're building is composed of many services, so it can take advantage of many different archetypes and the advantages of each one, whether that's availability, or maybe price, or complexity. Some services you may want to invest in heavily and use the global model; for some services you may be able to get away with a much cheaper, easier model.

So why would you do this within a single application? Why would you mix them up? The reason is that we want graceful degradation. We want to allow some parts of our system to fail occasionally while the system as a whole continues to work. We don't necessarily want to invest in all components of our system in the most expensive way possible, because we know that for some parts of the system it's okay if they fail, or if they're a little slow; users won't really notice. We want to take advantage of that. This is part of the beauty of distributed systems: you can elect for some things to be less reliable than other things, and that's where we get a lot of the scaling and cost considerations in our favor.
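As an aside on the N+M capacity modeling mentioned with the global model, the core arithmetic is small enough to sketch. The idea, with purely illustrative numbers rather than any Google formula, is that if you need to survive M failure-domain outages out of N domains, the survivors have to absorb the full peak load.

```python
# Minimal N+M capacity sketch: provision so that losing M of N domains
# (zones or regions) still leaves enough headroom for peak load.
def per_domain_capacity(peak_load: float, n_domains: int, m_failures: int) -> float:
    survivors = n_domains - m_failures
    assert survivors > 0, "cannot survive that many failures"
    return peak_load / survivors

# Serve 900 QPS of peak across 3 zones while surviving 1 zone outage (N+1):
print(per_domain_capacity(900, 3, 1))  # 450.0 QPS per zone, i.e. 50% above the
                                       # 300 QPS you'd provision with no failures
```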
Ameer is going to tell us about how we measured this now. So while the paper sort of ends at the archetypes, we wanted to take it one step further: we wanted to use distributed technologies like Kubernetes and a service mesh like Istio and see if we could actually build these. And while building them, we also wanted to check, in parallel, whether we were actually getting the reliability gains out of it.

I did a talk last year that goes into much more detail on what this is; I'll give you the 30-second, two-bit version here. SLI stands for service level indicator. It is a quantitative way to measure the health of any system through some metric, like latency, throughput, or availability. SLO stands for service level objective, which takes an SLI and puts a goal around it. For example: 99% of all GET calls should complete in less than 100 milliseconds in a given day. That's the thing that we can collectively, as a team, as an organization, as a development team and a platform team, agree on as the health of the system. So we're going to use that. SLIs are constantly changing; that's the blue squiggly line in this diagram. The SLO is the surface of the water; that's your goal. Sometimes your SLIs are healthier, sometimes they're not. That's one way of looking at it.

There are two ways to measure SLOs. First, we talk about aggregate SLOs. Imagine a system that looks like this: a client talks to service A, service A needs to talk to service B, which talks to service C, which talks to service D. This is a typical microservices chain. All of these services are highly dependent on each other: for a request to succeed, every single service must be up for the life of that request. The numbers underneath are just made-up numbers, but imagine these are SLOs, maybe availability SLOs for the individual services. To calculate the aggregate SLO of this system, we simply multiply the SLOs together. Intuitively, we might think that since all of them are three nines, the aggregate SLO of the system is also three nines. That's not the case: multiply them together, and the SLO actually goes down, a little bit each time. You can see that we're now at 99.6%, which, once you think about it a little more, kind of makes sense: the chance of one of these services being down is a little higher than if this were just one service. Suffice it to say, we call this the bad math. The numbers aren't that important; the direction it's headed is more important. When things are connected in series, the overall SLO drops.

Another way of thinking about it is like this. Now we have a single service, service A, with four instances running, and your client can access any one of them; they all provide the exact same service. Imagine it's going through some sort of load balancer, and these could be four pods, four VMs, it doesn't matter. Same SLO individually. Now the formula is a little different; let me walk through it. You take the probability of failure, 0.001, raise it to the fourth power since there are four instances, and subtract that from one: 1 − (1 − 0.999)⁴. And now you get a whole slew of nines, twelve of them. The world is safe! We should just run four pods of everything. This seems really easy; I don't see what the big deal is. That's just the math, right?
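Here is that "bad math" and "good math" as a couple of lines of code, a minimal sketch that treats every component as statistically independent, which, as we're about to see, is exactly the assumption that breaks.

```python
# "Bad math": services chained in series must all be up at once.
def serial(slos: list[float]) -> float:
    avail = 1.0
    for s in slos:
        avail *= s
    return avail

# "Good math": identical replicas in parallel, any one of which can serve.
def parallel(slo: float, n: int) -> float:
    return 1 - (1 - slo) ** n

print(serial([0.999] * 4))   # ~0.996 -> the chain loses nines
print(parallel(0.999, 4))    # 0.999999999999 -> twelve nines, on paper
```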
Well, this is not how reality really works. That math comes from probability: if you were rolling dice, that's how it would work. But if we're running four pods in one cluster, those four pods are not really the same as four dice rolls, because they're not statistically independent. They share fate. In this case, they share the fate of belonging to the same cluster. They might even run on the same node, in which case it's even worse. But at the very least they're in the same cluster, the same zone, the same region, the same planet, the same universe, right? So if we really wanted twelve nines, we'd have to invest in some sort of multiversal load balancer. If Marvel has taught us anything, it's that the multiverse exists, but we don't have a multiversal load balancer. We don't have that yet. We're working on it. And even then, you'd have to be able to route between universes; I don't think even that works. We're getting off track.

All right, so we'll call that the good math. The idea is to know whether, in an aggregate sense, your distributed system's SLO trend is going up or going down. So let's actually take those five archetypes, and we'll even throw in a bonus one, and build actual architectures with products. In our case we'll use Google products, but note that you could use any products here.

This is archetype 2.1. We didn't even mention it earlier, because it's sort of the least reliable; there's basically no reliability built into it. But it's a very common starting point for a lot of folks getting into the Kubernetes business. Here you have a Kubernetes cluster that you've put in a single zone, within a single region. You may have some database backend, like a SQL server, and then you have a load balancer in front, in this case a regional load balancer. The numbers you see underneath are the SLA numbers that Google publishes for its managed services. SLA stands for service level agreement; it's basically a commercialized version of an SLO. For the sake of the math, we'll just use these numbers as SLOs and run the math.

This example just shows you how the math works. For a request to succeed, the load balancer needs to be up, the GKE cluster needs to be up, and the SQL server needs to be up. So we multiply the three SLOs together, and now you get less than two and a half nines. One observation here is that your aggregate SLO will always be lower than the lowest SLO in the system. In this case, a zonal GKE cluster has two and a half nines, so your overall SLO is lower than that. Another important thing to note is that this is the best-case scenario: this is the best you should hope for. It can always go worse, but this is the best case. Right. And the takeaway from "it's always lower than the lowest one" is that now you know which component to invest in. We're not going to invest in making that four-nines piece into five nines, because it won't have much effect on the system; we're going to focus on the one in the middle and make that one better before we invest in the other two. So this sets the baseline.
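Plugging SLA-style numbers into the serial helper from before reproduces that baseline. The figures here are illustrative stand-ins for whatever your providers actually publish, not quoted SLAs.

```python
# Archetype 2.1 baseline, treating published SLAs as SLOs (best case).
# Illustrative numbers: regional LB ~99.99%, zonal GKE ~99.5%, SQL ~99.95%.
lb, gke_zonal, sql = 0.9999, 0.995, 0.9995
print(lb * gke_zonal * sql)  # ~0.9944 -> below the lowest component (99.5%)
```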
So now let's go into our five archetypes, starting with the active-passive zone. This is what it would look like: two parallel paths within the same region. You have a regional load balancer, you have two zonal GKE clusters (or two Kubernetes clusters running in two separate zones) running the same service, and you have a SQL HA pair, again running in two separate zones. A request goes either through the top path or the bottom path. Going through the top, we have a serial connection, so we do the series math; that's the red math up top. Again, don't worry about the exact numbers. Since we have two parallel paths, we do the parallel math; that's the green math at the bottom. And then overall, you still need the regional load balancer in series with the rest, so that's the final math. Anyway, suffice it to say the math works out, and we end up with almost four nines. Just going from one Kubernetes cluster to two Kubernetes clusters and a SQL HA pair in different zones, within the same region, you gain a nine, almost two nines. So it's worth doing. This is where technologies like service mesh and Istio come into play, by the way, because a service mesh can see where services are running and fail over automatically when services go down.

Let's keep going. The next one is multi-zonal: zone A, zone B, zone C. Not much changed from the last picture: now we have three Kubernetes clusters running in three different zones, same region, everything else exactly the same. So now you have the parallel math happening at the GKE layer, at the bottom, and the parallel math happening for the SQL pair, and we still have the serial math connecting the GKE clusters, the SQL pair, and the regional load balancer. And now you get four nines. We're headed in the right direction. You may want to stop here, and that's fine, but let's keep going.

So now let's say you're growing and you want regional failover as well. That takes us to the next archetype, active-passive region. Here we've basically taken the previous diagram and multiplied it by two. We have a blue region up top, with the same three clusters in three different zones, and a green region at the bottom, with the same three clusters in three different zones. Now we can survive zonal outages or a regional outage. We have two regional load balancers, we still have the SQL HA pair, but now across two different regions, and we have Cloud DNS sending traffic either to the top one or to the bottom one. And now you can see the math trending in the right direction. Again, don't focus on the exact number of nines; what matters is that as you go up these archetypes, the SLOs get better.

The next one was isolated regions, for regulated services. The picture stays almost identical: we still have the two regions, blue and green. We changed the database to Spanner, a truly distributed database; the idea is that your database journey is seeing improvements, just like your Kubernetes journey is. And now Cloud DNS can send the A clients to the blue region and the green clients to the green region. And the SLO doesn't change.
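Before the last archetype, note that the picture math in these diagrams is just the two helpers from earlier, nested. As a hedged sketch with the same illustrative numbers, the multi-zonal architecture comes out around four nines, matching the trend on the slide; real architectures share more fate than this independence assumption admits, so treat it as a best case.

```python
def serial(slos):             # everything in the chain must be up
    out = 1.0
    for s in slos:
        out *= s
    return out

def parallel(slo, n):         # any one of n replicas can serve
    return 1 - (1 - slo) ** n

# Multi-zonal: regional LB in series with 3 zonal GKE clusters in parallel
# and an HA SQL pair in parallel (illustrative SLA-style numbers).
lb, gke_zonal, sql = 0.9999, 0.995, 0.9995
print(serial([lb, parallel(gke_zonal, 3), parallel(sql, 2)]))  # ~0.99990
```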
And then the last one, the global archetype, which is basically "run everything everywhere," looks a lot simpler: a global load balancer in front of as many Kubernetes clusters as you want. In this case, we're just showing you a two-region model: region blue up top, region green at the bottom, with a truly global, multi-regional database, say Spanner or Cockroach, something like that. This is where Istio is really, really useful, because this is very hard to do without technologies like service mesh, and that's why we're presenting it.

But one thing you'll notice is that the SLOs in this picture are interesting. Let me go back one slide so we can see. This was the multi-regional architecture; notice the bottom number has some nines in it. Now we go to the global one; notice the number of nines. So what gives, Steve? Yeah, "the number goes up" is normally the idea of these slides, but not in this case. What we're really showing here, and remember, these numbers are just informational, they're not actual product numbers, is that four nines is a lot of nines; that's important to point out. What's going on in this model is that it's much simpler. Remember that the nines at the bottom are the best-case scenario, and it's very easy to shoot yourself in the foot when your system is really complex. If you can minimize the complexity of a system and still have its nines be acceptable, in this case if you're cool with four nines, which is pretty great, and you can minimize the number of knobs you have to turn and the failovers you have to perform, you have a far better chance of actually hitting those four nines than you would otherwise. This is why we recommend this model, and it's why we recommend managed services in general: give yourself fewer knobs, fewer things to go wrong, and you can still maintain a very high level of reliability.

One thing you may be asking yourself: we've given you five or six different architectures; does that mean you have to build five or six architectures? Not necessarily. Hopefully the astute among you noticed that the further along we went, the earlier archetypes started to get consumed, subsumed, into the later ones. Here, for example, we have the global archetype, and Steve actually built this on GKE using Istio and Kubernetes, and we're running three services in three different archetypes, even though it's the same architecture: the six-cluster architecture in two regions. Service A runs in the global archetype, so it runs everywhere, which means five of the six clusters can go down; one region can completely go down, as well as multiple zones, and it keeps serving. Service B runs as a multi-zonal service, so it cannot withstand a regional outage. And service C runs as active-passive zone.

The idea behind this is graceful degradation, as well as cost savings. Service C might be the email service that sends mail when you buy a product. You still want to be able to buy the product, and services A and B let you do that; if your email is a little delayed, that's okay, in exchange for less complexity and some money saved.
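That per-service choice is small enough to sketch as data. A hypothetical mapping like the one below (names invented for illustration) is essentially the contract between the application teams and the platform: one archetype per service, mixed within the application for graceful degradation and cost savings.

```python
# Hypothetical service-to-archetype mapping for the store example above.
SERVICE_ARCHETYPES = {
    "service-a": "global",               # e.g. the storefront: survives region loss
    "service-b": "multi-zonal",          # survives zone loss within its region
    "service-c": "active-passive zone",  # e.g. email: delayed is fine, so go cheap
}
```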
So at the end of the day, if you're responsible for the reliability of a complex service, you're really making a trade-off between that reliability and the effort it takes to deliver it. And when you look at all the products that are out there, all the vendors, all the different ways you could build a system, it just looks like one continuous line, with many choices along it, and it's really hard to know where you are at any given point and what you should invest in next. That's why we developed these five discrete points along the line that we call archetypes. The idea is that from these points you can get a good understanding of where you're starting out: if you build it this way, you should get approximately this many nines, and you should know about how much effort it's going to take. Instead of having to start from scratch every time, hopefully these help you get on your feet and know where you are.

The other thing to consider when you're looking at a chart like this is what you're going to put where. As we described before, for things on the very far left, if it's off-the-shelf software, you might be stuck just doing hot-cold, and maybe that's good enough. But if you know you want to build something global, you're probably going to aim for the top right. It's really important to point out that you don't have to push everything to the top every single time. Think about the curve here as a curve on a hill, like Sisyphus pushing the boulder: you don't need to go to the top every single time, because it's going to take a lot of effort. That's why the curve looks the way it does. It's true. And in fact, you can think of it in terms of money: if you're down at the bottom, it's a little bit of effort to gain a lot of nines, but the further you go along this exponential curve, the more effort, the more euros or dollars or whatever currency, you spend to gain only a little bit of nines. That's an important concept. Another way of thinking about it, as we like to say in SRE: every nine you want, going from three to four to five nines, costs ten times as much as the last one did. It really adds up.

Okay, so let's break it down. When you're designing a reliable system, think about the archetypes first: what is it that you really want to accomplish, per service? Because you're going to take those services and compose them into applications that you want to be able to degrade gracefully. And remember, you should have teams that are resilient: teams that can think about your system, improve it over time, and respond to known and unknown risks. What you're building is a robust platform, a platform that has mitigations built into it, that can handle the known risks, that can handle events like regional or zonal failure. And really, at the end of the day, the reason you're doing this is that you want to deliver a reliable product. That's what your customers really want from you: is the product itself reliable, can I expect it to do what I think it should do? If you can do that, you're in good shape. I hope that helps. Yep. Thank you so much.