Hopefully everybody had a good lunch. His name is John Howard. He's a software engineer extraordinaire. I think you guys have already talked with him. My name is Amir Abbas. I'm a product manager at Google. Today we're going to talk about reliability. I did a talk last year in Amsterdam; this is a variant of that. We'll put a little bit more of an Istio spin on it and talk about some architecture patterns that we're seeing and some best practices. I know it's after lunch and everybody's had food and dessert, but I wanted to start with a thought-provoking question: can you build four-nines services on three-nines infrastructure? What's interesting about this question, and you can take it as rhetorical or not, is that the answer is not important. Like many thought-provoking questions, the answer is: it depends. But what's interesting is that usually when I ask this question, the room is split in half. Some people think, yes, you can. Some people think, no, you can't. Even in a room full of SREs, there's a split. And I think it really just depends upon what school of thought you're coming from. If we look at traditional ways of building software, you have data centers, you have redundant power supplies, you have compute, and then you have hypervisors, containerization, and your applications sitting on top. It looks like a right-side-up pyramid from a reliability perspective: each layer inherits reliability from the layer underneath. A couple of things are happening. First of all, it's a very easy mental model to understand: if the computer is reliable, the thing running on top of that computer is going to be a little bit less reliable, because it inherits the reliability of the computer. But the other thing is that the entire onus of reliability is put on the IT team, and the application teams just don't have to worry about it. In the cloud, it's reversed.
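To make that opening question concrete, here's a small sketch of the two mental models. The numbers are illustrative, not real SLA figures: stacking dependent layers multiplies failure probabilities in, while running independent redundant replicas multiplies them out.

```python
def serial(*avails):
    """A stack of dependent layers: the service inherits every layer's
    failures, so availability is the product (always lower than any layer)."""
    out = 1.0
    for a in avails:
        out *= a
    return out

def parallel(*avails):
    """Independent redundant replicas: the service is down only when
    every replica is down at the same time."""
    down = 1.0
    for a in avails:
        down *= 1.0 - a
    return 1.0 - down

three_nines = 0.999
# Right-side-up pyramid: an app on three-nines infra can only do worse.
print(serial(three_nines, 0.9995))        # ~0.9985, below three nines
# Upside-down pyramid: two independent three-nines replicas in parallel.
print(parallel(three_nines, three_nines)) # ~0.999999, six nines
```

So the answer really does depend on the school of thought: with a serial dependency you can never exceed the infrastructure's nines, but with independent replicas you can.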
In the cloud, it looks like an upside-down pyramid, and you're probably wondering, why did we even do that? The answer is, first, scale, and second, cost. It's very hard to run scalable applications on very reliable infrastructure. But it does two things. First, it flips the mental model: we have to start thinking about architecting applications a little bit differently. It also puts the onus of reliability on the development team, on the application team. This is how all globally scalable consumer apps run, on these distributed models. The key is to pick one or the other. We don't assign valence to this; one is not better than the other, and I'm not here to argue that. Obviously, we're here at KubeCon, so we probably tend more towards the right side of this diagram. But regardless of where you sit, just know which school of thought you're coming from. The biggest mistake people make when doing a lift and shift to the cloud, if they're doing it for the purpose of reliability: without rearchitecting your application, you may not get that reliability. For example, at Google Cloud, our VMs have a two-and-a-half-nines SLA. So a lot of the time, a customer asks me, can I give you more money for much more reliable VMs? And we say, no, you have to start thinking in this particular realm. So we're going to talk about reliability, and you can't really talk about reliability without talking about failure domains. Before we even talk about failure domains, let me back up and talk about the needs of a service. I think every service, fairly uncontroversially, needs these four things. Every service needs to be resilient, both to known and unknown failures: if something goes down, the service should remain alive. Every service needs to be scalable: you want to run it on many, many computers and clusters and zones and regions. Every service needs to be manageable.
You want to be able to get telemetry and SLAs and those types of things. And every service needs some kind of policy framework that you can enforce. When you start thinking about the footprint of this service, say you're just running it in a single cluster, you can always get two or three of these things, but it's very hard to get all four. In this case, let's say I'm running this service, a front end, as just a single deployment, maybe even a single pod, in a single zone, in a single cluster. Yes, it's very manageable, because it's a single pod; I can manage that. It's also probably very easy to secure, and easy to reason about in those terms. But it's not scalable, and it's not resilient. So singletons are not good. How do we adopt these new patterns of distributed systems without losing manageability and without losing policy? That's the problem we're trying to solve. In cloud, at least, we simplify failure domains into just two realms: regions and zones. Zones are akin to data centers, or campuses of co-located data centers. Regions are simply collections of zones that are geographically near one another. Yes, there are other things like Kubernetes clusters, VMs, databases, and so on, but everything resides in one of these two locations. If you think in those terms, and decide these are the things I want to be resilient against, then it's a lot easier. Within those regions and zones, you have services running, depending upon where you're running, in which cloud or data center, and they have their own failure domain. A VM, for example, is zonal: it relies on the zone it lives in. If the VM goes down, it doesn't affect anything else, but a zonal outage can take down a zonal resource.
A Kubernetes cluster could be zonal, or it might span zones. What that means is that within a region, a Kubernetes cluster can withstand one or more zonal outages. And then you even have global services that run in multiple regions, like load balancers or databases and so on. When it comes to Kubernetes clusters, the largest footprint of a single cluster today still remains a region. So as customers start getting into multi-cluster architecture patterns, this is where things start to get a little more challenging. The Kubernetes folks are doing a really good job of making multi-cluster easier, but I think with Istio we've got some really good patterns as well. Yeah, so to architect things reliably, we often need multi-cluster, and there are a lot of good features of Istio that can be leveraged to help here. We're going to go into more concrete examples of how you can use these for different architectures in a bit, but here are some of the main features Istio offers. First, service discovery: Istio can connect multiple clusters, as we saw a bit in some previous talks, and even multiple different networks, and allow communication between all of them just as if it was a single cluster. On top of that, it can also do traffic management. That includes all the standard Istio things like canary rules, header manipulation, routing, you name it. But there are also aspects specifically related to multi-cluster, such as locality-aware rules so we can keep traffic within a zone or within a region, automatic failover between zones, things of that nature. We've also got security and policies. This is not traditionally related to reliability, but if your application is compromised by an attacker, for example, that could hurt your reliability.
And the policies and security that Istio provides also carry over across multiple clusters, which differs a lot from something like NetworkPolicy, which generally only applies to a single cluster. Finally, we have observability. Again, observability is not the thing that gives you reliability, but it's very important to be able to observe your system so you can actually understand whether it is reliable, whether you're seeing errors, and what those errors are and why. Istio can provide a single pane of glass to observe all your applications, even if they're across multiple clusters. On observability, I get asked a lot: I have services running on multiple clusters, even multiple regions, and I just want a global metrics view of how everything is running before I dive into individual clusters. Now I want to introduce this concept of archetypes. We talked about it at last KubeCon as well, but just to give you a model for how to think about running distributed applications, we start with archetypes. An archetype is nothing but a fancy word for an abstract model. It is infrastructure agnostic, and it is also product agnostic. Here you start thinking about things like: what is my replication strategy? What does my DR flow look like? What are my recovery points and times per application? You look at an application and decide what it needs and what model it would fit under. You haven't even chosen a runtime or a database or a CI/CD strategy, any of that. You're just thinking big picture, high level, from a reliability perspective, about what each service needs. From there, you get into architectures. An architecture basically takes that archetype model and starts implementing it with products and services. This is where you say: for runtime, I'm going to use Kubernetes; for service mesh, I'm going to use Istio; and so on for CI/CD and databases.
Those choices match the archetype. Archetypes are universal, so you can switch between runtimes or between services, going from Istio to a different mesh, or from Kubernetes to some sort of serverless technology, and your archetypes remain the same. On top of your architectures are your applications. Applications are constantly evolving. Obviously you're adding features, but they're also evolving in terms of footprint, meaning where the application is deployed. Again, you might start with a singleton, running the application in a single cluster, and then decide it needs to be multi-regional, then that it needs to grow globally, and so on. And when you start managing that application, we manage it using SLOs. This is a quantitative way of measuring the health of every service, a contract with your consumers, or anybody that is using your applications. We've published a paper on this. This is the only link in the entire presentation that you might want to write down: bit.ly/app-archetypes. There are tons of archetypes in there, and it goes into much more detail. From a reliability perspective, it's really variants of zonal, multi-zonal, regional, multi-regional, global, hybrid, and multi-cloud. We're not going to talk about hybrid and multi-cloud, just for the sake of time. We've picked five of these archetypes that we believe the majority of applications will fall into, and we're going to talk a little bit about those. All right, the first one up, skipping over the most basic archetype of a singleton, is active-passive zones. What this looks like is we take the full application stack and replicate it into a separate zone, but we don't send any traffic there. If there's an issue, we do a manual failover action and start sending traffic to our replicas.
So this is generally a manual operation: when an issue is detected, traffic is manually switched over by a human SRE to the other zone, and eventually, when the issue is resolved, they can bring it back. When evaluating this archetype and the others, there are a few areas we want to look at: what are the operations a human has to do when we face errors? What kinds of failures are we resilient to, the zone going down, the region going down? What kind of availability can we get? What is the cost? What's the complexity? In this case, the complexity is fairly low. In the typical flow, we just have a single zone; there's nothing fancy going on. The main complexity risk here is that the passive zone becomes stale, because it's not actively tested at all times. The cost is fairly high, because we need to fully replicate the entire application. It can be improved a bit by having fewer replicas of some of the services. In terms of what failures we're resilient to, we can survive the one active zone going down, but if the entire region goes down, then we'll have issues, because the passive set is in the same region. One thing that's unique about this approach is licensed software: because the passive applications aren't serving any traffic, it's generally possible to reduce cost in some sense, because the passive zone doesn't need to have the license activated until the failover occurs. We have a bit of math here. If you want to go into detail about all the math, you can check out the talk from Amsterdam, last year's. The general trend is that by having two different zones in parallel, we aggregate their availability, and we're able to get more reliability than any single zone. Here and in the following examples, we're showing SLA numbers from GKE and Cloud SQL just to give concrete numbers.
But you could find similar numbers from any other cloud provider, I assume. In terms of implementing this one with Istio: generally you're probably not going to use Istio for this, and the load balancer would be something external to Istio. But if you were using Istio as this load balancer, it has a variety of ways to route traffic like this. One example would be to use locality load balancing to specify which zone the traffic should go to. When there is an incident, that would be changed by the SRE to shift from 100% in zone 1 to 100% in the other zone. Yeah, I was having drinks with an SRE, and I asked, what do you call an action that you do when something fails? He came up with this term "fail-ops". We just put words in front of "ops" and call them something; it basically just means whether you need to do anything or not. This is the second archetype. The first one was active-passive; this one is active-active. Or in this case, active-active-active, depending upon how many copies you want. A little bit of refactoring is needed: this is where you start to separate stateful from stateless. Maybe not licensed applications, maybe not COTS applications, but typical consumer web front ends and web applications work well here. Cost actually goes down a little bit. If you want to do the math: it's a three-zone environment and you want to design for a single-zone failure, so imagine each of the three zones is provisioned to serve 50% of the traffic. When one zone goes down, the remaining zones can still serve 100% of the traffic. The global load balancer automatically fails over. In GKE, it would look very similar to the last one, only in this case you are splitting the traffic from the initial load balancer across all three clusters. The numbers remain the same. And the Istio implementation would look like a very simple DestinationRule with locality load balancing turned on. You don't put any manual failover in there.
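The 50/50/50 sizing math above generalizes to any number of zones. A tiny sketch (the helper is hypothetical, just illustrating the arithmetic):

```python
def per_zone_capacity(total_load, zones, zone_failures=1):
    """Fraction of total load each zone must be provisioned for, so the
    surviving zones can absorb full load when `zone_failures` zones die."""
    survivors = zones - zone_failures
    if survivors < 1:
        raise ValueError("cannot survive that many zone failures")
    return total_load / survivors

# Three active zones, designed for a single-zone failure: each zone
# normally serves ~33% of traffic but is provisioned for 50%, so the
# two survivors together can carry 100%.
print(per_zone_capacity(100, zones=3))  # 50.0
```

This is also why active-active is cheaper than active-passive: total provisioned capacity is 150% of load here, versus 200% for a fully replicated passive copy.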
And Istio automatically determines the closest zone, then moves to the next closest, and so on. In this one, we did want to mention that with Istio there are a couple of ways to do this. In the archetype picture, there's a concept of an internal load balancer between every single layer, because we're in multiple zones, so you need to send traffic to some sort of load balancer that can move it between one zone and another. With Istio, you have both models. With the sidecar model, you don't need that internal load balancer; the sidecar acts as the internal load balancer. So the front end in zone A can talk to all three backends, in A, B, and C, and it will just fail over automatically. Or if you're using ambient mode, it would look very similar to having an internal load balancer, which would be your waypoint proxy. Extending the first and second archetypes, we slap these together and we have active-passive regions. Very similar to the other two, except that instead of failing over to a different zone, we're failing over to an entirely different region that has the same zonal deployment. Typically, the top layer would be something like DNS, not geo-aware yet (that's coming later), implementing the failover policy at the DNS layer, or there could be multiple tiers of load balancers. The benefit of this one is that it can survive an entire region going down, because we fail over to the other region. So again, the math here is quite nice, very high availability, because with two regions it's quite unlikely that both regions go down at the same time, I hope. By the way, just to be clear, these numbers here are the maximum that can really be achieved. That's what I was going to say: there are many ways you can end up with lower reliability.
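As a rough illustration of what Istio's locality load balancing does for you automatically (you would not write this yourself; the endpoint names and zones below are made up):

```python
def pick_endpoints(client_zone, client_region, endpoints):
    """endpoints: list of (name, zone, region, healthy) tuples.
    Prefer healthy endpoints in the same zone, then the same region,
    then anywhere: automatic failover to the next-closest locality."""
    healthy = [e for e in endpoints if e[3]]
    for match in (
        lambda e: e[1] == client_zone,    # same zone first
        lambda e: e[2] == client_region,  # then same region
        lambda e: True,                   # then anywhere
    ):
        tier = [e[0] for e in healthy if match(e)]
        if tier:
            return tier
    return []

endpoints = [
    ("backend-a", "us-east1-a", "us-east1", False),  # local, but unhealthy
    ("backend-b", "us-east1-b", "us-east1", True),
    ("backend-c", "europe-west1-b", "europe-west1", True),
]
# The local zone's endpoint is down, so traffic stays within the region.
print(pick_endpoints("us-east1-a", "us-east1", endpoints))  # ['backend-b']
```

The sidecar (or waypoint, in ambient mode) applies exactly this kind of preference ordering per request, which is why no separate internal load balancer is needed.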
In terms of the Istio implementation here, there's generally not much Istio involvement. Each of these regions would be an entirely separate mesh; they would not know about each other. So in Istio multi-cluster terminology, you probably would not link these clusters together at all. All of the routing into which cluster or which region or which mesh should be used happens at a higher level, outside of Istio. And then once the mesh is chosen, it can go into standard Istio routing, which is more like the previous archetype, where we would use locality load balancing to keep things within the zone as much as possible but fail over when necessary. Yeah, I believe there was a talk this morning that covered a variant of this archetype. This next one is interesting. Whereas in the previous one there's no sort of ingress and egress control and you can just manually fail over (maybe the manual failover is because the data is not fully replicated, and you need to manually decide when to move things over), this one is isolated regions, where you just cannot move serving traffic between regions. It's very similar to the last one, but in this case you can create two separate Istio trust domains, one in each. This is for regulated industries; when I talk to EMEA customers, this comes up quite a bit: I want traffic to be localized in a specific region. In this case, you have a choice of whether you want to archive data in a different environment, but serving always happens from one. And then you have manual control over DNS. So imagine that you're sending your EMEA traffic to EMEA regions and APAC traffic to APAC regions. If something goes down, it's only going to affect half of the traffic. And yes, you can use the regions as backups for each other if you'd like, but that totally depends upon the regulation.
In the Kubernetes and Istio world, it would look exactly the same as the last one, except the Istio trust domains will be separate. Another thing I want to point out: yes, the availability numbers have gone up, but you can also see the complexity goes up, right? There's Cloud DNS or some sort of DNS, then there are regional load balancers, and then there are six clusters to manage, those types of things. So your ops cost goes up, but you do get much higher reliability. In Istio, it would look extremely similar to the last one: you would have two Istio meshes in two different environments that don't know about each other. If you do need any cross-cluster communication, you would just use ingress patterns; otherwise it's identical. All right, the last one up is the global pattern. I think this is the most fun and interesting one. This is how something like YouTube or Google Search, or any other large-scale consumer application that's highly available and highly replicated, runs. There's a lot going on here. At the top of the stack, we have a global anycast load balancer that can accept traffic from anywhere in the world and send it to a load balancer instance that's close to the user, to reduce latency. At the bottom, we have a global database, something like Spanner or CockroachDB or similar, that can be accessed from anywhere in the globe, but generally with some awareness of the region so that it doesn't increase latency too much. In between, we have multiple regions with multiple clusters, and each service is replicated across all these different regions and zones and clusters. And at each hop, from front end to back end to multiple tiers of applications, the traffic can flow between any of these.
This is something that Istio helps facilitate, and generally it will be done in a zone- or location-aware way, such that we're not sending traffic from the US to Europe to Asia back and forth with crazy latency costs, but we're still able to fail over easily if there are failures or some services go down in one zone. The cost here is higher, but exactly how much higher depends on how much you're replicating the data. In this case, the front end is replicated across three regions, six zones, six clusters, so that replication cost is fairly high, but we also get a correspondingly high availability out of it. The complexity is again fairly high, but it comes with a lot of nice properties. In particular, unlike some of the other modes, there are no fail-ops required: when a particular back end goes down, Istio can detect that and automatically fail over. We don't need to page an SRE in the middle of the night and say, hey, go trigger the rule to start sending traffic to the new zone. That's all automatic with Istio. So again, this looks something like this: we'd have six clusters here in parallel, and we can choose at each hop along the way which cluster and which workload to go to, based on the health and latency of each hop. Okay, so as you can see, the complexity goes up, you get more reliability, but the cost also goes up. So then people ask, well, how do I apply these archetypes to my services? The way we think about it is that a service belongs to one archetype. A service, let's say for the sake of this conversation, is a Kubernetes service. An application can have a collection of services, and you can put different services in different archetypes. You're probably asking, why would you want to do that? The reason is obviously to save cost, but you also want to do it to architect for what we call graceful degradation.
Let's say in this example service D is some low-priority service. Maybe it's an email service that just sends emails out, and if it's down, it's not going to affect anybody; you're just going to get delayed emails. The rest of the application can continue to work, and you can save some cost. So when you build an architecture, it would look something like this, with multiple services: service A running as global, maybe that's your front end; service B running as regional, maybe that's some catalog that's regionally significant; and then service D, which is maybe not as important. Yeah, so some of the benefits of this, as mentioned: we don't necessarily need every single service to have the highest level of availability, which increases cost and complexity. Sometimes it's the right choice to run in some of the lower tiers of the archetypes, which are less available but simpler to operate and cost less. And that frees up a lot of capacity for other services. So if we had some fancy new AI or ML workload that we wanted to try out, we've got room in these green clusters, because we didn't make service B and service C more available than we actually required. I just want to say for the record, we waited 23 minutes to say AI. All right, let's sum it up. Thank you. Let's sum this up. Basically, what we're saying is most people start with architectures: they immediately go, oh, we need to put stuff on Kubernetes, and we need Istio, and we need Argo, those types of things. We recommend starting with archetypes, and by the way, you can use archetypes along different dimensions; maybe they're security or regulatory archetypes. We talked about them through the lens of reliability, so look at it from that perspective. Read the paper; I'll show the link here again. Then you can go into the architectures. We think Istio and multi-cluster Kubernetes are a match made in heaven; it works really, really well.
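Going back to the graceful-degradation point: the payoff of mixing archetypes is visible in the availability math, because only hard dependencies multiply into the critical path. A sketch with assumed, illustrative availability numbers:

```python
def critical_path_availability(services):
    """services: dict of name -> (availability, critical).
    Non-critical services (like a delayed-email sender) degrade
    gracefully instead of taking the whole application down."""
    avail = 1.0
    for a, critical in services.values():
        if critical:
            avail *= a
    return avail

services = {
    "A-frontend-global":  (0.9999, True),
    "B-catalog-regional": (0.999,  True),
    "D-email-low-tier":   (0.99,   False),  # failure only delays emails
}
# Service D's lower tier does not drag down the application's nines.
print(critical_path_availability(services))  # ~0.9989
```

If service D were a hard dependency instead, you would have to pay for a higher archetype for it; decoupling it is what lets it run cheap.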
It's been working well for a number of years. Kubernetes is doing a great job, but you can use Istio to accomplish the majority of these patterns, as we talked about. Istio on a single cluster is great, but Istio on multi-cluster is a very, very good experience. Here are the links. You can learn more about multi-cluster on istio.io, or at bit.ly/app-archetypes; if you're listening to this on YouTube later, I'll read that out. Otherwise, thank you so much for coming to our talk. Thanks everyone. One minute. Do we have time for questions? Maybe we have five minutes for questions. All right, we will be here for five minutes if people have questions. I think we've got one over here. Oh, we have questions. This one is about archetype five. I could hear when you shouted, so I'll repeat the question. Because of legal, right? A bunch of laws make us make weird decisions. When you disable the service in one region, or in one of these, you get an imbalance in the other regions as well. So do you have an approach to solve this imbalance? It's complicated to solve now, since it doesn't support advanced load-balancing weights or anything like that. Yeah, in many ways it depends on the application. There's always this trade-off: if I have a small number of local instances and a large number of remote instances, which one should I send to? From a location standpoint it's better to send to the local ones, but I may overwhelm them, and it may be counterproductive, right? So there are a few ways to handle this. Istio has recently changed its default to least-request load balancing, which to some extent helps solve this by being latency aware: a faster service will finish requests faster, so it will have fewer outstanding requests, and traffic will prefer it.
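The least-request behavior just described can be sketched in a few lines; this is the common "power of two choices" variant (a simplified illustration, not Istio's actual implementation, and the endpoint names are made up):

```python
import random

def least_request(active_requests, rng=random):
    """Pick two endpoints at random and send the request to the one
    with fewer outstanding requests, so slow or overloaded endpoints
    naturally attract less traffic."""
    a, b = rng.sample(list(active_requests), 2)
    return a if active_requests[a] <= active_requests[b] else b

# An overloaded pair of local endpoints and a lightly loaded remote one.
active = {"local-1": 80, "local-2": 75, "remote-1": 10}
choice = least_request(active)
# The overloaded endpoint can still be drawn, but it never wins over a
# less-loaded alternative sampled in the same round.
```

Sampling two candidates instead of scanning all endpoints keeps the decision cheap while still steering load away from hot spots.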
Some other tools are things like outlier detection. For example, you could say: if the latency of this service has exceeded this amount, or if it's starting to return errors, it's no longer considered healthy, even if it's passing the traditional Kubernetes probes. That endpoint gets evicted from the pool, which encourages sending traffic to the remote endpoints. Those are the two primary tools for that. Next question: I mostly had a question about, say, you have multiple control planes in the same region, so you're not as worried about regional failures or even zonal failures, but you're more worried about someone doing something to a control plane, and you want to make sure you have redundant control planes, or redundant clusters, even within the same zone. I suppose that's also where you would make your meshes or your networks connect together in a multi-master setup. So even if you're not going multi-region or even multi-zone, you would still be able to employ the last diagram in the series for that. Yeah, that's a very good point. We didn't really talk about running the Istio control plane here, but there are a lot of similar decisions to be made about how many of them you run and how you replicate them: per zone, per region, per cluster. Istio allows you that flexibility, and it probably could be a bit more opinionated with suggestions about which one you should run in which cases. But in many ways, like running your own application, it's about the trade-off between complexity and availability requirements, and which choice is best for you. I did want to add one thing to this. The reason we talked about two failure domains, regions and zones, is that most people aren't even doing that.
The reason is that you can go fairly granular and start building resiliency and redundancy for every single infrastructure component: control planes, API servers, and stuff like that. There's a concept in SRE called generic mitigations. What that means is that if there is a problem with, say, the control plane, you just route the traffic somewhere else at a higher failure domain. Zones and regions fall really well within that: if you just architect for zonal failures and regional failures, everything else gets subsumed within them. Those are great points, but let's say there is a failure within a sub-component within a zone. You simply assume that it's a zonal failure, move over, and then you can fix it and figure out what's going on. I don't think it's worth your ops effort to duplicate every single component. Next question: about stateful applications. Does this apply to stateful applications as well, or only to stateless applications? Yes. We didn't hit on the stateful part as much because, first of all, we had 20 minutes and we wanted to stay closer to Istio. If you read the paper, it talks about exactly which database technologies do what. Certain archetypes are actually architected around the database technology. For example, the active-passive ones exist because we know the replication time might be too long, so you can't just automatically move an application over; you also have to do the replica switchover to make sure that replica becomes the new primary. So yes, the paper talks a lot more about databases and stateful things as well; we just didn't hit on it here. Yeah, the paper is pretty good too; it's very approachable.
It's actually part of the Google Cloud docs, so that link takes you to the architecture center, and yes, feel free to go to it right now. It's very approachable in the sense that it's divided up: I want to build zonal apps, what are some of the archetypes? I want to build regional apps, what are some of the archetypes, and for what types of applications? What's my database strategy? And it doesn't prescribe any products, so you can use the archetypes with any products you like. Yes, I can go back one slide. I think we're out of time, but thank you so much. We'll be in the back for more questions. Thanks everyone.