Hi, everyone. I think it's time to get started. It's 5:25. So welcome, everyone, to this session, Building a Multi-Cluster, Multi-Environment Service Mesh at Airbnb. Our presenters are Weibo He and Stephen Chan. My name is Henrik Blixt, and I'm the moderator. Just a couple of notes before we get started. For questions at the end, I'll be running around with the microphone, so once we get to the Q&A, raise your hand and I'll come over as quickly as I can. If there are questions we don't have time to cover, please take them out into the hallway with the presenters so we can clear the room. And lastly, don't forget to rate the session in the schedule app when we're done.

Quick intros, and then we'll get going. Stephen is passionate about large-scale distributed systems, open source, technical leadership, and engineering excellence. He's currently focused on solving infrastructure scaling challenges at high-growth companies like Airbnb. Weibo works on building performant, scalable, and resilient distributed systems in the cloud. He's currently focused on building the next-generation service mesh at Airbnb. And with that, I'll leave the floor to the presenters.

Thank you. Welcome, everyone. My name is Weibo, and this is Stephen. We're both engineers on the Cloud Foundation team at Airbnb. Today, we are very excited to walk you through our experience of building a multi-cluster, multi-environment service mesh on top of Istio at Airbnb.

Here's the agenda for today. We will start by introducing where we are in our service mesh journey. Then we're going to talk about why we need a multi-cluster service mesh and how we deploy Istio to enable that. After that, we will cover multi-environment support, including multiple tiers, mesh expansion, and external services. And finally, we're going to end with key takeaways.

In the past few years, Airbnb, like many of our peer companies, has transitioned from a monolithic architecture to SOA. At the same time, we also migrated the majority of our workloads from EC2 to Kubernetes. As we underwent such fundamental infra changes, our legacy in-house service mesh no longer met our needs. In 2019, we started our search for a modern service mesh and landed on Istio as the foundation. For more information about this choice, feel free to check out our IstioCon talk from earlier this year, linked on the slide.

Last year, we evaluated different deployment models and deployed Istio in production. We started migrating a small percentage of Airbnb workloads onto Istio, and by the end of last year, we had a few dozen services connected to Istio and a few percent of traffic going through the Istio proxy. This year, after successful adoption on different kinds of workloads and operating Istio in production for a few quarters, we established enough confidence in both Istio itself and our own operational expertise, and we started our full-speed migration to Istio. Currently, we have connected almost all of our 1,000 microservices to Istio and migrated about a third of production traffic onto Istio. Our plan is to fully migrate and sunset our legacy service mesh next year.

As we productionized our next-generation service mesh, we encountered a few requirements. The first requirement is multi-cluster. As Airbnb doubled down on Kubernetes adoption, our Kubernetes usage exploded, and we started running into scalability issues in etcd and the API servers.
So instead of vertically scaling a single Kubernetes cluster, we made the decision to horizontally scale out by distributing workloads across multiple clusters. In practice, we keep each Kubernetes cluster under 1,000 nodes and add more clusters when we need capacity. For more info about the multi-cluster design, please check out our 2019 KubeCon talk linked below. Translating this into concrete requirements for our next-gen service mesh, we needed to scale to tens of Kubernetes clusters with hundreds of microservices, thousands of nodes, and hundreds of thousands of pods.

How do we partition our workloads across clusters? At Airbnb, we treat clusters as pools of compute and memory resources. Workloads are randomly assigned to clusters at workload creation time. As a result, services make cross-cluster network requests all the time. In this example call chain, three cross-cluster network calls were made. This simple example pretty much illustrates what's happening in production: it often takes more than 10 hops, most of which are cross-cluster, to serve a user request.

This cross-cluster communication pattern heavily influences our next decision. Since services communicate across clusters all the time, the added latency and cost of going through a gateway for every hop is too expensive for us. That's why we adopted a flat network model across clusters. We leverage the AWS VPC CNI to assign individually addressable VPC IPs to pods, so pods can make direct connections across clusters.

Using VPC CNI for cross-cluster communication has a few benefits over our in-house service mesh, which utilizes Kubernetes NodePort services. VPC CNI has lower overhead than NodePort, which uses lots of complex iptables rules that don't scale particularly well for large clusters. VPC CNI also does not suffer from the well-known pod colocation problem: when multiple pods of the same service are scheduled onto the same node, they appear as one address and get deduplicated by Envoy, so each pod on that node gets less traffic. And finally, by providing pods with VPC IPs, we can leverage the rich functionality cloud providers have built. Examples include assigning pods custom security groups.

A flat network does come with its own requirements. First, your organization should not be relying solely on the network boundary as the only security measure. At Airbnb, our flat network is segregated into different security zones with security groups. Within the flat network, we follow a zero-trust network model where all workloads are protected by mTLS. Secondly, a flat network requires all your workloads to use non-overlapping IPs. Luckily for us, this has always been the case. We, the infra team, centrally manage our private CIDR ranges and allocated two /10 CIDR blocks for mesh workloads, which fit about 8 million IPv4 IPs and give us enough headroom before we move to IPv6. Finally, AWS has a VPC limit on the maximum number of IPs you can fit into a flat network. Since VPC CNI assigns each pod a unique IP, one node with 16 pods consumes 17 IP mappings, which hugely increased the number of IP mappings in our VPC and pushed us over the limit. After close collaboration with AWS, we now leverage a new VPC feature called prefix delegation. Instead of tracking each pod IP on a node separately, the VPC now assigns a single /28 prefix (which fits 16 IPs, by the way) to a node and allocates pod IPs from that prefix range, which greatly reduces our mapping usage.
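To make the IP-mapping arithmetic above concrete, here is a minimal back-of-the-envelope sketch. The node and pod counts are illustrative assumptions taken from the figures mentioned in the talk, not exact Airbnb numbers.

```python
# Back-of-the-envelope sketch of the VPC IP-mapping math described above.
# All constants are illustrative assumptions, not Airbnb's actual limits.

PODS_PER_NODE = 16          # example density: one node hosting 16 pods
NODES_PER_CLUSTER = 1_000   # clusters are kept under roughly 1,000 nodes


def mappings_per_node_individual(pods_per_node: int) -> int:
    """Without prefix delegation: one mapping per pod IP plus the node's primary IP."""
    return pods_per_node + 1          # 16 pods -> 17 mappings


def mappings_per_node_prefix_delegation() -> int:
    """With prefix delegation: one /28 prefix mapping plus the node's primary IP."""
    return 2


def ips_in_prefix(prefix_len: int) -> int:
    """Number of IPv4 addresses in a CIDR prefix, e.g. /28 -> 16, /10 -> ~4.2M."""
    return 2 ** (32 - prefix_len)


if __name__ == "__main__":
    print(NODES_PER_CLUSTER * mappings_per_node_individual(PODS_PER_NODE))   # 17000
    print(NODES_PER_CLUSTER * mappings_per_node_prefix_delegation())         # 2000
    print(2 * ips_in_prefix(10))   # two /10 blocks -> 8,388,608 (~8 million) IPv4 IPs
```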
Taking all of our multi-cluster requirements into consideration, we adopted an external control plane and flat network model for our Istio deployment, which has the following benefits. First, it leads to a clean separation of roles. We, the service mesh team, serve as both the mesh operator and admin: we manage the upgrade and release of Istio and we provide the service mesh platform. Service owners, on the other hand, are in charge of managing their own services, with tasks like defining the allow list of which services are allowed to talk to their service, and allocating sufficient resources for their Istio proxy. Secondly, an external control plane provides better security. We only need to install the CA cert issued by Airbnb's internal CA system on the control plane cluster, and we can enforce tight access control for that cluster. Thirdly, the external control plane model provides better isolation from data plane workloads. We want to make sure that no data plane workload can affect the health of the control plane. Finally, we perform Istio upgrades quarterly, and it's much easier to operate one control plane per mesh than N deployments across clusters. A single external control plane does impose a strict requirement on the availability and reliability of the control plane; we'll cover in the next section the steps we take to meet that requirement.

Here's our high-level architecture diagram for a single mesh. We deploy Istio in a dedicated external Kubernetes cluster; it is the blue box in this graph. Only control plane components run in this cluster, and only the service mesh team has permission to access it. All Istio custom resources reside in this external cluster, but mesh users do not directly modify these resources. We provide a simple configuration file that mesh users interact with. They check in this configuration file alongside their service code, and during CI, we generate Istio custom resources from that configuration file (a sketch of this generation step is shown below). The generated Istio resources are managed by our deployment system, and all config changes are made via a deploy, which is monitored and can easily be rolled back. The remote workload clusters, the pink boxes in this graph, only need to define a sidecar injection webhook, which delegates injection of the Istio proxy to the control plane cluster. Other than the webhook, workload clusters only need to define an Istio reader role, which gives Istio permission to watch for Kubernetes updates. How do remote clusters discover the control plane? We run a deployment of external-dns on the management cluster and configure it to export IstioD as a headless service registered in Route 53, which is used as the discovery address by the workload clusters.

In addition to multi-cluster requirements, we also have multi-environment requirements for the mesh. The first kind of environment I will cover is multi-tier. We rely on a multi-tier mesh to minimize downtime of the control plane. Airbnb services are generally divided into three tiers: test, staging, and production. When building our next-generation service mesh, we followed a similar concept of tiers to minimize the blast radius of changes. Our first tier of mesh deployment is the sandbox tier. In this tier, we run functional and performance tests to validate new releases of Istio and the Istio proxy. After a new release is verified on the sandbox tier, we deploy it to the test tier. The test mesh has an identical setup to the production mesh but runs in a separate AWS account for maximum isolation.
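Going back to the simple configuration file and CI generation step described in the architecture above, here is a minimal sketch of what that generation could look like. The per-service config keys (name, namespace, dependencies, allowed_callers) and the assumption that a caller's service account name matches its service name are hypothetical; only the Istio Sidecar and AuthorizationPolicy resource shapes are standard Istio APIs. It uses PyYAML for output.

```python
# Hypothetical sketch: turn a simple per-service config into Istio custom
# resources during CI. The config schema is invented for illustration.
import yaml


def generate_istio_resources(cfg: dict) -> list:
    name, ns = cfg["name"], cfg["namespace"]
    deps = cfg.get("dependencies", [])
    callers = cfg.get("allowed_callers", [])

    # Sidecar: scope the xDS config pushed to this service's proxies to its
    # own namespace plus the namespaces of its declared dependencies.
    sidecar = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "Sidecar",
        "metadata": {"name": name, "namespace": ns},
        "spec": {"egress": [{"hosts": ["./*"] + [f"{d}/*" for d in deps]}]},
    }

    # AuthorizationPolicy: the allow list of which services may call this one,
    # expressed as service-account principals (assumes SA name == service name).
    authz = {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": name, "namespace": ns},
        "spec": {
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": [
                f"cluster.local/ns/{c}/sa/{c}" for c in callers
            ]}}]}],
        },
    }
    return [sidecar, authz]


if __name__ == "__main__":
    cfg = {"name": "listing", "namespace": "listing",
           "dependencies": ["pricing", "search"],
           "allowed_callers": ["homes-web"]}
    print(yaml.dump_all(generate_istio_resources(cfg), sort_keys=False))
```

In the real pipeline, the generated resources would then be applied to the external control plane cluster by the deployment system, as described above.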
And finally, we deploy the new Istio release to the production mesh; both the staging and production tiers of services connect to the production mesh.

In the sandbox mesh, we run automated functional tests to verify the mesh features that we depend on in production, like authorization policy and locality-based load balancing. Additionally, we define regression tests for issues we found in the past. For instance, we encountered a bug where Istio would get stuck if a large deployment was rapidly scaled down. We automated the repro of such regressions and make sure the issue stays fixed in future releases. We also run data plane performance tests with our specific configuration of the Istio proxy using Nighthawk.

In the test mesh, we test every single integration point with Airbnb systems and run end-to-end integration tests. For instance, we test that Istio can securely mount the CA certs from our internal CA system, we test that a new Kubernetes version release won't break the service mesh, we test changes to the resource generation logic, and finally, we can test all the custom logic we built for mesh expansion and external services. We have successfully caught regressions in both Istio itself and in related Airbnb systems in the test mesh.

After verifying the safety of the new release on the standalone mesh and the test mesh, we deploy the new Istio to the production mesh with a revision label. The initial deploy is safe, as there are no workloads connected to the new version yet. We gradually increase the scope of services connected to the new version as we build confidence. Note that services only connect to the new version after a deploy. At any time, if there's any regression, service owners can simply roll back their deployment to pick up the old version. The entire rollout process takes about a month. After there are no more proxies connected to the old version of the control plane, we clean up the old version. This rollout process allows us to slowly increase the number of proxies connected to the new version. As a matter of fact, we finished the rollout of Istio 1.10 and the teardown of 1.9 one month ago. This graph shows the gradual increase of proxies connected to 1.10 versus 1.9. We can also compare the new version against the old version side by side. This graph shows the impact of a new feature called discovery selectors, introduced in 1.10, which reduced the CPU usage of IstioD when we excluded some of our noisiest namespaces.

Now I'm going to hand it over to Stephen to talk about the rest of our multi-environment support. Thanks, Weibo. So we've covered how our service mesh spans multiple Kubernetes clusters and multiple environments. Now we're going to deep dive into two non-Kubernetes use cases. First, we're going to talk about how we support workloads running on uncontainerized EC2, which internally we call mesh expansion, and then external services like managed data stores.

Here's some context on why we wanted to support EC2 on Istio. In the long term, we want to run all of our internal services on Kubernetes. However, in the shorter to medium term, we want all the benefits of a single uniform service mesh running across all workloads, and the workloads that are not already running on Kubernetes will take years to either migrate or deprecate; these are either legacy or very stateful workloads. So the solution that we came up with was to support EC2 on our service mesh. The requirements are, first, we want full feature parity with Kubernetes.
So service owners can reuse their knowledge of how to work with the service mesh on both EC2 and Kubernetes. And secondly, we want to allow gradual traffic shifting onto Kubernetes when services are ready to migrate.

So let's go back and revisit Weibo's slide on the Kubernetes architecture. This diagram adds a high-level overview of how EC2 workloads fit in: the orange box on the bottom right shows EC2 workloads as part of the mesh. Now that we've covered the high-level architecture, we'll dive into the mechanics of how we got to feature parity for EC2.

First, we'll talk a little bit about how endpoint registration evolved. In Kubernetes, the endpoints controller automatically handles this by updating Endpoints resources based on pod readiness. Those resources then get watched by IstioD and distributed to clients. When we first decided to adopt Istio, we had to manually tell the Istio control plane which EC2 IPs belong to which services by hand-editing configs. Due to the number of EC2 services and the rate at which instances change, this added an unacceptable amount of operational load. Over the past year, we've worked with the Istio community to define our requirements, and the solution the community built is a feature called auto-registration. The Istio data plane has two processes which run on all service instances, whether they're EC2 or Kubernetes: pilot-agent and Envoy. When Envoy starts up, it connects via xDS to pilot-agent, which proxies the connection. Pilot-agent then connects to the IstioD control plane and provides metadata about which service it belongs to over that xDS connection. IstioD creates a Kubernetes custom resource called WorkloadEntry on behalf of the connecting pilot-agent process, and that WorkloadEntry is what registers the service instance in the mesh. Other IstioD pods will read the WorkloadEntry and distribute that service instance to other clients.

Server-side health checks were also implemented by the Istio community in the past year. The health check process begins in the same way as auto-registration: first, pilot-agent proxies xDS for Envoy. Pilot-agent then performs health checks, similar to how the kubelet does health checks in Kubernetes. Pilot-agent sends health information across the same single xDS connection to an IstioD control plane pod, and IstioD updates the health status of the auto-registered WorkloadEntry. Other IstioD pods will read it and react accordingly, adding the endpoint if the WorkloadEntry transitioned to healthy and removing it if it transitioned to unhealthy.

The TLS story is a little bit more complex. The Istio community plans some automation around getting certificates, but the timeline is unknown, so we built our own automation. There are two tasks needed to bootstrap TLS. First, we need to get the root CA. And second, we need to provide proof to the CA to get a workload certificate with a specific SAN. IstioD starts by writing the root CA as a ConfigMap via the Kubernetes API. We built an in-house tool called MxAgent, which creates a service account token and reads the CA ConfigMap that IstioD wrote by calling the Kubernetes API. MxAgent then writes the CA cert and token to the local file system on the VM instance. When pilot-agent starts up, it reads the CA and token from the file system and then issues a CSR to IstioD. Finally, Envoy gets its certificates from pilot-agent using SDS.

Istio version upgrades are also slightly complex, so let's backtrack and quickly review how Istio versions are chosen in Kubernetes.
So in Kubernetes, we use an internal tool called CRISPR, which runs as an admission controller. It reads a configuration file called rollouts.yaml, which maps Kubernetes namespaces to mutator artifact versions. We wrote a blog post about CRISPR, which is linked at the bottom of the slide. CRISPR downloads the mutator artifact from storage and then patches the pod with an Istio version label. IstioD then gets called as another mutating admission controller, which does the main sidecar injection, and the IstioD version that gets called is determined by the Istio revision label.

CRISPR is designed with the Kubernetes use case in mind, so we couldn't reuse it for EC2. However, we took key lessons from CRISPR and created a very similar interface. For EC2, we built a parallel system. First, there's a CLI named MxRollout, which reads a rollouts file similar to CRISPR's rollouts configuration format. MxRollout then updates an EC2 tag named desired-version on a particular instance. When MxAgent starts up, it reads the desired-version tag value. This tag value maps to an artifact which contains the pilot-agent and Envoy binaries as well as other configuration. MxAgent downloads this artifact, unpacks it, and restarts the pilot-agent service running on the host. Finally, MxAgent updates the value of the running-version tag on the instance. Here's a high-level overview of CRISPR and our Mx updating system side by side; there are several similarities you can see between them.

Okay, let's switch tracks and talk about how we support external services. Most of our external services use custom protocols like MySQL and Redis, so when scoping solutions, we grouped external services and non-HTTP services together. External services can't run Istio data plane components, so we have a slightly different set of requirements compared to supporting EC2. First, non-HTTP support was important: we wanted to allow service owners to define custom Redis pings or MySQL health checks. Secondly, we wanted to use DNS to map external service URLs to a unique Kubernetes cluster IP. External services are registered in Istio using ServiceEntry, but without an IP address field, a ServiceEntry results in an Envoy listener that takes up all IPs on a given port. We're moving off of a service discovery system that relies on port allocation, so we want to avoid this kind of setup wherever possible. And finally, we want to remove stale endpoints, for example, when database replicas are torn down.

Here's how we solved server-side health checks. We built a generic health check sidecar called MHCX. Airbnb internal owners build protocol-specific health checks into an application container and then run MHCX as a sidecar. MHCX is responsible for translating health check results into WorkloadEntry updates, effectively doing the same thing that IstioD was doing with VM health checking. The app health checker main container is responsible for querying external endpoints using the custom protocols and providing an HTTP response to MHCX.

For our DNS cluster IP setup and endpoint removal, there are two parts. First, for the DNS cluster IP setup, under the hood we create both a Kubernetes Service and a ServiceEntry custom resource. We ensure that the Service is created before the ServiceEntry, and when the new ServiceEntry is created, a mutating admission controller patches the ServiceEntry's IP with the cluster IP of the generated Service (a sketch of this patch is shown below).
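As a rough illustration of that admission step, here is a minimal sketch of building the mutating webhook's JSONPatch response. The cluster IP lookup is a hypothetical placeholder; only the ServiceEntry spec.addresses field and the Kubernetes AdmissionReview response format are standard APIs, and a real webhook would also include the HTTP serving plumbing.

```python
# Hypothetical sketch of the mutating-webhook logic that patches a newly
# created ServiceEntry with the cluster IP of its companion Kubernetes Service.
import base64
import json


def lookup_cluster_ip(name: str, namespace: str) -> str:
    """Placeholder: a real webhook would query the Kubernetes API for the
    generated Service and return its .spec.clusterIP."""
    return "172.20.45.10"   # illustrative value only


def build_admission_response(admission_review: dict) -> dict:
    request = admission_review["request"]
    service_entry = request["object"]
    meta = service_entry["metadata"]

    # Pinning spec.addresses to one VIP avoids the 0.0.0.0 Envoy listener
    # that would otherwise claim the whole port for this ServiceEntry.
    patch = [{
        "op": "add",
        "path": "/spec/addresses",
        "value": [lookup_cluster_ip(meta["name"], meta["namespace"])],
    }]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```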
And for stale endpoint removal, we rely on Spinnaker to run a cleanup command, which works similarly to kubectl prune.

Now that we've covered the several environments in which we run Istio, let's cover the key concepts we used to successfully run it. First, we run a globally flat network, but be aware that there are considerations around utilization and scalability with this approach. Secondly, we use a single external management cluster for each service mesh; this is important for secure and simple management of each mesh. Third, we use multiple tiers of pre-production meshes, which allow for different types of testing and ensure we catch as many issues as possible before production. Fourth, Istio support for VMs has matured considerably in the past year; we built additional tooling to reach full feature parity and allow a long-term migration path to Kubernetes. And finally, we have generic mechanisms for supporting non-HTTP and external services, which ensures the service mesh team does not have to implement N different Envoy filters for N different protocols.

All right, finally, Airbnb is hiring, so please check out our website for open positions. And now I think we have a few minutes for questions.

Thanks, it looks like we have about eight minutes left for questions. So raise your hand, I'll try to run around with the microphone, starting up front here.

Hi, that was a great presentation, a lot of things packed in there. One question about having a lot of services in the cluster that Istio has to watch. I know Istio gets a lot of chattiness; how did you manage that? So, Istio has a Sidecar CRD that filters down the listeners and clusters that each proxy gets. As part of the simple configuration file that we ask service owners to provide, we force everyone to define the exact set of other services that they care about. That greatly reduces the number of unnecessary updates to each service. So it's a more controlled environment, but the owner has to know their dependencies? Yes. Doesn't that put a burden on the owner? Totally understood. We tested Istio without using that Sidecar CRD, and performance-wise it's hard to scale to tens of thousands of pods and thousands of services. Got it.

I was more curious, because our environment looks a lot like yours: did you try the multi-cluster mode in Istio and not use service entries? Can you provide a little more information about how that multi-cluster mode works? Yes. In newer versions of Istio you can define different clusters, give them different names, and then Istio talks to the Kubernetes API of each cluster it's trying to access and the service endpoints are advertised everywhere, so you don't have to create service entries; all clusters know about all the endpoints. Oh, I see. So we do have IstioD read from multiple clusters. Service entries we use exclusively for external services, services that we aren't running in house. So yeah, if we go back to the architecture diagram, we are running multi-cluster mode: the centralized IstioD subscribes to the different workload Kubernetes clusters. For Kubernetes workloads, we directly use Kubernetes Services and Endpoints and don't define a ServiceEntry. It's only for non-Kubernetes workloads, like VM workloads and external services, that we define ServiceEntries, because there are no pods. Okay, can I ask one more question? Yes. I'm curious, because again, we ran into this problem.
Do you use the Helm chart, or the Istio operator profile, the CRD? We use the operator to generate basically raw manifests and check them into source control. We feel uncomfortable asking the operator to handle all the upgrades; we'd like to know exactly what's going on with each Istio release. Are you using the tool to generate those? The manifests, yes. Yes.

So the question is, where does the sidecar injector live? The webhook lives on every workload cluster, but the mutator service, IstioD, only runs from the management cluster, the external cluster. So there's a per-workload-cluster webhook that basically points to the central IstioD address.

Since you bring up so many clusters dynamically, do you ever hit any weird race conditions? So, we run verification workloads on each cluster before we mark it as schedulable, basically before we can assign workloads to that cluster. And our rate of cluster creation is more on the order of hours rather than minutes, so we usually go through a verification process. Is that your PD? Yes. Correct, yeah.

One of the most painful things we face when we upgrade Istio versions is the Envoy filters, because the APIs break really easily. It is super painful. How do you manage that at Airbnb? Yeah, we try to keep our Envoy filters fairly portable. We've written two: one that customizes logging slightly and one that customizes metrics, using a native Envoy filter. But I definitely agree that there are some pain points with upgrading, because the internals of Envoy configuration are unstable, so we just try to keep it as simple as possible there. Yeah, we definitely felt that pain point.

Hi, I had a question. You mentioned earlier that you had a bug where Istio was getting stuck on rapid scale-down of deployments. I was wondering if you could provide a little more detail on where Istio was getting stuck and how you fixed that issue? Yeah, for that particular issue, I can send you the GitHub issue offline; it happened a while ago. I think what we observed is that when a large deployment rapidly scaled down, IstioD would basically fail to send out any subsequent updates. More details can be found in the GitHub issue; I don't remember more details right now, sorry. The issue that we ran into was, I think, in 1.5 or 1.6, so several minor releases ago, and we haven't observed it since.

So when you have two clusters, can services in one cluster talk to services in the other cluster without changing anything about the service definition itself? Do the services themselves know which cluster's pods they're talking to, or do you have to specify which cluster you want to talk to? So when services are making cross-cluster network calls, they are unaware of the destination cluster. We define a mesh ID per region, and that's shared across clusters. So from the perspective of a service, it doesn't know where the destination service runs.
As a matter of fact, that service can run in multiple different clusters, and IstioD will fetch the endpoints and combine them, grouping endpoints from the same namespace, so requests can be load balanced across clusters.

So I think we're at time. Thanks, we can continue questions outside in the hallway. Please clear the room so they can close it down. Thanks, everyone, for coming, appreciate it. Don't forget to rate the session afterwards in the app.