Hey, everyone. Thank you so much for being here. My name is Charles, and I work at Snowflake. Today I would love to tell you more about our service mesh adoption story at Snowflake, specifically because the Snowflake product is multi-cloud, which means our infrastructure also has to be multi-cloud. A sizable share of the challenges we face actually stem from this multi-cloud, or cloud-agnostic, requirement.

Before we move on, just a bit more about me. I work on the container platform team at Snowflake; I joined more than two years ago. Before Snowflake I was at Cruise, a self-driving car startup, also doing container work. And before Cruise I was at Google, building the Istio open-source project. So I've been in this domain for a little while.

First, I want to tell you a bit more about Snowflake. We went public two years ago. Snowflake started as a data warehouse doing big data analytics, but nowadays it brands itself as the data cloud: it does data sharing, machine learning, and basically everything related to data. There are four properties that I think stand out about Snowflake in general. First, it's multi-cloud: as of today we support Azure, GCP, and AWS, and maybe we'll add more in the future. It's multi-tenant: Snowflake has more than one customer, and sometimes multiple customers share the same compute resources. It's multi-region: we want our compute resources to be co-located with our customers' workloads. And lastly, it's multi-environment: we support commercial deployments as well as government cloud, where workloads have to comply with ITAR or FedRAMP/FIPS requirements, which means our deployments and infrastructure have to conform to those compliance regimes too. All four of these properties of the Snowflake product also apply to our infrastructure, so that's a source of complexity, for sure.

Specifically about the container platform: as of today we run 170 Kubernetes clusters, and we expect to build a lot more in the future. We intentionally limit the size of each cluster to limit the blast radius, so we're orchestrating hundreds of clusters as a herd rather than as pets. That means we have strong requirements around building scalable, resilient automation for seemingly simple things like cluster upgrades, or upgrading all the open-source components we deploy onto the clusters, like external-dns or cert-manager. We use the cloud providers' managed Kubernetes offerings, so they take care of the control planes while we manage the worker nodes plus all the workloads and open-source components we deploy onto them; that's one less thing to worry about. The platform is multi-tenant, just like the Snowflake product, in terms of namespaces and the applications running in the clusters; sometimes a single node runs 20 or 30 different containers from different namespaces and different applications. Lastly, Snowflake was founded in 2012, before Kubernetes was a thing, so we still have some applications running on VM-based infrastructure, and we need that VM-based infrastructure to interoperate with the container environment.

Because of this multi-cloud architecture, it's very important for us to preserve a cloud-agnostic abstraction as much as we can, because that abstraction prevents fragmentation and a proliferation of identities, policies, and tooling.
For example, imagine needing a different cloud provider's CLI tool to connect to each Kubernetes cluster, having to authenticate with different cloud-provider identities, and managing all the credentials associated with those. The same goes for policies: on GCP I would configure firewall rules on each host, while on AWS the equivalent is security group rules, which achieve a similar thing through drastically different API calls, with authentication managed differently. All those little pieces create friction when building automation; essentially, you're writing the same code three times if you want to support three cloud providers. Having cloud-agnostic abstractions really reduces the complexity of both development and operations, and the service mesh is a key part of that, because we can push policy up into this cloud-agnostic layer.

With Istio, for example, we can define our traffic routing, security, telemetry, and so on through the Istio API groups instead of GCP-specific calls. The same is true of Calico network policies, which give us layer-4 network policy, specifically for applications running outside the mesh. OPA Gatekeeper is a very flexible and powerful tool for defining policies such as "only this list of container images, from this list of registry hosts, may run on this cluster." Instead of three different CLI tools to access VMs or Kubernetes, we use Teleport to manage access, integrated with Okta for identity and group membership; we use group membership mostly for RBAC and access-policy definitions. And lastly, we use Snowflake itself to query logs, ingest metrics, and define alerts. Speaking of dogfooding your own product.

OK, I want to dive into the service mesh adoption story now, and I want to begin with autoscaling the ingress gateway. Obviously you want to autoscale: traffic is pretty much unpredictable, and during nights and weekends it's relatively low, so you want to save money by scaling down. But I want to tell you why this is more than just slapping a horizontal pod autoscaler onto your ingress gateway and calling it a day. The reason is that we want to preserve the source IP address at layer 3. Preserving the source IP is important for rate limiting and for allowlist policy enforcement; the X-Forwarded-For HTTP header is susceptible to spoofing, and we also have TCP services that don't speak HTTP, so we need the IP address at layer 3.

We have a setup where the gateway pods are fronted by a layer-4 load balancer. Why layer 4? Because routing traffic directly to pods, which GCP calls container-native load balancing, is not supported by all the cloud providers, and we do want consistency across clouds; we found that a layer-4 load balancer has the most feature parity across clouds. A layer-4 load balancer has this interesting behavior where traffic is routed to the nodes, essentially via a NodePort, and then forwarded to the pods by kube-proxy's iptables rules. So there's one extra hop, onto a node and then to the pod, and this behavior becomes important when it comes to source IP preservation.
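To make that concrete, here's a minimal sketch, not our actual manifest, of a gateway Service fronted by an L4 cloud load balancer; the names and ports are illustrative. The externalTrafficPolicy field is the knob that controls the extra hop I just described, and I'll come back to the Local setting in a second.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway   # illustrative name
  namespace: istio-system
spec:
  type: LoadBalancer           # provisions the cloud L4 load balancer
  selector:
    app: istio-ingressgateway
  ports:
    - name: https
      port: 443
      targetPort: 8443
  # With the default value, Cluster, the load balancer sends traffic to any
  # node's NodePort, and kube-proxy may forward it to a pod on a different
  # node, rewriting the source address so the original client IP is lost.
  # Local keeps traffic on the node that received it and preserves the
  # client source IP, at the cost of requiring a gateway pod on every node
  # the load balancer targets.
  externalTrafficPolicy: Local
```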
The traditional L4 load balancer setup runs a periodic health check against the nodes to make sure there are actual pods on each node serving traffic. On AWS that happens maybe every 20 seconds; it's configurable, but it's a non-zero delay. That makes lifecycle management for the gateway pods very challenging, because when the pod on node N is terminated, the load balancer keeps sending traffic to node N until node N is marked unhealthy, and during that window your clients can receive 503s, or plain I/O timeouts because there's no response at all. It's shameful to say, but for a while we didn't have autoscaling set up at all; we just over-provisioned the Istio ingress gateways.

To preserve the source IP, here's our setup. We run the Istio gateways as a DaemonSet on a dedicated node pool, so within that pool every node has an ingress gateway pod running on it to handle traffic. And we set externalTrafficPolicy to Local, so there's no additional hop between nodes: when the traffic leaves the L4 load balancer and hits a node, that's the node that handles the request. Otherwise, if traffic is balanced off to other nodes, the source IP address gets rewritten to the proxying node's address, and that breaks your rate limiting and allowlist policies.

But at some point we need to terminate Istio gateway pods, whether to upgrade Istio or to upgrade the node pool the pods run on. The way to do that safely, without dropping traffic, is to first deregister the hosting node from the cloud load balancer, make sure the deregistration completes, and only then issue the pod delete request. There's a cloud-agnostic way to trigger this: apply a node label and then cordon the node. But you still have to build custom tooling that polls the cloud provider's API to confirm the deregistration actually completed, because everything happens asynchronously.

We really wanted autoscaling, though, both for the cost savings and to be safe when a traffic spike hits us. Our first design is almost a perfect straw man: it feels very intuitive and easy to implement. Use an admission webhook to intercept every pod delete request for the ingress gateway pods, hold the request while deregistering the pod's node from the load balancer, and release the delete request once the deregistration completes. The showstopper is that Kubernetes admission webhooks have a 30-second maximum timeout, after which the pod delete request is released regardless, and 30 seconds is not enough to ensure the node has been deregistered.

So where do we go from here? There are three possible solutions. The first is to explore an eBPF data plane to replace the iptables-based kube-proxy. That has the really nice property of retaining the source IP address even with externalTrafficPolicy set to Cluster, so you can place your pods on any nodes you like, the traffic will be evenly spread, and you don't have to worry about the source IP being changed. But for us specifically at Snowflake, we need to know whether the ARM64 architecture is supported and whether it's FIPS-compliant; if not, we have to figure out whether to build a custom distribution, or whether it's a showstopper for us.
For those of you who don't deal with federal customers: FIPS compliance basically requires every tool and system you run to use a federally approved implementation of its cryptographic libraries. The native Go crypto library is not FIPS-compliant, so you have to build your applications differently. The second option is a custom node pool autoscaler that simply works around the 30-second restriction I mentioned for the admission webhook; because the logic lives at the node pool autoscaler layer, there's no timeout around node deregistration. And lastly, there's a Kubernetes enhancement proposal trying to address this exact issue. The problem is that it's going to be a long wait: it will take at least a few releases for the feature to be ready and rolled out, and then clients and cloud providers will take a few more quarters to actually adopt that release.

The second challenge I want to tell you about is the day-two problem of upgrading Istio. We want to do it safely, in a blue-green fashion, and we sometimes have to do it several times a quarter, because Istio often publishes patch releases that address CVEs, and minor releases go out of the support window. Upgrading 170 clusters manually would be insane; it would be a lot of work and very error-prone, so it requires custom tooling and automation. Internally, we follow a blue-green upgrade process with two nice properties: we can validate the new Istio release before shifting workloads over, and we can roll back quickly if the new version has bugs or doesn't interoperate with the old version we're currently running.

I want to show you a really quick animation of how the blue-green upgrade works, so we're on the same page. The process is inspired by the canary upgrade that the Istio community proposed. Assume all the Istio components are on blue right now, including the control plane and the sidecar proxies. The first thing we do is deploy a new version of the Istio control plane; call it green. Then we select a subset of namespaces and change the Istio revision label on those namespaces to green, so that the new version of the sidecar proxy gets injected. Notice that we're now running the blue and green meshes at the same time in the same cluster. This is important, because by splitting a server-client pair across the two meshes, we can test the interoperability of the two Istio versions; what the test covers depends on your applications and the set of features you use: mTLS, rate limiting, traffic shifting, et cetera. Lastly, we move the rest of the namespaces over to the green mesh as well, and eventually retire the blue control plane. So that's fairly intuitive.
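To make the relabeling step concrete before we get to traffic draining, here's a minimal sketch, assuming a revision-based (canary) Istio install where the two control planes were deployed with revisions literally named blue and green; the namespace is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app               # hypothetical application namespace
  labels:
    # Point sidecar injection at the green control plane. Any legacy
    # istio-injection=enabled label has to be removed first, because
    # it takes precedence over istio.io/rev.
    istio.io/rev: green
```

Relabeling by itself re-injects nothing: existing pods keep their blue sidecars until they're restarted, for example with a rolling restart of each Deployment, which is what lets you move workloads over namespace by namespace.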
One caveat: before we retire the old control plane, we need to make sure all the client traffic has been drained to the new Istio gateways. Because we essentially have two separate deployments of the Istio service mesh, the gateways are fronted by different sets of load balancers, and that's where draining the client traffic gets interesting. Our first implementation was a DNS cutover: update the DNS A record for the cluster's ingress gateway to point to the IP address of the new load balancer. We can do that in a cloud-agnostic way using external-dns, which basically manages your DNS configuration through Kubernetes annotations.

This diagram shows how the DNS update works. On one side, it's me updating the DNS record through our authoritative name server; eventually, whichever DNS resolver picks up that update returns the new response to the client, and the response can point to either the blue or the green load balancer.

Now, there are a lot of issues with DNS, primarily because we have no control over client-side caching, which is the second point right here; it makes traffic shifting via DNS very ineffective. In practice, we noticed that two weeks after we updated a DNS record, there was still traffic hitting the old load balancer, and there was nothing we could do about it. We basically just set a deadline, and after that we gave up and retired the old load balancer. And going back to the first point: for some cloud providers, external-dns updates are not atomic. They're performed as two API calls, a delete and then an update, and it's possible for the delete to succeed and the update to fail. Then every DNS resolution returns NXDOMAIN, and that's not great. What's worse, clients can cache the NXDOMAIN response indefinitely, and there's nothing you can do about that either. Even when both API calls succeed, there's a five-to-ten-second delay between the two operations, and you'll get NXDOMAIN in that tiny window. So, DNS updates: not great.

How do we work around this? Essentially: don't do DNS updates. But how? With a common set of load balancers that doesn't change between blue and green. The way to achieve this is to define a Service object without a label selector and configure an Endpoints object manually. This matters because with a label selector on the Service object, you can only select pods from the namespace where the Service resides; but the blue and green Istio deployments live in different namespaces, and we want to select pods from both. That's why we use a Service without a label selector and define the Endpoints object as a list of IP addresses of the selected pods, which can be a mix of blue and green gateway pods. We built a really simple controller that manages the Endpoints object for us, so we don't have to do it by hand; pods are, by definition, ephemeral.

Here's a really quick animation of this common-LB setup. Initially, the common load balancer, in orange, points to the blue Istio mesh. Then we deploy the new version of Istio, and notice that it still comes with its own load balancer. That one is ephemeral; it exists just for testing, because we want to make sure traffic can actually reach the green Istio gateways, and to test the new gateways they have to sit behind some load balancer. If we instead added them as a backend of the common load balancer, in yellow, they would start receiving production traffic instantaneously, and that's not what we want. So we keep a separate, temporary load balancer just for testing. Then we update the Endpoints object within the Kubernetes cluster to select the new set of gateway pods. And lastly, we retire the old Istio mesh along with the temporary green load balancer, and all the production traffic is now hitting Istio green. Notice that there are no DNS updates anywhere in those steps.
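Going back to that selector-less Service for a second, here's a minimal sketch of the pairing; all names and IPs are illustrative. The Endpoints object must live in the same namespace and carry the same name as the Service, but the pod IPs it lists can belong to pods in any namespace, which is exactly what mixing blue and green gateways requires.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: common-ingress
  namespace: ingress
spec:
  type: LoadBalancer           # the long-lived "common" load balancer
  # No selector: Kubernetes will not manage Endpoints for this Service,
  # leaving us free to list gateway pods from other namespaces ourselves.
  ports:
    - name: https
      port: 443
      targetPort: 8443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: common-ingress        # must match the Service name
  namespace: ingress
subsets:
  - addresses:
      - ip: 10.0.1.15         # a blue gateway pod
      - ip: 10.0.2.23         # a green gateway pod
    ports:
      - name: https
        port: 8443
```

In practice, the small controller mentioned above would reconcile this Endpoints object from the live gateway pods, since pod IPs change on every restart.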
Okay, it looks like I have some time, so I want to cover an appendix I prepared: some open questions with multi-cloud. Istio and Kubernetes will help with your multi-cloud adoption, but they're not a panacea; there are issues they don't address. For example, applications that consume cloud provider services still have to use cloud-specific libraries, which means the same logic has to be implemented three times if you want to support, say, the big three cloud providers. An example would be an application that writes to a blob storage bucket like S3 or GCS, or writes to a message queue. There's pod identity, which on GKE is called Workload Identity, and which maps a Kubernetes service account to a cloud service account or role, depending on your cloud provider; but that only solves authentication and authorization to the cloud resources, not the actual API calls for interacting with the cloud. Custom resource definitions can solve resource provisioning, but again, they don't solve your application's interaction with the message queue or the bucket. And cloud resources and policies that cannot be abstracted away by these cloud-agnostic layers remain heterogeneous. For example, managing cloud DNS records with an identity that is granted permission to create only, say, TXT records is not something you can do with Kubernetes alone; you have to provision those identities and policies in whichever cloud provider you're on.

With that, I think that's the end of my talk. Thank you so much for being here and for your time. We're hiring globally, so if you're interested in multi-cloud, distributed systems, or open-source software, come talk to me. Thank you. And we actually have plenty of time for questions: we have five minutes until the break, and after this there's a coffee break, so we can get coffee, and maybe water, because, you know, coffee hydrates and you'll want to sleep more after drinking enough of it. But questions, questions. Over there. I'll bring the microphone, just a second.

That canary upgrade trick without a DNS swing is awesome, really cool. Thanks. I wanted to ask: you mentioned that the X-Forwarded-For header is susceptible to spoofing. I assume you tried setting numTrustedProxies and it didn't work? Why didn't that work?

I don't know. I think that's a decision imposed by the security teams, and it's a decision from before my time.

That sounds like something security teams would worry about. I'm just curious. Thanks.

Any other questions? All right, in that case, let's do one more round of applause for Charles. Thank you.