Hi. This talk is about autoscaling elastic Kubernetes infrastructure in the presence of stateful applications that use proxyless gRPC and Istio. My name is Sanjay Pujare, and I am an engineer in Google Cloud.

And I'm Costin Manolache, working on Istio, also at Google.

So how many of you attended the gRPC maintainers talk yesterday, a show of hands? OK. So we'll start off with a description of Kubernetes as elastic, autoscalable infrastructure. I'll then talk about the problems faced by stateful applications in such an environment. We'll look at implementing the stateful session affinity that stateful applications need. I'll discuss an approach based on cookies for stateful session affinity and how it is implemented in the proxyless gRPC library. Session draining and canary deployments are important aspects of this discussion, because they affect stateful session affinity, so we'll touch upon those. We'll then move on to a real-life use case from Broadcom, who are going to be using this feature in their WSS software. Costin will talk about how to use this feature with the new Gateway API and about the status of our implementation. And hopefully we'll have a few minutes towards the end for Q&A.

OK. It's Friday. I suspect you have already heard about Kubernetes autoscaling and how awesome it is. Autoscaling allows Kubernetes to scale up, starting new pods when traffic increases, and to reduce the number of pods when traffic goes down. It's wonderful, but it has some dark corners; there are some small issues. For example, you probably know that when a pod starts up, you need to wait for it to warm up before the readiness probe passes, so that you don't get high latency or errors at startup. Similarly, when a pod goes down, there are some things you need to be careful about, and we'll talk about this in more detail. In this example we are focusing on scale-down, and we are showing that pods will eventually be killed and nodes freed up, so your application can optimize costs while keeping good performance characteristics. Let's go into the details here.

OK. So when downscaling happens, there are two things going on. Traffic to existing backends can get shifted, even when ring hash load balancing is used, and backends with assigned sessions might be removed as part of the downscaling. In this diagram, the load balancer, shown in the yellow rectangles, is load balancing traffic to backends running in Kubernetes. When the HPA, the horizontal pod autoscaler, kicks in, the pod in node two is shut down, which can result in two things. There is a change in the number of backends, so the load balancer will shift traffic to other backends, which is what typically happens when a ring hash load balancer is used. And the other issue is that all the traffic that was supposed to go to the removed backend now has to go to another backend. Maybe, to maintain stateful session affinity, that backend should not have been removed at all, or maybe it was removed too soon. So how do we fix this?

I mentioned the ring hash load balancer, which is quite common, and I assume most of you know how it works, so let me explain its limitations. On the left-hand side, we have five backends on the ring. When a request with a hash value of 111 comes in, the load balancer sends the request to the server with hash value 139, because that is the closest on the ring. So far, so good. Now a new server gets added, and its hash value is 115. You can see it sits between server 74 and server 139. Now a request with the same hash value, that is, a request in the same session, comes in. The ring hash load balancer is going to send it to this new server 115, because it is closer to 111, and session affinity is broken. This happens because session affinity here is stateless. We need something stateful, something that remembers where we sent previous requests in that session. So ring hash doesn't work; we need stateful load balancing for our stateful application.
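To make the remapping concrete, here is a minimal Go sketch of a ring hash lookup, using the hash values from the slide directly. This is a toy model, not the actual gRPC ring hash implementation, which among other things places many virtual points per backend on the ring:

```go
package main

import (
	"fmt"
	"sort"
)

// pick returns the first backend clockwise from the request's hash.
// (A real ring hashes each backend address onto many points; here we
// use the hash values from the slide directly.)
func pick(ring []uint32, owner map[uint32]string, reqHash uint32) string {
	i := sort.Search(len(ring), func(i int) bool { return ring[i] >= reqHash })
	if i == len(ring) {
		i = 0 // wrap around the ring
	}
	return owner[ring[i]]
}

func main() {
	owner := map[uint32]string{22: "server-22", 74: "server-74", 139: "server-139", 201: "server-201", 240: "server-240"}
	ring := []uint32{22, 74, 139, 201, 240}

	fmt.Println(pick(ring, owner, 111)) // server-139: closest clockwise

	// A new server hashes to 115, landing between 74 and 139.
	owner[115] = "server-115"
	ring = append(ring, 115)
	sort.Slice(ring, func(i, j int) bool { return ring[i] < ring[j] })

	fmt.Println(pick(ring, owner, 111)) // server-115: same session hash, new backend
}
```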
One option: in this diagram, we have a hash table in our load balancer that maps each client session to a backend; you could say the session is assigned to a backend. Imagine an actual proxy load balancer doing this. It will have a memory requirement proportional to the number of active client sessions, and there is no easy way to remove entries, because the load balancer may not be application-aware and does not know when sessions end. Also, multiple load balancer proxy instances may need to synchronize this data between them. So this is hard to scale, and it is expensive.
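To make that concrete, the proxy would need something like the session table sketched below, a hypothetical Go illustration of the approach just described, not code from any real proxy:

```go
package main

import (
	"fmt"
	"sync"
)

// statefulLB sketches the in-proxy session table described above:
// a map from session ID to backend address. Memory grows with the
// number of live sessions, and the proxy, not being application-aware,
// never knows when an entry is safe to delete. Multiple proxy
// instances would also have to synchronize this map between them.
type statefulLB struct {
	mu       sync.Mutex
	sessions map[string]string // session ID -> backend "ip:port"
}

// route returns the remembered backend, or assigns one on first use.
func (lb *statefulLB) route(sessionID string, pickBackend func() string) string {
	lb.mu.Lock()
	defer lb.mu.Unlock()
	if backend, ok := lb.sessions[sessionID]; ok {
		return backend // stick to the assigned backend
	}
	backend := pickBackend() // first request: normal load balancing
	lb.sessions[sessionID] = backend
	return backend
}

func main() {
	lb := &statefulLB{sessions: map[string]string{}}
	next := func() string { return "10.0.0.7:8080" }
	fmt.Println(lb.route("session-a", next)) // assigned on first request
	fmt.Println(lb.route("session-a", next)) // remembered afterwards
}
```

Every live session costs an entry, and nothing in this design tells the proxy when an entry can go away.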
So instead, we use a cookie to maintain that state. The load balancer issues a cookie after routing the first RPC in a session, and the cookie is returned by the client in all the subsequent RPCs in that session. The cookie basically encodes the backend address, which is the IP plus port. On all the subsequent RPCs, the load balancer just decodes the cookie, gets the backend address, and routes the RPC to that address. It's that simple: we get the client to maintain the state for us.

So here is a diagrammatic explanation of the feature. The first request is sent by the load balancer to server 2, which the load balancer picked based on some load balancing algorithm. Once the response to that request, shown here as RSP1, is received by the load balancer, it adds a cookie to that response. The cookie value is nothing but the server 2 IP plus port, base64 encoded. RSP1 is received by the client, and the embedded cookie is used by the client in all the subsequent requests in the session, shown here as REQ2 through REQN. The load balancer just decodes the cookie value, takes the backend IP plus port out of it, and sends those requests to server 2, thereby maintaining session affinity.
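The cookie value can be as simple as the backend address, base64 encoded. Here is a minimal Go sketch of that round trip; the exact cookie format used by gRPC and Envoy may differ in its details:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeCookie turns the chosen backend address into a cookie value,
// so the client carries the routing state instead of the load balancer.
func encodeCookie(backendAddr string) string {
	return base64.StdEncoding.EncodeToString([]byte(backendAddr))
}

// decodeCookie recovers the backend address from the cookie on
// subsequent RPCs in the session.
func decodeCookie(value string) (string, error) {
	b, err := base64.StdEncoding.DecodeString(value)
	return string(b), err
}

func main() {
	cookie := encodeCookie("10.0.0.42:8080") // after routing the first RPC
	fmt.Println(cookie)                      // value attached to RSP1

	addr, _ := decodeCookie(cookie) // on each subsequent RPC
	fmt.Println(addr)               // 10.0.0.42:8080, route straight there
}
```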
With proxyless gRPC, the load balancing functionality is implemented inside the gRPC library, which is inside the client application process. gRPC performs load balancing for the first RPC and adds the cookie to the first response received. The client application copies the cookie from the first response into all the subsequent RPCs, and gRPC uses the cookie to route those RPCs to the appropriate backend, which is server 2 in this diagram. Istio uses xDS to provide the necessary configuration to gRPC, which typically consists of the service- and route-level information, the cookie name, the cookie time to live, and so on. Istio also collects all the backend information, such as the list of backends and the clusters they belong to, and sends it to gRPC, which uses it to perform load balancing and maintain cookie-based session affinity.
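On the client side, none of this requires explicit load balancer wiring: the channel just points at an xds: target, and gRPC pulls the routing and affinity configuration from Istio over xDS. A minimal Go sketch follows; the service hostname is made up for illustration, and the xDS bootstrap file is located via the GRPC_XDS_BOOTSTRAP environment variable:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and balancers
)

func main() {
	// The xds:/// scheme tells gRPC to fetch routing, load-balancing,
	// and session-affinity configuration over xDS (from Istio in this
	// talk's setup), instead of going through a local proxy.
	conn, err := grpc.Dial("xds:///policy-eval.example.svc.cluster.local:8080",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to create channel: %v", err)
	}
	defer conn.Close()
	// ... create service stubs on conn and issue RPCs as usual ...
}
```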
But what about the second problem we mentioned earlier as part of downscaling? That part of the diagram is shown here. The problem is that pods, or backends, are removed whether or not they have assigned sessions. After removal, their RPCs will be routed to a different backend, pod 2 in this example, and session affinity is broken. To solve this, we create a new state for a pod. Let's call it draining. A pod in the draining state is scheduled for removal, but not removed immediately. Instead, the pod is kept around so it can drain its sessions. Once all its sessions are gone, the pod is shut down and removed. While draining, it continues to receive the RPCs of its assigned sessions. So let's see how this is implemented.

OK, so in Kubernetes we have pod termination, which generally happens in three phases. First, when the autoscaler or the user decides to terminate a pod, a shutdown hook, the preStop hook, is called; this is some custom code that you control. The preStop hook may coordinate with the application itself to check whether it has pending sessions. It may do all kinds of fancy things, like saving state so it can be migrated to a different pod, or it can just sleep and wait until you expect your sessions to expire. When the preStop hook finishes, the pod receives SIGTERM first, and then SIGKILL to actually terminate the container. Kubernetes also has a termination grace period that lets you bound the entire duration of the termination, to give the application time to drain and to control the shutdown process.

Let's see how Istio implements this and how it takes advantage of this feature. Istio watches the Kubernetes EndpointSlices. In the EndpointSlices, Kubernetes marks the pods that are terminating, which tells Istio that it needs to notify all the clients, gRPC or Envoy, that the pod is in this draining state. Istio takes the IP address of the terminating pod and marks it as draining, and Envoy and proxyless gRPC pick up this state and switch the behavior of the load balancer so that all new requests skip this IP, this backend. But if you have a cookie, it will still go to the pod that expects to receive the session. So that takes care of the draining.
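Going back to the preStop hook for a moment: one common shape for that coordination is a hook that polls the application until its sessions are gone. Below is a hypothetical Go sketch of the application side; the endpoint name and the session counter are illustrative, not something Kubernetes or Istio provides:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// activeSessions would be incremented when a session is assigned to
// this pod and decremented when the client ends it (hypothetical
// bookkeeping owned by the application).
var activeSessions atomic.Int64

func main() {
	// A preStop hook (for example, a small script polling this endpoint
	// in a loop) blocks pod shutdown until we report "drained".
	// Kubernetes sends SIGTERM only after the hook returns, bounded by
	// terminationGracePeriodSeconds.
	http.HandleFunc("/drain-status", func(w http.ResponseWriter, r *http.Request) {
		n := activeSessions.Load()
		if n == 0 {
			w.WriteHeader(http.StatusOK) // drained: safe to shut down
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintf(w, "%d sessions still draining\n", n)
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```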
But there is another problem that affects persistent sessions, which is canary deployment, or traffic shifting. This example uses Istio, but with the new Gateway API that you have probably heard about in a few sessions before; we are using an HTTPRoute. We create two services, let's say version 1 and version 2. You are rolling out version 2, and you want to send only 10% of the traffic to the new version, to verify that it works and to be able to roll back in case of problems. How does this impact persistent sessions? Since we are shifting traffic, if you have a cookie, then even if the request would normally be subject to the 90/10 split, the session should bypass this configuration and still go to the original backend where the session is expected to go.

Without extra information in the cookie, though, here is what happens. As you can see in this diagram, the first RPC of the session is sent to cluster v1. The response from the backend V1S1 is annotated by the load balancer with a cookie indicating the backend address. However, for the next RPC, REQ2, and possibly others in the session, the load balancer picks cluster v2 based on the cluster weights. And if those RPCs are sent to cluster v2, session affinity is broken, because the backend V1S1 obviously does not exist in that cluster.

So how do we fix this? We solve the problem by including the cluster name in the cookie. We show here a sample cookie value, which contains cluster v1 as part of the value. The first RPC is processed by backend V1S1, which is part of cluster v1. The load balancer includes the cluster name in the cookie that is added to RSP1, the response that is sent back. The client includes the cookie in all the subsequent RPCs, and the load balancer extracts the cluster name from the cookie and selects that cluster instead of using the weight-based cluster selection algorithm. Within that cluster, it selects the backend included in the cookie, and session affinity is maintained. Of course, this is provided the cluster and the backend within that cluster are still part of the configuration and are healthy.
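Putting the two pieces together, the route-level selection behaves roughly like this: honor the cluster recorded in the cookie when there is one and it still exists; otherwise fall back to the weighted pick. This is a simplified Go sketch, not the actual gRPC routing code, and the cookie format shown is illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

type cluster struct {
	name   string
	weight int // e.g. v1=90, v2=10 for the canary
}

// pickCluster honors the cluster stored in the cookie, if any; otherwise
// (first RPC of a session, or a stale cookie) it falls back to weights.
func pickCluster(clusters []cluster, cookie string) string {
	if cookie != "" {
		// Cookie value sketched as "clusterName;ip:port" (illustrative format).
		name := strings.SplitN(cookie, ";", 2)[0]
		for _, c := range clusters {
			if c.name == name {
				return c.name // bypass the 90/10 split to keep affinity
			}
		}
	}
	total := 0
	for _, c := range clusters {
		total += c.weight
	}
	r := rand.Intn(total)
	for _, c := range clusters {
		if r < c.weight {
			return c.name
		}
		r -= c.weight
	}
	return clusters[len(clusters)-1].name
}

func main() {
	clusters := []cluster{{"cluster-v1", 90}, {"cluster-v2", 10}}
	fmt.Println(pickCluster(clusters, ""))                       // weighted pick
	fmt.Println(pickCluster(clusters, "cluster-v1;10.0.0.7:80")) // always cluster-v1
}
```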
So after having gone through the technical details of the implementation, let me talk about the use case, to provide some context for why we are doing this. I mentioned Broadcom earlier. The Symantec Enterprise Division is part of Broadcom Software, and this division's software is used to secure customers' internet access globally, providing smooth service to millions of users. Some of the larger customers have over 300,000 deployed users. Their solution, called WSS, which is a web security solution, is deployed on Google Cloud Platform in all of the available regions around the world. Customers' internet-facing traffic is securely routed to WSS services. The policy evaluation microservice is a core component of the WSS solution's data path. The service as a whole must handle tens of thousands of requests per second from various enforcement agents on the data path. These agents submit series of requests on a per-customer-transaction basis, and the state of those related requests must be efficiently correlated when evaluating customer-configured security policies.

With such high RPS requirements and the need to maintain state across RPCs, the cost of storing the state external to the service instance becomes a bottleneck. Moreover, keeping a coherent shared state across instances is impractical: most RPCs actually mutate the state maintained per customer transaction. So the classic model of sharding or consistent-hashing algorithms would require a partial state resync any time the number of serving pods changes. If we can efficiently maintain pod-local state while preserving correct and consistent request routing, we can simplify the state management. Strong session affinity, also known as stateful session affinity, enables such consistent request routing to pods without much complexity on the load balancer and service side. It also allows the clients to collaboratively provide hints to the routing logic, so that we can achieve greater internal service caching. Stateful services introduce additional requirements when using common paradigms such as canary deployments, HPA-based scaling, or rolling upgrades: since the serving pods have state, the design needs to be cognizant of that, route existing sessions correctly, and drain the state properly.

gRPC provides a lot of features for load balancing and a good framework for advanced use cases like this one, and it now includes stateful sessions; it already had traffic shifting, draining, and load balancing support. Envoy is a very powerful proxy, and it's great for most use cases, but it does add a bit of performance overhead, which impedes the adoption of proxy-based service mesh in use cases like this one, where ultra-low latency or high requests per second are required. Broadcom has been working with Google and with the Istio community to enhance and formalize the support in Istio for stateful session affinity, and also with the gRPC team to add support for stateful sessions and Istio to the gRPC framework. This slide shows some of the performance evaluations that they did for this.

Like all benchmarks, this one needs some context. It is obviously using their super-specialized application, which is optimized: it's C++, and it's in-memory only. And on the Istio side, we are using a fairly untuned setup: we don't increase the CPU, we don't turn off features. But as you can see, the 30,000 QPS are usually not achievable with Istio out of the box in this scenario, and the latency is slightly better with proxyless gRPC, and it meets their needs. This is not a typical use case; it's only for applications that have these kinds of strict requirements. But when you have to, this is probably one of the best options you have.

Let me talk about the status of this feature and all the related sub-features we covered here. We added this feature to gRPC, and there is a design document that is linked here on the gRPC side. Envoy already had persistent sessions and draining, and that is what we reuse for gRPC, so we share the same concepts and the same configuration. We had to provide some API to enable this feature in Istio. After long discussions, we added a label, but that's not going to be the final API. We are waiting for the Gateway API working group to formalize the common APIs that other vendors will hopefully use to implement persistent sessions. Istio, as I mentioned, is one of the many implementations that support the Gateway API, and we are working with the working group to formalize this feature. Google also has another implementation of the Gateway API called Traffic Director, which also supports session affinity, but in a slightly different way, and I'll show an example a bit later.

By now, I suspect I don't need to ask if you know what the Gateway API is; there have been quite a few talks about it. We are using it in Istio to model L7 traffic, with the Gateway and HTTPRoute resources. There is a GRPCRoute as well. We implemented this a while back, it has been available, and we are planning to make it the default and recommended way, at least for gRPC, because it's a very good fit with the gRPC feature set. We are also using the exact same API in the GKE gateway controller and Traffic Director, and, as I mentioned, many other vendors are adopting it. In the Gateway API there is a concept called vendor extensions, which allows each vendor to define its own specific features before they are standardized, or maybe for the long term. The example here shows how persistent sessions are implemented with Google, GKE, and Traffic Director. You will notice that all vendor extensions have a targetRef, which is the Service they apply to, and you can define a different configuration for that particular service. Once the working group defines the appropriate API, we will support it in Istio as well as in Traffic Director and the GKE gateway controller, and everyone will switch to the new API.

Questions? I think we managed to save five minutes. Any questions? Okay, looks like no questions.

I already approached you yesterday, I don't know if you recall me. We kind of have the opposite problem: HPA is scaling up, and the connections have affinity to the few existing pods that were there before. Do you know of any approaches to spread the load on these persistent connections? We would prefer killing the connections rather than keeping them.

It's kind of orthogonal. I think you posted that question on Slack as well, right?

I didn't.

Okay. Maybe you want to answer?

Yes, that's something that is a bit complicated, because normally, with persistent sessions, the definition of the feature is that if you have a persistent session, it needs to keep going to the same place. But you can delete the cookie, and if you delete the cookie, then you go back to normal load balancing. That's something your application needs to do, though; it cannot be automated, because we cannot guess.

Other questions? Yes.

Hi, I have a question. If the user's session lasts maybe one hour, or 24 hours, the pod will stay up for 24 hours, right? It cannot be terminated during that time.

That's a feature, not a bug. I mean, if you want your session to last for 24 hours, we will keep the pod for 24 hours, because that's what the user wants.

But scaling down then becomes very problematic for the cluster and the application, because we cannot scale down immediately; in the short term it's not very responsive.

One thing you can do is try to move the state to a different pod, I mean, transfer the state.

So is it more of a workaround for stateful applications that are not very nicely designed, or what do you think?

There is something called the termination grace period. Once Kubernetes or the user wants to reclaim or shut down the pod, it goes into this terminating state, and there is a timeout for that, let's say one hour. After one hour, regardless of the state, the sessions, or anything else, it will kill the pod. So that bounds it. Of course, the application will get impacted, because you are killing the sessions without taking care of them properly, but that addresses your problem at least.

It means that, from the user's point of view, if I log in and my pod is terminated after that one hour, I have to log in again.

If you can stop by afterwards, we can discuss in more detail. Yeah.

Thank you for the session. I would like to ask about this proxyless model. Is it something that will be developed further for gRPC, so there will be no need for Envoy, or what does it mean?

Yeah, okay, that's a good question. We actually implemented proxyless gRPC almost three years ago, and it has been working in Google Cloud with the Traffic Director that I mentioned. It has been a supported GA feature for the last two and a half years at least, if not three. And with Istio, it has been working informally for the last one or two years, and we are in the process of making that feature GA. So yes, to answer your question, proxyless gRPC with Istio, without the Envoy proxy, is a thing, and it will be a properly released feature maybe towards the end of this year. It works with Go and C++ and also Java.

Okay, thank you.

Yeah, just one comment. It doesn't support all the Istio features, and it's intended for advanced users that have this kind of latency requirement. If you want to take advantage of all the Istio features, it's probably still good to use the sidecar. But if you are in this scenario, then proxyless is a very good choice.

Also, on the gRPC GitHub repo, there is a page that lists the xDS features supported by gRPC. We try to keep it up to date, so you can see there which features that Envoy supports are also supported by gRPC, and decide based on that.

Okay, thank you.

Hello. Yeah, I have a similar question to what other people have asked. It's related to the fact that this becomes kind of a sticky session, right, when we implement it. And what I was thinking: if someone knows that it's a sticky session, they could try to flood it with requests, because the session is still there. It will always go to just one server, and it can take that server down. How do we deal with that?

But that is a feature, I mean... So typically, in the use case that we talked about, the client will actually end the session. The client knows when the session is over, so it will send something like an end-session RPC, and at that point the backend knows that it doesn't have to wait on that session anymore. So over a period of time, let's say two hours or whatever, the backend will get rid of all its sessions, and it can then be shut down as part of scale-down.

So it's really something the application and the infrastructure need to work together on, right?

Correct. But in the meanwhile, if someone wants to exploit it, isn't it dangerous?

Oh yeah, that's true. In terms of exploitation, it's possible. Let's discuss it; we are out of time here, I think. We can continue at the booth on the other side. Okay, okay.