Okay, next up we've got Laurent Bernaille, who is going to be talking to us about the evolution of kube-proxy.

So as I was saying, I work at Datadog. We're a monitoring company, and I put a few figures on the slide, but what matters most is that we run a pretty large infrastructure with a lot of Kubernetes clusters, some of them very big, up to 3,000 nodes. Of course, as we run large clusters, we've had a few issues with Kubernetes, some of them related to accessing services at this scale. That's why I got very involved with kube-proxy, and in particular with the IPVS mode for kube-proxy.

Just to get started, some very quick background on kube-proxy itself. Kube-proxy is the component responsible for managing the service abstraction in Kubernetes. It runs on every node, and it enables access to the cluster IP, which is a virtual IP, from that node. Basically, when you try to access a service, you try to access this virtual IP, and kube-proxy is responsible for translating this IP into a pod backend.

If we look at what a deployment and a service look like: in this example, you have a deployment of three pods, and what matters for this deployment is that it has a label, app equals echo. Then you have a matching service on the right-hand side, which selects pods from this deployment with a selector saying "select all the pods with label app equal echo". When you create these objects, first you create the deployment, the deployment creates a replica set, which is responsible for creating all the pods. So in our case we have three pods, and the replica set is going to manage these pods. As I was saying before, all these pods have a label, and the service is matching the same label, so it considers all these pods as backends for the service.

There's actually an additional object, which is called an Endpoints object. This object is completely managed by the control plane; you will almost never see it, because you don't interact with it. It's actually an optimization, so that kube-proxy on all the nodes doesn't have to watch all the pod objects, which can be pretty big: a pod object is easily a thousand bytes, quite often bigger, and of course if you have a large number of pods and a large number of services, and you need an in-memory description of all the pods on all the nodes, that's going to be pretty big. The Endpoints object is a very simplified view of all the IPs backing a given service. In my example here, you can see that it's a very simple structure with the IPs of all the pods backing the service.

There's an additional concept, which is called readiness. In Kubernetes you can associate a readiness probe with a deployment, and in that case a pod won't be ready unless it satisfies this readiness probe. For instance, you can specify an HTTP probe that connects to a path and verifies that it's healthy. The way it's used is: if a pod is not ready, because it can't serve traffic, it won't be added as an endpoint, so when an application tries to connect to the service, it won't be routed to that pod.
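The manifests themselves aren't visible in the recording, so here's a hedged reconstruction of what the echo example might look like — the image, port, and probe path are assumptions:

```yaml
# Hypothetical reconstruction of the "echo" example from the slides.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo                    # the label the Service selects on
    spec:
      containers:
      - name: echo
        image: example/echo:latest   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:              # pod only becomes an endpoint once this passes
          httpGet:
            path: /healthz
            port: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  selector:
    app: echo                        # selects the three pods above
  ports:
  - port: 80
    targetPort: 8080
```

And the Endpoints object the control plane would maintain for it looks roughly like this (IPs made up):

```yaml
# Managed by the endpoint controller, not applied by the user.
apiVersion: v1
kind: Endpoints
metadata:
  name: echo
subsets:
- addresses:
  - ip: 10.244.1.5
  - ip: 10.244.2.7
  - ip: 10.244.3.2
  ports:
  - port: 8080
```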
So if I get back to the example I was giving before, with my echo service and its three pods: if I get into another pod and connect to the virtual IP shown on the slide — this is the IP of the service — I'm actually connected to a pod. As you can see, each pod just answers with its own source IP, and you can see that I'm reaching three different pods, which are the three pods backing the service. Of course, if one of the pods was not ready, it wouldn't show up there: if only two of the three pods were ready, I would see only two IPs.

So that was the very quick introduction on what kube-proxy does. Now, how does this all work? In order for a pod to be added to an Endpoints object, it needs to be running on a node and it needs to be ready. The component responsible for giving this information to the control plane is the kubelet. The kubelet is the main component running on each Kubernetes node: it's responsible for starting pods by interacting with the container runtime, and also for performing the health checks, if there are health checks for the pod. The kubelet continuously reports information on the pods running on the host and updates their status: is the pod running, is the pod ready, all this information. This information is reported to the API server and stored in etcd, which is the data store for Kubernetes.

This is then translated into Endpoints objects by a specific controller in Kubernetes, called the endpoint controller. The role of this controller is just to maintain the Endpoints objects: it watches all the services and all the pods, and updates the Endpoints object for each service based on all the ready pods matching the service's label selector.

Now that we have this Endpoints object, synchronized by the endpoint controller, if we want to access a service from a node or from a container, this is where kube-proxy enters into play. Kube-proxy is responsible for watching services and the endpoints associated with each service, and for configuring something that I call a "proxier" — and you're going to see why I call it a proxier, because there are different ways to do this. When a client connects to the service using the virtual IP, the proxier is responsible for sending traffic to actual pods. So I'm a client trying to talk to the echo service I mentioned earlier: I send traffic to the virtual IP, and the proxier routes me to either pod 1 or pod 2 in my example.

There are different implementations of the proxier. The initial implementation was a userspace implementation — you can think of it as something like HAProxy: you have a local proxy running, and all traffic runs through it. The actual implementation looks like this: when a client sends traffic to a service IP, the traffic is rerouted to kube-proxy itself by an iptables rule. For every service in the cluster, kube-proxy locally finds an available port, binds it, and creates an iptables rule that redirects traffic for the service to kube-proxy. Then kube-proxy itself does the actual load balancing and connects to pod 1 or pod 2 in my example. If you look at iptables on the instances themselves, you can see that in the PREROUTING chain there's a rule capturing everything and sending traffic to a chain called KUBE-PORTALS-CONTAINER.
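The exact rules aren't readable in the recording, but a rough reconstruction of what userspace-mode kube-proxy installed looks like this (addresses and ports are made up):

```sh
# Hypothetical reconstruction of userspace-mode rules.
# All traffic entering the node goes through kube-proxy's chain:
iptables -t nat -A PREROUTING -j KUBE-PORTALS-CONTAINER

# One rule per service: traffic to the cluster IP (10.0.0.10:80 here) is
# DNATed to the port kube-proxy picked and bound on the node (36412 here):
iptables -t nat -A KUBE-PORTALS-CONTAINER -d 10.0.0.10/32 -p tcp --dport 80 \
  -j DNAT --to-destination 10.128.0.5:36412
```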
In that chain, you have one rule for each service in the cluster, and it's actually very simple: if the destination of the traffic is the service IP (the VIP on the slide) and the service port, then traffic is DNATed to the IP of the node, on a specific port that kube-proxy picked among the available ports and bound locally.

So this works fine, but as you can imagine, it's not very performant, because every packet goes from the kernel to userland, to the proxy, and then back out on the interface. So it's not great. Also, if you do that, there's no way to keep the source IP: if you're accessing the service from a container, the source IP you're going to see at the destination is the node IP, because of course it's kube-proxy itself that initiates the connection to the backends. So this mode is not recommended anymore — I'm pretty sure if some of you are running Kubernetes today, you're not running in that mode. It's pretty much deprecated, and it's definitely not recommended anymore.

The current default implementation, since Kubernetes 1.2, is iptables, which I'm going to talk about right now. So the first mode was userspace, and the second is iptables, and this one is the default. Once again, if you're running kube-proxy, it's very likely this is the mode you're currently using, and it's the one used in most managed offerings — if you look at GKE or EKS, for instance, on GCP and AWS, this is the mode they use.

In iptables mode, we still use iptables for redirection, but we're not redirecting traffic to kube-proxy: we're directly using iptables to redirect traffic to the backend pods. You can see in this example that we're sending traffic to a virtual IP, and this virtual IP is DNATed to a pod IP. When the traffic comes back, it hits conntrack and the NAT is reversed. It looks pretty simple when you see it that way; it's actually a bit more complicated.

The design is this one. Basically, same as before, kube-proxy hooks into the PREROUTING chain, and into the OUTPUT chain for local traffic, and everything is sent to a chain called KUBE-SERVICES. That chain does the same type of matching as before: it matches the cluster IP and the service port, and sends traffic to a chain for the service itself. What's important here is that you have a new chain for every service — of course, if you have a lot of services, you're going to have a lot of chains, and your iptables configuration is going to be kind of a mess. In the per-service chain, you have one rule for each backend, and this is where things start to get a bit hacky: as you can see here, kube-proxy is using the statistic iptables module to randomly send traffic to the backends. The way it does it is, each rule has a probability of applying, and if the rule applies, traffic is routed to a pod.
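For our three-pod echo service, a reconstruction of these chains might look like this (the real chain names carry hashes; these names and IPs are made up):

```sh
# Hypothetical reconstruction of iptables-mode rules.
# Entry point: match traffic for the echo cluster IP, jump to its service chain.
iptables -t nat -A KUBE-SERVICES -d 10.0.0.10/32 -p tcp --dport 80 \
  -j KUBE-SVC-ECHO

# Service chain: pick one of the three backends with the statistic module.
iptables -t nat -A KUBE-SVC-ECHO -m statistic --mode random --probability 0.33333 \
  -j KUBE-SEP-POD1
iptables -t nat -A KUBE-SVC-ECHO -m statistic --mode random --probability 0.5 \
  -j KUBE-SEP-POD2
iptables -t nat -A KUBE-SVC-ECHO -j KUBE-SEP-POD3

# Endpoint chain: the actual DNAT to a pod IP.
iptables -t nat -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.5:8080
```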
When you read this for the first time, it's not very intuitive, because instead of seeing one third, one third, one third — which is what you would expect with three backends — it's one third for the first rule, because there's one chance out of three that the rule applies; but if that rule is not matched, you only have two backends left, so it's 50% for the second one, and the last rule catches the rest.

Finally, you have a chain for each endpoint of a service, and this chain is just responsible for doing the NAT itself: modifying the destination IP to use the pod IP. It's also doing something that's a bit surprising when you look at it the first time: for traffic that is sent back to the pod it came from — hairpin traffic — we need to use SNAT. Let's look into that, because this is a bit complicated. Imagine I'm in pod 1 and I want to access a service that's backed by pod 1 and other pods. If I just do DNAT, then after the DNAT the traffic has the same source IP and destination IP, and of course this will not work. So in that case, iptables also does source NAT to the host IP, so that the traffic can be reverse-NATed on the way back.

Another thing you can do with Kubernetes services — people tend not to use it much, and I think that's a good thing, because it's a bit complicated too — is session affinity. What you may want for the application is for all traffic from the same source IP to go to the same backend, because you have some sort of affinity constraint. As you can imagine, doing persistence with iptables is not easy. The way this is done is using the recent module, which people usually use for security, to avoid port scanning or DDoS. What happens is, when you DNAT traffic to a pod IP, you also insert the source IP into a specific recent set, named after the endpoint. Then the next time you hit the service chain — you remember, the chain where you had all the load-balancing rules — in addition to the load-balancing rules, there are rules at the beginning of the chain that check the source IP against the sets that exist, and if it matches, traffic is sent directly to the matching endpoint, without going through the load-balancing rules. This works, but it also feels a bit hacky, and it's very difficult to expire connections and do these kinds of things.

In terms of limitations: while using iptables to do load balancing is kind of a surprising idea, it works surprisingly well, and most people are running Kubernetes this way. It's very hard to debug, though. Very quickly you have tens of thousands of iptables rules, and it's very hard to understand. To be honest, the last time I was running iptables mode on a mid-size cluster of about a thousand nodes, we had I think 50,000 iptables rules, and when you want to debug it and see what's happening, it's pretty tricky.
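If you want to get a feel for this on your own nodes, a quick and rough way to count the NAT rules kube-proxy manages (assuming iptables mode) is something like:

```sh
# Count kube-proxy's NAT rules on a node; on a large cluster this can
# easily be in the tens of thousands.
iptables-save -t nat | grep -c '^-A KUBE-'
```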
Worse, there's actually a huge performance impact, both on the data plane and on the control plane. Here I'm quoting from a very interesting talk from KubeCon Berlin 2017 on the impact of iptables on routing performance. What happens is, if you have a lot of rules to traverse before getting to your endpoint, it takes time to go through all the rules in the chains, and as you have more backends and more services, this can actually take quite some time. You can see in this extreme example, with 50,000 services, that at one point it's taking about seven milliseconds just to get to the right chain. So you want to establish a connection, and just going through the iptables chains takes you a few milliseconds.

And the worst part is actually not the data plane, it's the control plane. The way kube-proxy works in iptables mode, every time there's a change in the endpoints of services, it recomputes the full set of iptables rules — which, as I was saying before, can be tens of thousands of rules — and does a single atomic reload of all these rules. That's fine on small and mid-sized clusters, but on large clusters this very easily takes a few minutes. If you look into the issues on the Kubernetes GitHub, you're going to see that many people are getting timeout errors, because by default kube-proxy times out after five minutes when trying to reload all the rules, and it's very easy to go far above five minutes. You can see in this example that with five thousand services it takes more than ten minutes, and if you reach twenty thousand services, kube-proxy becomes completely unusable: in this example here, it took them five hours just to reload one set of rules. As you can imagine, on large clusters endpoints tend to move quite a lot, and you would want things to be updated in a matter of seconds. Five hours is definitely too long.

And this gets us to IPVS. The idea of IPVS is: IPVS is a load balancer built into the kernel, so of course it's designed for load balancing, and the way it works is actually pretty logical. For each service you have a virtual server in IPVS, backed by real servers. When you initiate a connection, your application just sends traffic to IPVS, IPVS selects a backend, and traffic is routed to that backend. I'm also mentioning the IPVS connection tracking here because, as you'll see later, we've had some issues with it.

When a pod is deleted, it's removed from the real servers — you can see here that backend X is no longer a real server for service S. What happens in that case in the kernel is that traffic on established connections to this backend is dropped, which is not ideal. The way we address it with IPVS is to use the sysctl that I put at the top, net.ipv4.vs.expire_nodest_conn, which makes sure that any new packet on such a connection cleans up the IPVS connection entry and triggers a reset, to notify both ends. So that works pretty well, but it's still not ideal.
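Expressed with ipvsadm (kube-proxy actually programs IPVS directly via netlink, and these IPs are made up), the setup for the echo service looks roughly like this:

```sh
# One virtual server per service, here with round-robin scheduling.
ipvsadm -A -t 10.0.0.10:80 -s rr

# One real server per endpoint, in NAT (masquerading) mode.
ipvsadm -a -t 10.0.0.10:80 -r 10.244.1.5:8080 -m
ipvsadm -a -t 10.0.0.10:80 -r 10.244.2.7:8080 -m
ipvsadm -a -t 10.0.0.10:80 -r 10.244.3.2:8080 -m

# The sysctl mentioned above: make packets to a removed real server expire
# the IPVS connection entry (and trigger a reset) instead of being dropped.
sysctl -w net.ipv4.vs.expire_nodest_conn=1
```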
It's still not ideal because this means we don't have any kind of graceful termination. Imagine you're currently connected to an HTTP server and downloading data: if the pod moves to the Terminating state, what you would usually expect is for the connection to finish cleanly and things to continue fine afterwards. But in the first IPVS implementation, what kube-proxy was doing was just removing the real server, so the connection would be abruptly cut. This wasn't an issue with iptables, because even if you remove the iptables rules, your entry is still in conntrack, so traffic keeps flowing after the backend is removed.

Addressing this took some time, but we got an implementation of graceful termination in Kubernetes 1.12. The way it works, if you're familiar with IPVS, is pretty logical: when a pod is set to Terminating, we update the weight of the real server to zero, so no new connection will be established to this backend, but established connections still work. This works a lot better. For garbage collection it's actually easy: there's a thread running every — every few seconds, every minute, I think — that looks at the number of connections for each backend, and when this gets to zero, the backend is removed. That's perfectly fine, because as pods go down they send a FIN and the connections are removed.

There's one small issue, which is: what happens if the backend actually crashes, so you don't get a FIN and the connection is not properly shut down? In that case you have the typical connection-tracking issue, which is not specific to IPVS: the application won't notice that the backend is gone until either the entry expires in the connection table, or it retries sending packets until it decides the connection is dead — which by default takes about 15 minutes on most Linux installations. If you encounter this kind of issue, what we recommend in the kube-proxy IPVS community is to lower the TCP retries setting, to detect that the connection has failed much earlier.

IPVS connection tracking has been a bit complicated. I'm sure most of you know all the issues you can have with conntrack on Linux; IPVS has its own connection-tracking system, and it's much less granular than the standard conntrack, and you can see that the default timeouts are pretty high. The main issue we've seen with this is the UDP timeout, which is set by default to five minutes. This was very bad, especially for DNS traffic, because as you can imagine, any DNS query gets an entry in the connection table, and it's kept for five minutes. That's one of the reasons we decided to disable graceful termination for UDP and only do it for TCP: for UDP, as soon as the backend is removed, we just remove the real server. It's not perfect, but it's much better and it solves a lot of issues.

One of the things we've added very recently to the IPVS implementation is an easy way to set the timeouts in your configuration. We're probably going to change the default timeouts soon, but we've been very careful about doing that, because we don't want to break existing installations.
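To make the mechanics concrete, here's a sketch, with made-up addresses, of the graceful-termination trick and the tunables mentioned above (the kube-proxy flags are, I believe, the recently added ones; values are illustrative):

```sh
# Graceful termination at the IPVS level: weight 0 means no new connections
# are scheduled to this real server, but established ones keep flowing.
ipvsadm -e -t 10.0.0.10:80 -r 10.244.1.5:8080 -m -w 0

# Detect a crashed peer faster than the ~15 minute default (tcp_retries2=15).
sysctl -w net.ipv4.tcp_retries2=8

# Recently added kube-proxy flags for the IPVS timeouts (illustrative values).
kube-proxy --ipvs-tcp-timeout=900s --ipvs-tcpfin-timeout=120s --ipvs-udp-timeout=30s
```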
One of the main things we plan to do is probably change the default UDP timeout to 30 seconds, because the largest use case by far for UDP connections in Kubernetes environments is DNS traffic, and a 30-second timeout is much more than enough. Ideally, what we want is to be able to set the weight to zero when a backend pod enters the Terminating state, and only remove it when the pod is deleted, but the current Endpoints API doesn't allow this.

Very quickly, in terms of status for IPVS: it works pretty well at large scale. We've been running IPVS on very large clusters for some time, and it's been working well for us; we didn't encounter any of the scale issues I was mentioning before. Be careful, though: it's not at a hundred percent feature parity with the iptables implementation, which is still the reference one, and it took us some time to tune all the IPVS parameters to make it better. But we're getting there.

A few things I wanted to mention that are common challenges for kube-proxy, regardless of the implementation. The first one is the scalability of the control plane. You remember from before that the interaction of kube-proxy with the rest of the cluster goes through the Endpoints object. The thing is, every time there's an update to a service — a new pod, a pod becoming ready or not ready — the full Endpoints object is recomputed, and this full object is sent to all kube-proxies. That's fine if your Endpoints object is small enough, but it gets bad as your endpoints get big. If you have, as in my example, a service with 2,000 backends, each node will receive 200 kilobytes of traffic for any given update. That's quite a lot, and of course the API server needs to send this information to all the nodes in the cluster, which means, in that example, about a gigabyte of traffic. But it's worse if you're doing a rolling update, which means replacing all the pods in your 2,000-backend service: one by one, they're going to be deleted and replaced, which means at minimum you're going to do what I described before 2,000 times, for 2,000 backends — which means you're going to send two terabytes of traffic. That's of course quite a lot.

This has been addressed very recently in Kubernetes with EndpointSlices. The idea is that instead of using a single Endpoints object for large services, the endpoints of a service can be spread across multiple slices, with a maximum of a hundred endpoints per slice, and you only need to synchronize the slice where an endpoint change happened, which is much more efficient. This is still in beta, but it has been available since Kubernetes 1.16.
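A sketch of what an EndpointSlice looks like for the echo service (IPs and the generated name are made up; at the time of the talk the API group was discovery.k8s.io/v1beta1):

```yaml
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: echo-abc12                       # hypothetical generated name
  labels:
    kubernetes.io/service-name: echo     # ties the slice back to the service
addressType: IPv4
ports:
- port: 8080
  protocol: TCP
endpoints:
- addresses: ["10.244.1.5"]
  conditions:
    ready: true
- addresses: ["10.244.2.7"]
  conditions:
    ready: true
```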
So it's still pretty recent. Another very common issue we've seen with kube-proxy implementations is the size of the connection-tracking table. All the current kube-proxy implementations rely on connection tracking, which means that if you have services that get a lot of traffic, you create a lot of connection entries. It's especially bad for DNS: usually in Kubernetes you have a service providing DNS to the cluster, backed by pods, and for each query you create entries in the conntrack table. If you run Kubernetes at large scale, what's going to happen is that you fill the conntrack table and drop DNS queries, which is not great.

There are different ways to address that. A very common way, which is becoming a standard, is to use a node-local DNS cache on every node: when you do queries, you first hit the local cache, and upstream queries use TCP, which is much more efficient. This is what's being used in most new setups today.

Another thing is that we had a very good surprise with kernel 5.0. In this example here, you can see we have a set of containers backing the DNS service, and we were very surprised that the number of entries in the conntrack table varied a lot — we had a bimodal distribution of the number of conntrack entries across the nodes providing DNS. We discovered that on a 5.0 kernel, the number of entries in the conntrack table was much lower than on 4.15, and the reason is these two commits here, which optimize the way the kernel does connection tracking for UDP in kernel 5. This is much better: as you can see, the number of entries in the conntrack table was divided by two thanks to these two commits, so that was great news.

I'm almost done; I just wanted to mention some very recent features of kube-proxy. The first one is dual-stack support: Kubernetes as a whole has been supporting IPv4 and IPv6 together, in alpha, since 1.16, and of course kube-proxy is doing it too. The most recent feature is support for topology-aware routing, which will allow you to connect in preference to local pods — local meaning on the same node, or in the same zone or the same data center — instead of load balancing across all the pods. This is still in alpha, but pretty promising.

In conclusion: kube-proxy is working pretty well at very significant scale. There's still a lot of effort going into it, because, I mean, this talk was all about the problems we're trying to solve with kube-proxy, but we're getting there. The main issue is that iptables and IPVS are not a great match for the service abstraction, because they were not designed for client-side load balancing. I mean, they work, but they were definitely not designed to do that, so we can make them work, but it feels hacky, and we've had a lot of issues. A very interesting alternative in terms of implementation is eBPF-based load balancing — I think some of you were there earlier to see the talk by Daniel Borkmann on how they implement the service abstraction in Cilium. This is very promising, because the eBPF-based implementation was designed from the ground up to work with Kubernetes, and it's very efficient. Actually, at Datadog, we're moving to this implementation instead of kube-proxy in most of our clusters.

And that's it, thank you very much.

And we're unfortunately out of time, so questions will have to be outside the room — sorry. I'll be around if you have questions.