Hi everyone, this is Pauline and Dave talking from KubeCon about securing your cluster-to-cluster traffic the agnostic way. Dave and I work for Workday, which is a provider of human resources and financial systems. A couple of years ago, Workday decided to expand its offering by deploying Workday on the public cloud, and for this we built a platform based on Kubernetes. My colleague Dave is part of the team that is building that Kubernetes platform for Workday.

Thanks, Pauline. Hey folks, I'm Dave Care. I'm a software engineer in DevOps on the Scylla public cloud team here at Workday. The Scylla team is focused on building Workday's Kubernetes-based cloud platform, and you can usually find me doing things in the gaming, tabletop, and CTF space. If you want to reach out or just talk, you can find me at DaveCare95 on Twitter.

And I'm Pauline. I'm an engineer in the public cloud infrastructure team, after spending a number of years working in the same team as Dave. When there is no pandemic, you can find me in a karaoke room, and you can reach out to me on Twitter at pylalin.

So what to expect from this presentation? We will give you a use case for provider-agnostic cluster-to-cluster communication. We'll give you an overview of a solution to secure your cluster-to-cluster communication across providers. We will tell you the story of how Workday incrementally implemented that solution. We'll give you an overview of the custom Kubernetes resources and operators developed in order to implement it. And at the end of the slide deck, you will find links to go further by yourself. Technologies discussed in this talk include Kubernetes, of course, but also Helm, Envoy, Consul, AWS, and Istio. But first, let's see why we need cluster-to-cluster communication with my colleague Dave.

Thanks, Pauline. So, cluster-to-cluster communication: why do we need it? Well, let's look at a sample service. Its name is Ox. It's a service living in cluster 1 in data center 1, and sometimes it wants to send messages to its family and friends in cluster 2 in DC 2. In the old days, Ox would use the public-facing load balancer to reach its family and friends. But as we know, securing public gateway ingress is hard, and teams had to maintain complex configurations for these cluster communications. This was not sustainable for the platform and infrastructure engineering folks, who needed to maintain these complex configurations for any new providers and new regions we might need to onboard. So what would make the service teams and the platform and infrastructure engineering folks happy? Being able to deliver secure, easy-to-configure cluster-to-cluster communications.

So let's look at some of the use cases. At Workday, we have use cases where services may need to move data between different services in different Workday clusters. These services may be deployed in specific clusters based on their workload type, cluster type, or environment type, or they may even be running in Kubernetes clusters not running on the Kubernetes platform at Workday, but managed by those teams in their own cloud providers. A sample use case here is moving tenant data between Workday instances. The service needs to be able to move tenant data between these instances, and this relies on internal APIs that we shouldn't be exposing on the public gateway. To talk a bit more about this, I'm going to have Pauline look into this legacy use case. Thank you.
So let's have a look at how cluster-to-cluster communication worked in the legacy world. As I mentioned, we had communications that were private between data centers, but they were set up ad hoc and they didn't really scale, because as you added more data centers to the Workday ecosystem, we needed to whitelist many more IPs than we used to. This was okay when we had a limited number of data centers, but as we opened new regions and new clusters in the public cloud, it became harder to manage. Additionally, we were using a public-facing load balancer for both public communication and internal communication. That load balancer is one that we call the legacy load balancer internally, because we want to retire it in favor of Istio. The reason we want Istio is to prepare us for the next level of scale at Workday. There are many more reasons why we chose Istio to replace the legacy load balancer, and if you want to learn more about the implementation of Istio for public ingress at Workday, there is a link at the end of this slide deck to another talk of ours where we go a lot more in depth into why we chose Istio. But first, my colleague Dave will tell you a bit more about Istio.

Thanks, Pauline, for covering that use case. So, what is Istio? Istio is an open-source, platform-independent service mesh that provides traffic management, policy enforcement, and telemetry collection. It was developed by Google, IBM, and Lyft, and it's a control plane for Envoy proxies. It lets you manage your networking resources as a set of Kubernetes custom resource definitions that deliver solutions for ingress, egress, and in-mesh traffic. And the Istio control plane itself runs inside the Kubernetes cluster.

As I mentioned, Istio is a control plane for Envoy proxies. Envoy is a high-performance proxy designed for cloud-native applications. It was developed by Lyft, it's open source like Istio, and it's cloud agnostic. In Istio's deployment of Envoy, the proxies, also known as Istio proxies, are deployed as sidecar containers alongside your main application container. In your own workloads, you may be deploying sidecar containers for things such as metrics, monitoring, or logging; you can think of the Envoy proxy, or Istio proxy in this case, as a sidecar for networking, managing all of your main application container's inbound and outbound traffic. (A small sketch of how that sidecar gets attached follows below.)

So why did we pick Istio and Envoy? As you can see from this CNCF survey from 2020, Istio is one of the most used service mesh solutions being run in production today by many large companies. And as I previously mentioned, Envoy is the underlying service proxy for Istio, so it only made sense from a development standpoint to stick to a single technology, reducing the number of technologies an engineer is required to know in order to run and maintain the platform. So that all sounds great, but how does the little ox actually use these technologies to send messages to its family located in cluster 2, in data center 2?
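As a small illustration of the sidecar model Dave just described: with Istio's automatic sidecar injection, labeling a namespace is enough for the Envoy proxy to be added to every pod scheduled there. A minimal sketch, where the namespace name is purely illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: oxen                   # illustrative namespace for the ox service
  labels:
    # Istio's mutating webhook sees this label and injects the Envoy
    # ("istio-proxy") sidecar into every pod created in this namespace.
    istio-injection: enabled
```

Pods created in that namespace then have their inbound and outbound traffic transparently routed through the injected Envoy container.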
But when the communication is coming from a valid workday servers and the public load balancer, we'll know to allow this communication and we'll service it. When we moved to Istio, we found ourselves in a situation where internal URIs were either blocked for everyone or allowed for everyone. There would have been a way to say that a simple of workday servers are allowed to query this internal URI and no one else. But as we knew from experience with the legacy load balancer, maintaining that configuration takes a lot of time, takes a lot of resources, and we did not want to carry over this piece of legacy configuration. Additionally, using the same gateway or load balancer for both public communication and private communication is something that we wanted to address. Instead of using the public gateway for private communication, we decided to implement a second Istio gateway specifically for private communication so that we have a public gateway for all public communications that everyone can reach and the private gateway, which is not exposed publicly, which is only accessible for a subset of the workday services. The way the communication works is that we were using AWS private link. AWS private link is an AWS service for securing communication between two AWS data centers. So we would have a server located in a cluster in AWS data center one, sending traffic to AWS data center two via the AWS private link. So finally, the little ox can talk to its family in AWS. But this talk is called securing your traffic the agnostic way. And now we're talking about using a service which is available in AWS and nowhere else. So if you were thinking that this talk is fake news because we are not going to be agnostic at all, you would be correct. However, we use AWS private link as a stepping stone toward a truly agnostic infrastructure. And if you be with us, you will learn how we got to retire the AWS private link in favor of another solution, which is truly agnostic. On to you, Dave. So as Pauline mentioned, the name of this talk indicates that we should be showing you a cloud agnostic design for closer to closer communications. But of course, we had to start somewhere. And that use case was to cover AWS to AWS connectivity. For this, we developed the AWS private link operator. AWS private link operator is a Golang operator built on top of the operator framework and delivers two custom resource definitions, private link egress and private link ingress. We also have a service, the private link operator that lives and runs inside the Kubernetes cluster and watches for creations of either of these resources. Once they are created, the necessary AWS resources are realized within the AWS data center that the cluster lives in. So how do teams go about delivering these cluster to cluster communications? Well, one of the key things here was that we ensure that service onboarding is as easy and as automated as possible. So we don't want to have to be including platform engineering and infrastructure folks every time we need to deliver one of these service to service communications configurations. And we also want to be sure that teams can deliver this easily. So we ensured that they use a mixture of helm template configurations, as well as per environment configurations. So how do teams deliver that private ingress, private link ingress configuration that we saw earlier? 
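Before Dave walks through the example, here is a hedged sketch of what such a PrivateLinkIngress resource might look like. The API group, version, and field names are assumptions reconstructed from the description that follows; only the resource kind, the name, the namespace, and the idea of an authorized-accounts list come from the talk:

```yaml
# Hypothetical sketch of Workday's internal CRD -- not a published API.
apiVersion: privatelink.workday.example/v1alpha1   # group/version are assumptions
kind: PrivateLinkIngress
metadata:
  name: cluster2-ingress-private-link
  namespace: private-links
spec:
  # Accounts allowed to establish a PrivateLink connection into this cluster.
  # The field name and the ARN value are illustrative placeholders.
  authorizedAccounts:
    - "arn:aws:iam::111111111111:root"
```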
Well, service teams deliver that as a raw or templated Kubernetes resource, where they just specify the Amazon Resource Name of the account where the cluster that they wish to allow ingress from lives. So in this example, you can see they define a PrivateLinkIngress, they give it a name, in this case cluster2-ingress-private-link, and they give it a namespace, private-links. Then, down below in the authorized accounts section, we provide the AWS ARN, or Amazon Resource Name, of the account where cluster 1 lives. So in this case, we're allowing ingress from cluster 1 into cluster 2.

So now the ox knows how to send messages to its family, but how can it be sure that they receive them? Well, introducing the private gateway that Pauline mentioned earlier. Teams deliver custom Istio virtual service configurations, which are responsible for the routing for their service. Service teams have the flexibility to add and remove routable paths and request headers with a single pull request, which is later delivered to production. And they follow an existing PrivateLink fully-qualified-domain-name template that we provide for them. This is easy to test with standard Kubernetes tool chains, such as kubectl and Helm. If you're not aware of what an Istio virtual service is, it's essentially a resource that describes how network traffic should be routed around your cluster. In this example, we provide a name, oxen-private-comms-virtual-service; we provide the host following that template, so in this case the FQDN, or fully qualified domain name, associated with the PrivateLink is oxen.workday-privatelink.com; and it's deployed on the private gateway, as Pauline mentioned earlier. Then we allow two routable paths, /internal-api and /secret-ox-family-store. So if a service in cluster 1 wants to route to oxen.workday-privatelink.com followed by one of those routable paths, the request will be sent on to the destination service in cluster 2, which in this case is the ox-family service living in the oxen namespace in that cluster. (A sketch of such a virtual service follows a little further below.)

So, did the ox service mention that their family is also quite large and that they can be spread across many data centers? Sometimes they may not even know if they live in an AWS data center or not. We have covered how to secure your communication from an AWS data center to another AWS data center. But Workday's strategy is to be multi-cloud, and securing the communication from AWS to AWS is a good start, but it's not sufficient. For example, we may need to secure the traffic between Workday's own data centers and AWS. And you may have seen in the news that a strategic partnership has been signed between Workday and GCP, so in the future it's possible that we'll also need to secure communication between Workday's own data centers and GCP, or even between AWS and GCP. So how do we do that?

Well, for this, we are leveraging the power of Consul. Consul automates networking for simple and secure application delivery. It was developed by HashiCorp, which joined the CNCF foundation just last year. It uses Envoy as its proxy, just like Istio. It offers a solution for a multi-platform service mesh, which is very important to us because we want to be able to deploy the same mesh everywhere. And it runs inside or outside the Kubernetes cluster.
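Here is the virtual service sketch promised above. The VirtualService schema is standard Istio; the gateway name and namespace, the destination port, and the exact host and service names are assumptions based on the example in the talk:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: oxen-private-comms-virtual-service
  namespace: oxen
spec:
  hosts:
    - oxen.workday-privatelink.com    # follows the PrivateLink FQDN template (illustrative)
  gateways:
    - istio-system/private-gateway    # gateway name and namespace are assumptions
  http:
    - match:
        # The two routable paths allowed for this service.
        - uri:
            prefix: /internal-api
        - uri:
            prefix: /secret-ox-family-store
      route:
        - destination:
            host: ox-family.oxen.svc.cluster.local
            port:
              number: 8080            # illustrative port
```

Adding or removing a routable path is then just another entry under `match`, which is what makes the single-pull-request workflow possible.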
From our first solution, which was based on AWS PrivateLink, we keep the routing of the traffic within the cluster using the Istio private gateway, and we keep Helm for packaging and delivering configuration for internal traffic networking. We introduce a Consul cluster, which runs outside the Kubernetes cluster, in order to secure data-center-to-data-center communication.

So why did we choose Consul for this? Well, it's cloud-independent, and as I mentioned, we want to be able to deploy the same solution in AWS and in Workday's own data centers, and if necessary in GCP or other cloud providers. The fact that it's independent from Kubernetes is an advantage for us, because we have applications which aren't running on Kubernetes, or aren't running there yet, or never will be, but which need to be accessible, and we wanted them to be part of the mesh. The fact that Consul runs outside the Kubernetes cluster allows us to link applications running outside Kubernetes to the applications running within Kubernetes. (A sketch of registering such an outside-the-cluster service follows below.) And finally, Consul uses Envoy, which is the same proxy as Istio. For us that's very important, because even though we use Istio within the Kubernetes cluster and Consul outside it, at its core it's still an Envoy-to-Envoy communication; we are not changing the proxy. It also means that some of the experience we have from running Envoy with Istio can be reused to run Envoy with Consul. We are running Envoy everywhere, which is an advantage for us.

There are other advantages to implementing a solution that is platform agnostic. Services may be exposed from any data center: if you have a service running in one data center and you need it to be available everywhere, then without data-center-to-data-center communication that service may need to be deployed in every single data center in order to be available. Data-center-to-data-center communication allows you to deploy a service in a subset of your data centers and have it available everywhere. You may link services offered by different cloud providers: say you have a service running in AWS that you want to use from another cloud provider, it's no problem. And finally, it's easier to deploy more services across data centers or providers, since you are basically deploying the same solution everywhere. If we had remained with AWS PrivateLink, then we would have needed to achieve the same result differently in our own data centers and in whichever other public cloud providers we are interested in. Instead, we are using the same solution everywhere, which makes it easier to deploy.

Some notes about Consul and Istio. It's entirely possible to achieve what we're doing with Consul using Istio, and the other way around: we could have used Consul to achieve what we're doing with Istio. However, in our setup, Istio sits a lot closer to the application level than Consul does, and there are some Istio features that we wanted to make available to the service teams. Those features may have been available in Consul, but with a more difficult setup; Istio makes it easier for us to offer those features to the service teams. Likewise, when we were reviewing the documentation for data-center-to-data-center connectivity, we found that Consul had better documentation when it comes to running it outside the Kubernetes cluster.
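As the sketch of the outside-the-cluster case mentioned above: a service running on a plain VM can join the Consul mesh through a local Consul agent with a Connect sidecar. This uses Consul's standard service-definition format; the service name and port are illustrative:

```hcl
# Service definition given to the Consul agent running next to the app on a VM.
service {
  name = "ox-family"   # illustrative service name
  port = 8080          # illustrative port

  connect {
    # Registers a sidecar proxy for this service so it can participate
    # in the mesh alongside the Kubernetes workloads.
    sidecar_service {}
  }
}
```

The Envoy sidecar itself can then be started with `consul connect envoy -sidecar-for ox-family`.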
So we decided to go with Consul for cluster-to-cluster connectivity. And finally, they both use Envoy, so even though we are using different technologies to deploy Envoy, it's still Envoy running everywhere. In summary: Istio is our choice for our Kubernetes platform; it's easy to implement in Kubernetes, and it comes with a lot of interesting features for the service teams deploying on the Kubernetes platform. Consul we use to secure DC-to-DC communication. Consul comes with a mesh gateway whose only purpose is to secure data-center-to-data-center communication, so this DC-to-DC communication is a core use case for Consul.

When you deploy Consul, you will be deploying a number of components. You have a service catalog, which is offered via a fleet of Consul servers. You have an ingress gateway, which intercepts the traffic entering the Consul mesh. You have a terminating gateway for the traffic leaving the Consul mesh, for example the traffic going to the Istio private gateway. And you have a mesh gateway for data-center-to-data-center communication. When service 1 in DC 1 needs to talk to service 2 in DC 2, what happens in Consul is that the Consul agent queries the Consul server to find out where the service is. If that service is found within the same data center, the query is routed via Consul's ingress gateway. If the service is not found within the same data center, the query is routed to the Consul mesh gateway. The mesh gateway knows where services are located, and it routes the traffic based on the server name it gets by sniffing the SNI header, which is part of the TLS handshake. The traffic is routed to another mesh gateway using mutual TLS to establish trust, and the mesh gateways do not decrypt the traffic at this stage. (A sketch of how proxies can be told to use the mesh gateway follows below.)

In order to leverage the mesh gateway, we have created another operator, called the comms operator. The comms operator looks for custom Kubernetes resources of type private ingress or private egress, and when it finds a resource, it creates the relevant service definition within the Consul catalog and the relevant Istio resources to allow for proper routing of the service within the Kubernetes cluster.

If we look at the bird's-eye view of how cluster-to-cluster communication works: we have Ox in data center 1, which is sending queries to Ox in data center 2. The query is routed within the cluster via the internal Istio gateway. The Istio gateway sends the traffic to the Consul mesh, the Consul mesh sends the traffic to the mesh gateway, which sends the traffic to the Consul mesh gateway located in DC 2. From there, it reaches the internal Istio gateway of the Kubernetes cluster living in data center 2, where it is eventually routed to the Ox application living in that cluster in DC 2.

Okay, so now the ox has the ability to contact its family and friends regardless of what data center they live in: public cloud, private cloud, or even a cloud provider we don't support yet. All they have to do is deliver a simple configuration to realize secure cluster communications, regardless of where their family or friends might live. I think we all agree that's pretty great.
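As promised above, here is a sketch of how sidecar proxies can be told to send cross-data-center traffic through the mesh gateways. This uses Consul's standard proxy-defaults configuration entry, applied with `consul config write`; the choice of "local" mode is illustrative:

```hcl
# proxy-defaults.hcl -- applied once per data center with `consul config write`.
Kind = "proxy-defaults"
Name = "global"

MeshGateway {
  # "local": upstream traffic destined for another data center is sent
  # through this data center's mesh gateway first.
  Mode = "local"
}
```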
So, in summary, in order to implement secure cluster-to-cluster communication across data centers: we implemented a global addressing scheme, so that teams have a conventional DNS name to address each other. We implemented a private Istio gateway for routing the traffic between clusters. We implemented a Consul mesh for securing data-center-to-data-center communication. The teams deploying their applications on Kubernetes can deliver their configuration via Helm charts. And we have created operators to dynamically create, tear down, and update ingress and egress as required. Thank you. Thank you.