All right, hello and welcome to Administering Multi-Cluster Service Meshes Securely. I'm Aitan Yarmush, an architect at Solo.io. And I'm Eric Murphy, a field engineer, also at Solo.io.

Before we jump in, let's quickly go over what we're going to discuss in this talk. First, we're going to review multi-cluster service mesh and how it has evolved, starting with single-cluster service mesh and jumping off from there. Next, we're going to discuss the limitations and security issues with the current approach. Third, we're going to explain how we at Solo.io solved those problems. And fourth, we're going to give a demo of how to deploy securely using GitOps.

So before we jump into multi-cluster service mesh, let's quickly look back at single-cluster service mesh. This is a very simple single-cluster example: we have a couple of different workloads running in the cluster as well as our service mesh. In this case the service mesh is Istio, but it doesn't have to be. As you can see, all of the workloads, including the Istio control plane, communicate directly with the Kubernetes API server. The control plane talks to the API server to get endpoint information for all the workloads, as well as service info and other metadata about the cluster.

Let's quickly review the single-cluster scenario, what the benefits are, and why it's simpler. The control plane and the data plane both live on the same cluster. Authentication with the API server is handled automatically by Kubernetes: service account tokens are generated and mounted automatically, and the tokens remain safely inside Kubernetes (more on this later), so there's most likely no need to revoke them. And lastly, configuration can easily be housed in a single repository, leading to simple, easy-to-understand GitOps workflows.

Now that we've covered the single-cluster Istio scenario, let's jump up to a multi-cluster scenario and begin to explain the complexities and security implications involved. As you can see, we've taken our single-cluster diagram and multiplied it by five. In this scenario, all of the control planes still need access to the cluster they're running on, but in order for a multi-cluster service mesh to be useful, the Istio control planes also need to be able to read and write to the other clusters. What does that look like? We've drawn up an example: now every Istio control plane has access to every other cluster's Kubernetes API server.

What does this mean in practice? It means that every control plane needs credentials to access those other clusters. So, as we mentioned earlier, the Kubernetes credentials need to leave the single-cluster boundary. So, multi-cluster service mesh: the control plane and the data plane are spread out across many clusters, and each control plane needs access to every other cluster. That means we have on the order of n-squared connections, where n is the number of clusters. Authentication is not handled automatically for our remote control planes, so service accounts must be created, and kubeconfigs containing the service account tokens must be distributed to each cluster. These kubeconfigs, which must leave cluster borders, give potentially dangerous levels of access to the clusters they point at.
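To make that concrete, here is a rough sketch of the kind of credential that ends up being copied between clusters in this flat multi-cluster model, similar in shape to what Istio's remote-secret mechanism produces. The secret name, label, server address, and field layout are illustrative rather than taken from any particular cluster; the point is simply that an entire kubeconfig, including a long-lived service account token, has to live outside the cluster it grants access to.

```yaml
# Illustrative sketch of a remote-cluster credential in the flat model
# (roughly the shape produced by `istioctl create-remote-secret`).
apiVersion: v1
kind: Secret
metadata:
  name: istio-remote-secret-cluster2   # one of these per remote cluster, per control plane
  namespace: istio-system
  labels:
    istio/multiCluster: "true"         # marks the secret as a remote API-server credential
stringData:
  cluster2: |                          # the value is a complete kubeconfig
    apiVersion: v1
    kind: Config
    clusters:
    - name: cluster2
      cluster:
        server: https://cluster2.example.com:6443
        certificate-authority-data: <base64 CA bundle>
    users:
    - name: cluster2
      user:
        token: <service account token>  # the credential that must be rotated everywhere if it leaks
    contexts:
    - name: cluster2
      context:
        cluster: cluster2
        user: cluster2
    current-context: cluster2
```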
If any of those kubeconfig credentials were to leak, replacing and redistributing them is a convoluted process. Keeping all of these kube credentials in sync across all the different clusters is very difficult and not easily done using GitOps. And on top of this, in deployments with many clusters, you potentially have many control planes performing I/O against each API server, which can slow it down.

So the approach I just mentioned, with on the order of n-squared connections, is one way of doing it. That was the Istio way of doing it previously, but a new way is starting to emerge which we think is a lot better: the external control plane. In this scenario, the service mesh control plane runs on a management cluster and manages a number of clusters from there. This way, there doesn't need to be a control plane on each cluster that needs access to each and every other cluster. There are, of course, downsides to this scenario: if the management cluster were to go down, the control plane would also go down, so it's important to build in redundancy for the management cluster. But you reduce the number of places that need the kube credentials, as well as reducing the load on the API servers.

So with the external control plane, the control planes are now on a single cluster and the data planes are still spread out. We like to call this the push model of configuration. There's one centralized control plane with access to all data plane clusters, so that's on the order of n connections, and the centralized control plane pushes config out to the remote clusters. The control plane still needs access to the API server of each remote cluster, and although the number of network edges has shrunk, the same problems from the previous model still exist, albeit on a smaller scale: it's still hard to distribute the credentials, and leaked credentials can give bad actors outsized power.

So what are our main concerns about administering multi-cluster Istio, or any service mesh, securely? One, it's difficult to store and sync all of the kube credentials using traditional GitOps workflows. And two, giving a cluster read-write access to many other clusters leaves a potentially large security threat.

How can we get around some of these issues? The way we thought about it is something called a pull-based method of configuration. If you noticed, in the previous diagram all of the arrows were pointing out to the remote clusters, whereas now all of the arrows are pointing in at the management cluster. So it's still an external control plane, but it's pull-based this time. Now all of the remote clusters are actually creating the connections to the management plane, so the management plane doesn't need access to any of the remote clusters, and neither does the service mesh running there.

What benefits does this bring us? With this pull model of configuration, as I mentioned, there's one centralized control plane, and it needs access to zero data plane clusters. All the data plane clusters connect to the control plane cluster and receive updates that way. The config updates function similarly to the Envoy xDS protocol; for those unfamiliar, we'll go over that in a minute. Leaked credentials no longer give the same direct control over the API server, and furthermore, no Kubernetes credentials need to be shared outside of cluster borders.
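To put rough numbers on the three models, here's a back-of-the-envelope count of the cross-cluster API-server credentials that have to be handed out, using the five clusters from the diagrams (the exact figure depends on whether the management cluster also runs workloads):

\[
\underbrace{n(n-1)}_{\text{flat multi-primary}} = 20,
\qquad
\underbrace{n}_{\text{push model}} = 5,
\qquad
\underbrace{0}_{\text{pull model}},
\qquad n = 5 .
\]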
And since we're no longer distributing these credentials, the remote components we are distributing can easily be stored and distributed via GitOps.

So how does this really work? Well, we took heavy inspiration from Envoy's xDS protocol and the Kubernetes bootstrap token, which I'll explain now; the links are available on the slide. The way the xDS protocol works is, essentially, Envoy makes a connection to the control plane, the management server, and says, hey, I want some config updates. And the management server says, all right, great, I've got that for you. So each remote instance, each Envoy, is responsible for creating the connection, and the management server is responsible for updating Envoy with config when necessary. The Kubernetes bootstrap token is a way of verifying the identity of different callers; more on that in a second.

So how does it really work in our case? The server is our management plane; that's the component living in the management cluster. The agent is the component that gets deployed to each remote cluster, each data plane cluster. We start off with a CA, or signing cert, on the server, and a server cert on both the agent and the server; the server cert is used to establish the initial TLS connection. In addition, we distribute a bootstrap token to both clusters, and that token is included in the initial request so that the server knows to trust the agent. Once it knows it can trust the agent, it responds with an mTLS cert along with an identity, and this mTLS cert is how the server knows what data to send back to the agent.

Once that identity has been provided, we can open our config streams. The first one is a remote resource stream, which streams resources from the data plane clusters, from the agent to the server, so that the server can make decisions about the system. The second streams the mesh config, the config needed to run the data planes, back out to the agent clusters.
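Concretely, the bootstrap material described above can be pre-created on both sides as two ordinary Kubernetes secrets. This is a minimal sketch with illustrative names and namespace (they line up with what you'll see in the demo later), not an exact copy of any product's manifests:

```yaml
# Root CA used by the agent to validate the management server's TLS certificate.
# Present on BOTH the management cluster and every remote cluster.
apiVersion: v1
kind: Secret
metadata:
  name: relay-root-tls-secret
  namespace: gloo-mesh
data:
  ca.crt: <base64 root CA certificate>
---
# Shared bootstrap token, also present on both sides.
# The agent sends it on its first request; once the server verifies it,
# it issues the agent its own mTLS client certificate and identity,
# and the two config streams are opened over that mTLS connection.
apiVersion: v1
kind: Secret
metadata:
  name: relay-identity-token-secret
  namespace: gloo-mesh
data:
  token: <base64 random bootstrap token>
```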
So can the pull model really solve all of our problems? Short answer: no. But it does alleviate the two main issues we began discussing a few slides back. The first was that giving a cluster read-write access to many other clusters leaves a potentially large security threat. Since our control planes no longer need direct access to remote API servers, the Kubernetes credentials no longer need to leave cluster boundaries. This reduces the chance that a bad actor gets elevated access to the system, and it reduces the load on the API servers by reducing the number of clients. The second was that it's difficult to store and sync all these kube credentials using GitOps workflows. Luckily, there's no longer any need to store or sync kube credentials at all, and on top of that, the components that do need to be synced, mainly the agent and the token, integrate easily with GitOps. See part two of the talk.

So what were those question marks we were talking about earlier? Well, it just so happens that we at Solo.io have created a product called Gloo Mesh which handles this, but more on that later. For now, I'm going to kick it over to Eric.

Thank you, Aitan. So next, I want to talk about how you can deploy Gloo Mesh securely across multiple clusters using GitOps and Argo CD. I have a demo I want to show you today, and in that demo I have two clusters running. On the management cluster, I have Argo CD deployed, and I'm going to use it to deploy configuration from a Git repository and automate the deployment of the Gloo Mesh management plane. On the remote cluster, I have a separate instance of Argo CD running, and I'm going to use configuration there to deploy the agent automatically.

There are multiple benefits to this approach: you can treat configuration as code and store everything in your Git repository, and you get a repeatable deployment process for both the Gloo Mesh management plane and the agent. This lets you automatically update your Gloo Mesh deployments with that configuration through Argo CD. Additionally, if you need to spin up new clusters and deploy new meshes, you can do it in an automated way, more quickly and more effectively. And in a disaster recovery scenario, where you want to be able to replace your clusters and your meshes as quickly as possible, you can do that with this approach too, even including the management cluster.

Okay, so now let's jump into the demo. What I'm going to show here is how to actually deploy Gloo Mesh using Argo CD. To do that, I have some configuration checked into my Git repository; everything is deployed from that repository using Argo CD. In the repository I have CRDs that are specific to Argo CD. This is an Application CRD, and inside of it you can define values for a Helm chart. So behind the scenes, Argo CD is using Helm to deploy the Gloo Mesh management plane and the agent. In this case I'm using the gloo-mesh-enterprise chart, and I'm setting a license key and some other values, such as selfSigned set to false. That means I'm providing the certificates in advance of the installation; I'm not depending on Gloo Mesh to generate certificates on its own. Then, if I go over to this other Application CRD, I can see the Helm chart for deploying the agent. Here the chart is enterprise-agent, and I'm giving it a server address for the management cluster, an authority for the certificates that are used, and a cluster name, which is simply remote-cluster.
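To give a rough idea of what those two Application CRDs look like, here is a trimmed-down sketch. The repository URLs, chart versions, and exact Helm value keys are illustrative placeholders rather than the precise contents of the demo repo:

```yaml
# Argo CD Application for the Gloo Mesh management plane (illustrative values).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gloo-mesh-enterprise
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc        # the local (management) cluster
    namespace: gloo-mesh
  source:
    repoURL: https://storage.googleapis.com/gloo-mesh-enterprise/gloo-mesh-enterprise  # assumed Helm repo
    chart: gloo-mesh-enterprise
    targetRevision: 1.x.x                         # placeholder version
    helm:
      values: |
        licenseKey: <license key>
        selfSigned: false                         # certs are provided up front, not generated
  syncPolicy:
    automated: {}                                 # Argo CD keeps the release in sync
---
# Argo CD Application for the agent, applied on the remote cluster (illustrative values).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: enterprise-agent
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc        # the remote cluster's own API server
    namespace: gloo-mesh
  source:
    repoURL: https://storage.googleapis.com/gloo-mesh-enterprise/enterprise-agent      # assumed Helm repo
    chart: enterprise-agent
    targetRevision: 1.x.x
    helm:
      values: |
        relay:
          serverAddress: <management relay address>:9900   # where the agent dials in
          authority: enterprise-networking.gloo-mesh       # expected TLS authority (SNI)
          cluster: remote-cluster                          # the name this cluster registers as
  syncPolicy:
    automated: {}
```

With something like this checked in, the only imperative steps left are the kubectl commands you're about to see; Argo CD pulls the charts and reconciles everything from there.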
Okay, so now let me walk you through the steps for actually conducting the deployment with Argo CD. First, I'm going to switch my kubectl context over to the management cluster. Next, I'm going to create a new namespace for Gloo Mesh on that management cluster. Then I'm going to kick off the deployment of the Gloo Mesh management plane using Argo CD, and I can do that by simply doing a kubectl apply on that Application CRD. Okay, so that's done.

Now I can go over here and log in to the Argo CD deployment on my management cluster, and you can see how quickly that was deployed: everything is synced and healthy already, in a matter of seconds. If I click in here, I can see everything that was deployed for the Gloo Mesh management plane. There are a lot of different things deployed here, and everything is green, deployed, and healthy, including all the pods for the dashboard and enterprise networking. So that all looks good. As you can imagine, the Gloo Mesh management plane is a little bit complicated, because it manages multiple clusters with multiple meshes, so there are a lot of capabilities built into it.

Then, if I step out here and look at the rest of my application deployment, I can see the configurations for the Gloo Mesh management plane. Here I can see that there is a root certificate stored in a secret. I can see that there is a TLS certificate for securing the connection between the remote cluster and the management cluster. There is also a signing certificate stored in a secret, which allows the Gloo Mesh management plane to generate new certificates that can be passed to remote clusters to establish an mTLS-secured connection. There is also a token that was created; I provided this token in advance. That same token is deployed with the Gloo Mesh agent, and it is used for making that initial connection from the remote cluster to the management cluster, as was discussed previously. Finally, I have a configuration called KubernetesCluster, and that simply defines the remote cluster that is going to be registered by the agent. I'll walk through how that works.
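For reference, the KubernetesCluster object mentioned above is a small custom resource; a minimal sketch might look like this (the API group and version are from Gloo Mesh's multicluster API and may differ between releases, so treat the exact strings as assumptions):

```yaml
# Registers the remote cluster with the management plane by name.
# The name must match the cluster name the agent was installed with.
apiVersion: multicluster.solo.io/v1alpha1
kind: KubernetesCluster
metadata:
  name: remote-cluster          # same value as the agent's cluster name
  namespace: gloo-mesh
spec:
  clusterDomain: cluster.local  # the remote cluster's internal DNS domain
```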
Okay, so let me go over here to my console, and I'm going to run the meshctl command. That lets me see what is currently running and what is currently registered in the Gloo Mesh management plane. As I mentioned before, there is a KubernetesCluster CRD, and you can see that the cluster is there. But if I go in here, there are no meshes discovered yet, and that's because the agent has not yet completed registration with the Gloo Mesh management plane. That is going to be the next part of the demo.

So let me go back over here. What I'm going to do is change my kubectl context to the remote cluster, create a gloo-mesh namespace there as well, and deploy that other Application CRD, the one specific to the Gloo Mesh agent. Let me paste that in. There we go. Now let me go over and log in to the other instance of Argo CD; this is not the same Argo CD, this is the one deployed on the remote cluster. I'll click sign in, and here you can see that everything needed for the Gloo Mesh agent is deployed. That also happened very quickly, in a matter of seconds. You can see the enterprise-agent pod running right here; it's green and healthy, so everything looks good for the Gloo Mesh agent deployment.

Then if I go over here and look at the rest of my deployment, I can see the actual configurations residing in that repository and deployed for Gloo Mesh. If you look here, there is a secret deployed with that root certificate, and there's also the identity token secret. That is the same token that was deployed on the management cluster, and once again, that token is used to authenticate the remote cluster with the management cluster and establish the initial connection with the Gloo Mesh management plane. When that happens, the agent receives back an mTLS certificate and a new connection is established, so there is a fully secure connection for the Gloo Mesh agent.

One thing I want to call out: just for demo purposes, I'm checking the certificates into Git, including the root certificate. That is not best practice. Instead, you could use external secret management, integrate with HashiCorp Vault, or integrate with your cloud provider's certificate management services. So once again, that's just for demo purposes.

And by the way, all of this is stored in the Git repository, as I mentioned. If you go in here, you can look at these configurations for the Gloo Mesh agent: you can see the token and the token value right here, you can see the root TLS secret, and you can see the CA cert right here. All of that is deployed using that GitOps approach.

Okay. So since we deployed the agent, it should be registered. Let's go back and look at the Gloo Mesh console again. You can see now that under Meshes I actually have a mesh that's registered: it's istiod, istio-system, remote-cluster right here, and the mesh health is green. I can also see that there are some destinations that were automatically discovered on that remote cluster. Keep in mind, this Gloo Mesh console is running on the management cluster. If I go in here and look at the mesh details, I can see that there is a destination for the pet store application. I deployed that pet store application in advance of this demo; it is part of the Istio service mesh on that cluster, and the sidecars were automatically injected into the pods. That is how the pet store was automatically discovered by Gloo Mesh.

So that wraps up the demo. Let me go back to the slides and finish up here. Gloo Mesh is a really awesome product for providing that multi-cluster, multi-mesh management, but Solo offers other products that integrate with Kubernetes and service mesh. We have an API gateway product called Gloo Edge that offers integrations with Istio. We also have Gloo Portal, which provides a developer portal, a user interface, and API specification discovery, and it integrates with Gloo Edge and also Istio ingress. And finally, we offer WebAssembly Hub, a place where you can build WebAssembly extensions for the Envoy proxy, publish them, and deploy them; it's compatible with both Istio and Gloo Edge.

So with that, thank you for watching today, and if you have any questions, please feel free to reach out. Thank you.