So, thank you very much for coming. I'm Ryota Sawada. This talk is about multi-cluster observability with service mesh, and that is a lot of moving parts. Before going into the details, I want to go through the agenda. We'll start with an introduction about myself, about UPSIDER, and about this talk — what this talk is about, and what it is not about. Before going into too much detail, we'll also look at observability with service mesh: what it looks like with a standard setup, which is just a single cluster, and then we'll go into a multi-cluster setup. We'll cover multi-cluster challenges, then quick reviews of each project I'll be using, and then solutions and a demo. After the demo, I'll have a bit of a recap, alternatives to consider, and other considerations that I couldn't touch on in this talk.

So, first, an introduction of myself. I'm Ryota Sawada, a platform engineer at UPSIDER. I'm passionate about learning and sharing all things technology. This is my first time speaking at an in-person event, and my first in-person KubeCon. It's nerve-wracking as hell, but hopefully things go okay. I have a son and two cats, and you can find me at rytswd — which is my full name without the vowels — in the usual places.

About UPSIDER briefly: it's a fintech startup providing B2B payment solutions, adopted by 15,000-plus companies. The headquarters is in Tokyo, Japan, and the business is mainly in Japan, but the remote team is all around the world — I myself am based in London, and we have a few folks with us here today. We have been a cloud native end user since early 2019, and we use a multi-cloud, multi-cluster setup for production. That's why I wanted to give this talk. And we are hiring, so if you want to know more about us, please come talk to me.

So, about this talk — a little bit of prerequisites. I want you to at least have a basic idea about observability, what it is about — metrics, traces, logs — and about service mesh, what a service mesh can provide: traffic management, security, and so on. But this talk is mainly about the metrics aspect of observability, and about service mesh in terms of observability. This talk is about service mesh behaviors in multi-cluster scenarios, observability challenges, and our learning at UPSIDER — and it is a lot of moving parts. This talk is not an Istio deep dive, and it's not a Prometheus deep dive either, and there is no silver bullet.

I had a couple of questions that I wanted to ask after this slide, or a few slides later. Who uses service mesh, or who has used service mesh in some way, for testing? Okay, that's great — about 50%. Who uses service mesh in production? Oh, quite a lot of hands, that's amazing — probably about 40% or so. Who has a multi-cluster environment — doesn't have to be with service mesh, just multi-cluster? Quite a few. I'm not sure if I can give you anything new in this talk, but who uses Istio? Who doesn't use Istio? Okay, I'm really confused about what this audience is like — so many Istio folks, so many non-Istio folks — but I hope I can show the slides and tell you something about service mesh and observability. But what else can I talk about? Maybe a bit about myself, about my company. My cats? Did I hear that? Sorry?
Oh, you want to know more about my cats? My cats are seven years old — I think six years old — domestic British cats; I guess that's something I'm very proud of. I have an Instagram for them that I never use. So what can I talk about without a slide? All right, back to it.

So, this is not a silver bullet setup; I just want to show one solution, or a few solution cases, and it's about understanding what challenges there will be. And you can find — I just created this repository, so I hope it's available — the GitHub repository under my username, rytswd, and kubecon-eu-2023.

So, observability with service mesh. The standard setup is just one cluster with a service mesh — but what does observability really look like? We'll focus on a single cluster first: we'll deploy Istio and Prometheus, visualize the connections, and see how it really looks. So let's see what deploying the Istio control plane looks like. Looking at cluster 1 at the bottom, you can see that there are only three application containers, and with the Istio control plane, the Istio sidecar gets injected. We might have to restart a few things, but that's the idea. The Istio control plane then connects to the sidecars, and each sidecar gets routing details, credentials, and metrics. With that, the Istio sidecars — the data plane — can connect to each other. That's the white lines here: the Istio sidecars are connected to each other, so the services can finally talk to each other and also expose the metrics.

And let's see what deploying metrics — deploying Prometheus — looks like. Deploying Prometheus would have the Istio sidecar injected in this case. And what does it look like with Prometheus scraping from each Istio sidecar, the Envoy sidecar? The orange lines are added, and you can see there's quite a bit happening in the diagram. And let's also add an ingress gateway. An ingress gateway doesn't have to be multi-cluster related; it can just be ingress from outside the cluster. If you add that, it's quite a bit of moving parts, isn't it?

So let's get into the multi-cluster scenario a little bit more. Let's add another cluster, and deploy a separate Istio and Prometheus. This is only one pattern out of many, and it's definitely not the silver bullet solution. But why do we do it? We get better HA and higher uptime with multi-cluster, and we can do multi-cloud, multi-managed Kubernetes. If you have only a single cluster, you can only do so much, but with multiple clusters, you can do a few things more. One of those things is that you can migrate workloads from one cluster to another while you're doing maintenance on one cluster, or maybe you want to create a new cluster, or you have some regional setup, things like that. One of the things we do at UPSIDER is that we also separate clusters based on sensitive information management: we have a separate cluster for specific regulatory and audit requirements. That's just one of the ways that multi-cluster can help.

Okay, multi-cluster — what does it look like? Let's deploy the Istio control plane and Prometheus together this time. Not so different from the previous diagram; it's essentially just sidecars added and Prometheus added — in this case, another Prometheus in cluster 2. And deploying the Istio ingress gateway — what does that look like? We definitely need that for multi-cluster in an Istio environment.
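Going back to the Prometheus-scrapes-sidecars part for a moment: a rough sketch of what that can look like with the Prometheus Operator is a PodMonitor against the Envoy sidecars' stats endpoint. This is simplified from the PodMonitor shown in Istio's Prometheus integration docs; the selectors and intervals here are illustrative, not the exact configuration used in the demo.

```sh
# Scrape every Istio/Envoy sidecar's Prometheus stats endpoint via a PodMonitor.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: istio-system
spec:
  namespaceSelector:
    any: true                      # scrape sidecars in every namespace
  selector:
    matchExpressions:
    - {key: istio-prometheus-ignore, operator: DoesNotExist}
  jobLabel: envoy-stats
  podMetricsEndpoints:
  - path: /stats/prometheus        # Envoy's Prometheus stats endpoint
    interval: 15s
    relabelings:
    - action: keep
      sourceLabels: [__meta_kubernetes_pod_container_name]
      regex: "istio-proxy"         # only keep the sidecar containers
EOF
```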
And let's connect cluster 1 and cluster 2 in this diagram. These particular lines are only for cluster 1 to cluster 2, and you can imagine what the connections would look like from cluster 2, having the same connections back to cluster 1 — it was just too many moving parts. I'll just show this; it's a bit too much even for me. It was so much work just to get this drawn, and I didn't have the time to do cluster 2.

So, multi-cluster challenges. It's a lot of moving parts, and there are a few things I want you to take as the key takeaways: the multi-cluster challenges and what to think about in multi-cluster scenarios.

First of all, inter-cluster metrics. Traffic leaves from cluster A to cluster B, A to cluster C, cluster 1 to 2, 2 to 3 to 4, and each cluster can have its own observability setup. You don't have to — you can; it's just a choice. But who will be responsible for the inter-cluster metrics? If traffic goes from cluster A to cluster B, is that cluster A's problem? Is that cluster B's problem? Who needs to gather those metrics? And how can we merge metrics from a number of clusters? In this diagram, it's just two clusters. But if you have ten clusters, or hundreds of clusters, how do you make sure that you can actually merge them and use them in a meaningful way?

Another challenge is cardinality. Cardinality is not a challenge specific to multi-cluster environments — it's more of a challenge for observability as a whole. But service mesh definitely makes it easy to get detailed metrics, and that's a great aspect of having Istio and other service mesh solutions. There will be more of those metrics if you have more clusters, obviously — that's an easy guess — and an observability dashboard will be difficult to manage if you have that much data. So that's another thing to consider. Maybe with a single cluster and a single solution it's not so complicated, but cardinality can become a difficult aspect as well. Give me one second, okay.

And retention is the part I'll be looking at a little more closely. The latest metrics can be fetched from running services, like Prometheus in this diagram, but metrics need to be retained for meaningful historical analysis and investigations. Where and how should we store that data? Should we store it in a single location? Or should we store it next to each cluster? What kind of mechanism should we deploy? And if the cardinality is too high, data retention costs can easily go up as well.

And data source. Because it's a multi-cluster scenario, one service could be deployed in a number of clusters. You might have application A deployed in clusters 1, 2, and 3, and clusters 4, 5, and 6 running a different application. You need to differentiate which cluster has which application. So the dashboard needs to be aware of this — and I think dashboards are part of observability. If you were to have a dashboard, you have to track the data source for all the metrics, making sure you know where each metric is coming from. Metrics deduplication is something you can work out, but if you don't track the data source, you might deduplicate in the wrong way and lose data. So it's one of the things you need to consider.

And now a quick review of the key projects. We'll be looking at Istio.
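One common way to handle that data-source question is to stamp every series with its cluster of origin before it leaves the cluster, for example via Prometheus external labels, so that dashboards and deduplication logic can rely on that label. A minimal sketch follows — the label names and values are illustrative, not taken from the demo repository.

```sh
# Fragment of a per-cluster Prometheus configuration: every series exported
# from this cluster (e.g. over federation or remote write) carries its origin.
cat <<'EOF' > external-labels-fragment.yml
global:
  external_labels:
    cluster: cluster-1   # unique per cluster
    network: network1
EOF

# A central view can then aggregate or compare across clusters explicitly:
#   sum by (cluster, destination_service) (rate(istio_requests_total[5m]))
```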
I saw quite a few folks here using Istio, and some other solutions as well. We'll be focusing on Istio just because at UPSIDER we are using Istio and have been using it for a while. A few things to focus on: mTLS is a key part of multi-cluster; various multi-cluster models are available for Istio, but we'll be looking at just one specific model because we don't have the time to go through all of them; and sidecar versus ambient is something that's pretty hot right now, but I'm not going to be touching on that.

Prometheus is another project that most of us probably use day to day. It's a CNCF project, graduated in 2018 — the second project to graduate, following Kubernetes. We'll need to look at the remote read and write APIs, which I'll touch on a little more, and the Prometheus Operator is something I'll be showing a little more as well.

And Thanos is the one I'll be focusing on. It's not the only solution for sure, but it's one of the easier ways to get started with long-term storage for Prometheus. There is the sidecar approach and the receiver approach, and we'll be looking at one specific way of using Thanos.

So let's get into the solutions and the demo. The overview of the solution I'll be demoing: we'll have three Kubernetes clusters, and those will be kind clusters. Istio I'll be deploying with manifests — I'm not going to be using istioctl or other tools like Helm, just because that's how we manage the Istio installation at UPSIDER. We'll be looking at multi-primary on different networks; there are a few other multi-cluster models, which you can check out in the official Istio documentation, but we will be looking at multi-cluster with multi-primary on different networks. Strict mTLS is something we will use, and we'll be using the sidecar deployment model — I mentioned ambient, and that's something I'll touch on a little later. Also, with Istio you can get started with the demo profile to get a feeling for what Istio can do, but we'll be using a non-demo profile so it can give you an idea of what a production setup would look like. For Prometheus, we'll be using the Prometheus Operator to deploy Prometheus instances, federation is one aspect I'll be looking at, and remote write will be used as well. And for Thanos, we'll deploy using Helm and use the receiver rather than the sidecar.

So, demo time. As I said, it's the same address: github.com/rytswd/kubecon-eu-2023. That's the repository with all the details. I haven't had the time to really refine it all the way through, so I'll probably be working for the rest of KubeCon to make sure all the materials are there — bear with me for now. So I think — no, what can I — can I use this mic so you can hear me still, yeah? Okay, great, thank you. In my repository, I'm going to run demo.sh, so let me just go back to GitHub. This should be public: kubecon-eu-2023. It's got a few things, and I'll probably keep working on the demo steps and all the details, but for today, because of the time limitation, I'll be running through with a script to make it easier to understand and easier to walk through — I don't have to copy-paste everything.
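As a quick reference for the strict mTLS setting mentioned above: mesh-wide strict mTLS in Istio boils down to a single PeerAuthentication resource in the root namespace. Here is a minimal sketch, applied per cluster — not necessarily the exact manifest in the demo repository, and the kind context name is an assumption.

```sh
kubectl --context kind-cluster1 apply -f - <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the Istio root namespace, so it applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plain-text traffic between workloads
EOF
```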
So, with this repository set up, I'm going to move back to my terminal — is that too small? That's too small, isn't it? Is this big enough? Yeah? All right. We're going to start with demo.sh, and the script is going to show me all the comments about what it does and what it will try to do — simple commands: make a directory and cd into it, and this one actually already exists. So I'm going to quickly show what I did before this: I created the Kubernetes clusters — the kind clusters are already created using another script called demo-kind-prep.sh. So I created the temporary directory and created the kind clusters. That's something you can do if you want to follow along, but because of the time limitation I'm just going to move on to demo.sh, which is about installing all the pieces of multi-cluster observability.

So let's see what we've got in the directory. We've got a kubeconfig, some cluster definitions — cluster setup definitions for kind — and a MetalLB setup, which is just making sure kind can work like a production cluster. Not really production, but I think you know what I mean. With that, let's just copy the demo repository from the tar.gz, and once that's in place, I don't have to do all the git and curl anymore.

The first step is to create the CA certificates. As I said, mTLS is an important aspect of multi-cluster. With cluster 1 and cluster 2 — even if we only had two clusters — you still have to make sure that each cluster can securely connect to the other. So managing the CA certificates is one of the key aspects. Before getting into the installation, we just make sure that everything is prepared. The first step is to copy the scripts provided in the Istio repository — Istio has this tools set, and I just copy that and make use of it. It's a Makefile, and I just use make -f with that Makefile. If I do that, I get the root certificate created, and then I create three certificates for the three clusters. Simple enough — just run the make command three times with different names, and it takes a few seconds to finish. Now we have all the certificates; you can follow along to see what kind of files are created — it should be really simple to follow.

Because we need the certificates in place before the Istio installation, I create the namespace for Istio first, and also make sure that the secrets are stored. Very simple — a few files involved: the root certificate, the certificate chain, and a CA certificate for each cluster. You can see what the certificates look like in each cluster: very similar, but obviously a different cluster means a different certificate, with the same root certificate. And we can see that the cacerts secret is created in the istio-system namespace, and now we're ready for the Istio installation.

First of all, I said that I'm not going to be using istioctl, and that's because I wanted to make sure that something you can take away from this talk is: go to the repository, and you can see the actual manifests rather than some magic happening behind the scenes. So what I do is copy the repository — I'm just taking the manifests directory, and manifests/istio/installation has all the files. With those files — oh, sorry, before that, I need to label the istio-system namespace with the network topology.
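For those following along, that certificate step roughly boils down to Istio's self-signed CA Makefile plus a cacerts secret per cluster, along the lines of the Istio documentation. The cluster and context names below are assumptions about this particular setup.

```sh
# Generate a shared root CA, then an intermediate CA per cluster, using the
# Makefile shipped with Istio under tools/certs.
make -f Makefile.selfsigned.mk root-ca
make -f Makefile.selfsigned.mk cluster1-cacerts
make -f Makefile.selfsigned.mk cluster2-cacerts
make -f Makefile.selfsigned.mk cluster3-cacerts

# Each cluster gets its intermediate CA as the "cacerts" secret in
# istio-system, which istiod picks up at install time.
kubectl --context kind-cluster1 create namespace istio-system
kubectl --context kind-cluster1 create secret generic cacerts -n istio-system \
  --from-file=cluster1/ca-cert.pem \
  --from-file=cluster1/ca-key.pem \
  --from-file=cluster1/root-cert.pem \
  --from-file=cluster1/cert-chain.pem
```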
As I said, I'll be looking at three different clusters on different networks — multi-primary on different networks is what I'm focusing on. So I need to make sure the namespace labeling is in place before installation, and that's part of the official documentation as well. Simple enough: just label the istio-system namespace with each cluster's network topology.

Now we can install istiod — that means the Istio control plane — with kubectl apply, using the context of each cluster and pointing at the istio/installation/istiod manifests that you can find in the repository. This runs really fast, but you can see quite a few errors happening. That's because the CRDs are part of the same set of definitions, so there's a bit of a race condition; if I apply twice, it resolves — I'll come back to that question later. And you can see the resources are created. There was a question about istioctl, and I'll touch on that a little later.

Next, install the Istio gateway in each cluster. This one also has the configuration already in place — the manifests are all in istio/installation/istio-gateway — and we just create them. This one is pretty straightforward: all the CRDs are already in place.

And we move on to establishing the multi-cluster connection. First of all, copy the gateway resources that we'll be using for each of the clusters — it's under manifests/istio/usage this time — and then we'll look at cross-network-gateway.yaml as well. It's a simple file with just a couple of things Istio needs so that the multi-cluster communication can work. The Gateway resource is created with this in each cluster.

Then we move on to creating the remote secrets. For a remote secret, you have to set up the connection between clusters, but you do it one by one: cluster 1 connecting to cluster 2, cluster 2 connecting to cluster 1, and so on. This particular one is cluster 1 to cluster 2. There was a question about istioctl — I'm not using istioctl because there is a lot happening behind the scenes, and while it feels really nice for getting started with Istio, what I wanted to show is what actually happens behind the scenes — not to just rely on the tools, but to have a bit of understanding of what a service mesh really looks like. So this one also uses kubectl. You can use istioctl to create the remote secret, but because I have the kind clusters with the kubeconfig you saw in the prep script, we have the Kubernetes configuration in a file. So we use kubectl create secret, following the same convention as istioctl create-remote-secret, creating the secret and also annotating and labeling it. That's what istioctl does for you. You would probably not use raw kubectl for everything in production, but maybe it's interesting to see what actually happens behind the scenes.

And we do the same for cluster 2 to cluster 1 — so you can see the pattern: cluster 1 to cluster 2, then cluster 2 to cluster 1. Same thing for cluster 1 to cluster 3. As I said, there are three clusters, so cluster 1 can now talk to cluster 3 with this remote secret, and cluster 2 can also talk to cluster 3 with its secret. There you go.
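To summarize those multi-cluster connection steps in one place, here is a rough sketch of what they amount to, following the Istio multi-primary-on-different-networks documentation. The context names, network names, and the gateway selector label are assumptions about this particular setup, not the repository's exact files.

```sh
# Tell Istio which network each cluster is on.
kubectl --context kind-cluster1 label namespace istio-system \
  topology.istio.io/network=network1

# Expose services across networks through the cross-cluster gateway with TLS
# passthrough on port 15443 (selector must match the deployed gateway's label).
kubectl --context kind-cluster1 apply -n istio-system -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cross-network-gateway
spec:
  selector:
    istio: eastwestgateway
  servers:
  - port:
      number: 15443
      name: tls
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH
    hosts:
    - "*.local"
EOF

# What istioctl would do for you: hand cluster 2 credentials to reach
# cluster 1's API server for cross-cluster endpoint discovery.
istioctl create-remote-secret --context=kind-cluster1 --name=cluster1 \
  | kubectl apply --context=kind-cluster2 -f -
```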
And I'm not going to do cluster 3 out to clusters 1 and 2, because you can omit it — you can say that cluster 3 doesn't need to talk to clusters 1 and 2. It really depends on the business use case. So for that reason I'm just going to skip it, to show that it's possible. You might have data protection requirements, like us, where you want a specific connection one way but not both ways.

So the next step is Prometheus, not Istio — installing the multi-cluster Prometheus setup is next. First of all, create the monitoring namespace. It can be named anything; I'm just naming it monitoring for now — I think it's one of the easiest namespaces to understand. Then label the namespace so that Istio knows about it — so Istio knows to inject the sidecars into each of the pods there — labeled with istio-injection. If you use revisions, you might have a different label, but this should be straightforward; it's the simple setup for now. And copy all the Prometheus-related definitions from the repository once again.

The Prometheus Operator is something we'll be using, and we're using a customized build. The Prometheus Operator is usually very simple to install, so you don't have to do this customized build; but in this specific case, I'm installing the Prometheus Operator into the monitoring namespace rather than the default namespace. If you use the Prometheus Operator's defaults, it goes into the default namespace. That might work for you, it might not — maybe that's something you can check out. It's very simple, but I just wanted to make sure that nothing gets in the way in the default namespace. So that's applied to cluster 1, cluster 2, and cluster 3.

That was just deploying the Prometheus Operator; now I need to deploy Prometheus itself. I'm starting with the Prometheus for collecting Istio metrics — and I say that because I'll be deploying two different types of Prometheus. The first is the Prometheus for Istio metrics collection, which will be directly connecting to Istio, and I set that up with the Prometheus Operator, using its CRDs. Then I deploy the Prometheus for federation, which helps with the cardinality and with the retention setup — it's another Prometheus CRD configuration. That's why I started using the Prometheus Operator: so I can have a similar setup for the Istio collection Prometheus and the federation Prometheus.

And now it's installing Thanos. Thanos, as I said, is the retention mechanism. In this case, I'm installing Thanos only into cluster 3, not the other clusters. Do I have time? Oh, I don't have too much time. So: helm install, directly from Bitnami, with the receiver set to enabled. Straightforward — just a helm install, very simple compared to the other parts. And if everything goes well, it's there. I didn't install Thanos in cluster 1 and cluster 2: the idea of this particular solution is to connect the Prometheus instances — especially the federation ones — to cluster 3. It's a bit difficult to remember which cluster is which, but cluster 3 is the only one with Thanos. This is to demo that the Prometheus in the other clusters can send all the metrics into a single observability cluster — a kind of central observability cluster. That's one way to manage a really large number of moving parts with a lot of metrics. So that's done with the demo script — let's see what the clusters look like.
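Roughly, the Thanos install and the federation Prometheus's remote write amount to something like the following sketch. The chart values, resource names, selectors, and the receive URL are assumptions, not the repository's exact configuration — in particular, the receive endpoint has to be reachable from the other clusters, for example through the gateway; 19291 with /api/v1/receive is the Thanos Receive default.

```sh
# Central cluster (cluster 3): Thanos with the receive component enabled,
# so remote clusters can push metrics to it over remote write.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install thanos bitnami/thanos \
  --namespace monitoring \
  --set receive.enabled=true

# Clusters 1 and 2: a "federation" Prometheus (via the Prometheus Operator)
# forwards what it collects to that receiver, tagged with its cluster label.
kubectl --context kind-cluster1 apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: federation
  namespace: monitoring
spec:
  replicas: 1
  externalLabels:
    cluster: cluster-1            # lets the central view tell clusters apart
  serviceMonitorSelector: {}      # selectors here are illustrative
  remoteWrite:
  - url: http://<thanos-receive-address>:19291/api/v1/receive   # placeholder
EOF
```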
So cluster 3 is the one that has Thanos in it. If I look at how the pods are doing, Thanos is just starting up. Meanwhile, let's look at cluster 1, which has the Prometheus Operator, the Prometheus Istio collector, and the Istio federation Prometheus as well. That federation Prometheus will be talking to the Thanos receiver, which is up and running in kind cluster 3 right now. So let's see what it looks like with the Thanos Query Frontend. What I'm going to do is port-forward to the Query Frontend specifically. It's a bit difficult to see, but 10902 is the port for the web UI, so I go to localhost:10902. If I query istio_requests_total, some metrics are coming through — if I execute this, you can see the metrics coming through. But there is a problem with this demo that I still need to work out: you can see the cluster showing up as cluster 3, I think, but I don't have anything coming from cluster 1 or 2. There's a bit of a service mesh setup problem that I ran into in the past few hours; I'll try to fix it by the end of KubeCon, but sorry about this. I knew the demo was not going to fully work — it's a lot of moving parts. But I think you get the sense that Prometheus and Thanos can definitely work together to help with this kind of metrics setup, and the idea of a central observability setup is something you might be able to use. It might not be the right solution, but it's one way.

So going back to — I don't have too much time — going back to the slides. A bit of a summary, a quick recap of what I've shown you with the demo. There are a lot of moving parts, and I couldn't really get hold of all of them — I basically failed with part of it. You know, I only had three clusters, and that was already a lot; it was a lot to go through all the steps. If you have more clusters, it will be really difficult to comprehend. Multi-cluster is definitely made easier by service mesh, but it doesn't solve everything. Istio certainly focuses on supporting multi-cluster scenarios in a few different ways, but observability is still complicated. Service mesh is only part of the story; it is not a solution by itself — you still have to configure quite a few things for observability. The solution with a central observability cluster is what we looked at, and it is one of many options for a multi-cluster environment, but it is probably the easiest to wrap your head around. On multi-cluster observability challenges, I'll be putting up more details, and you can check out the demo setup in the repository.

And just a few quick alternatives that I need to mention. Cilium, Envoy-based solutions, and Linkerd are definitely service mesh solutions that you can use; I was just most familiar with Istio, so that's what I stuck with. Thanos was just used for the demo — I'm not saying it's not production ready, it's definitely usable in production — but you can have a look at other options as well: Cortex, Grafana Mimir, and others.

Other considerations that I didn't have time to touch on: alert management is something that can be really difficult to manage. Alerts should generally be managed per cluster, but what happens if you have a central observability cluster? Do you want to have another setup there? Maybe that's another possibility you can consider.
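One hypothetical way to square that: if alerts are evaluated in the central cluster, the per-cluster external label can still drive routing in Alertmanager, so each cluster's owners receive their own alerts. This is only a sketch; the receiver names and label values are made up.

```sh
# Fragment of an Alertmanager configuration that routes by the "cluster" label.
cat <<'EOF' > alertmanager-routes-sketch.yml
route:
  receiver: default
  routes:
  - matchers:
    - cluster = "cluster-1"
    receiver: team-cluster-1
  - matchers:
    - cluster = "cluster-2"
    receiver: team-cluster-2
receivers:
- name: default
- name: team-cluster-1
- name: team-cluster-2
EOF
```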
Configuration management — you could see there were quite a few steps in the script, and maybe GitOps or similar solutions would be helpful. There are observability aspects I didn't touch on, and service-mesh-specific solutions like Kiali for Istio and Hubble for Cilium are definitely worth looking at. Multi-tenancy — there was a talk right next door, I don't know which way, about multi-tenancy, so if you're keen on multi-tenancy, please do have a look at that recording. OpenTelemetry is something I couldn't touch on. And sidecar versus sidecar-less solutions will be a big focus in the service mesh ecosystem, and there will be quite a few talks at KubeCon about that.

So with that, just a quick appendix. There are some talks about service mesh, OpenTelemetry, observability and all of that — maybe something you'd be interested in joining after this. And some references once again. I'll be putting up the link to the slides — the slides are available via dub.sh, kubecon-eu-2023-mco; MCO stands for Multi-Cluster Observability. And special thanks to the team with me from UPSIDER, helping with the presentation and the slide decks. With that, thank you very much for your time. And that's it — I think there is no time for questions. I'm not sure. Yeah, I don't think there was time for it. So thank you very much.