Hi everyone. I'm here to talk a little bit about a case study from the company I work at, called Lunar, and this is our adoption story of a service mesh in the financial services industry. So yeah: building a scalable, compliant, multi-cloud bank with a service mesh.

Just before that, I'll introduce myself. My name is Casper. I work as a lead platform architect at Lunar, but in my spare time I'm also playing around as a Cloud Native Computing Foundation ambassador. I've been in the meetup community driving meetups in Europe for quite a while, and I founded something we call Cloud Native Nordics, a meetup alliance across different meetup groups in the Nordics. In general I just try to build the community around cloud native. I'm also a Linkerd Ambassador, which I guess is a spoiler alert on the service mesh that we chose. The last thing about me: my way into this space was actually before it was called cloud native. I did a master's thesis in 2016 about Raspberry Pis and Kubernetes, and how universities could use Raspberry Pis to teach students about this new way of doing things.

A little bit about Lunar. We were founded in 2015 as a challenger bank, a fintech kind of company, and we've been venture-backed ever since, through a number of funding rounds. We now have offices around the Nordics, approximately 550,000 customers plus a business segment, and 650 employees. It's been growing pretty quickly, and you'll see a graph in a second. From the more technical side, we deploy around 50 changes to production on average every day. Compared to the banks in the Nordics we've been talking to, that's a pretty high number, because most of them are probably deploying once a week or once a month. We run around 400 microservices now, across three different clouds. That's the multi-cloud and multi-cluster aspect of it.

The scale and the growth: as you can see, we had exponential growth from 2019, both in customers and in employees. We went from around 150 employees in '19 to almost 700 now. We adopted Kubernetes back in '17, so we've been running it for quite a while, evolving along with the community and looking at all the different projects from the CNCF that we could use. There have also been a couple of acquisitions, which I'll touch on later, because they are part of the multi-cloud strategy we are working with.

But let's talk about how we actually started our evaluation of a service mesh. For us it was always about complexity compared to the value we were getting. Initially, as you saw on the graph, we weren't that many people, especially not in the platform area of the company; we were one or two people. So complexity was really something we needed to be able to handle as we grew, and it took a while before the feature set was there for us. What really tipped the scale was the multi-cluster aspect, and I'll come back to why that is a little later. But we basically evaluated service meshes for three and a half years before we actually adopted one.
We tried out the first Linkerd: Linkerd 1 with the per-node proxy, set up with DaemonSets, deployments and everything. When Istio came out, we tried that; it was a bit too complex for us at that time. Then Linkerd morphed into Conduit, and we tried that out, but we weren't really ready yet. We were still like one and a half people, as you can see at the bottom. Linkerd 2: yeah, still not enough people to handle the complexity of adding this to our stack. And bear in mind, these platform engineers were also focusing on Kubernetes, observability, and all the other stuff we had running.

So it wasn't until 2020, when we were in the process of actually becoming a bank, that things changed. There were requirements attached to getting a banking license, so we started looking at all the features again. Having mTLS and all these things was now something we needed to look at again, and the multi-cluster topic was also coming up for us; I'll show you that in a second. So which one to choose? Well, for us it was really about the simplicity of the mesh. With the number of people we had, simplicity was the deciding factor. At this point the Linkerd 2 proxy also had a lower resource footprint, so there was a little bit of a cost perspective to it too, but it was really about simplicity for us.

So, multi-cluster. The reason we started doing multi-cluster was that we had a setup with different environments where we were replicating our observability stack, including a lot of stateful, complex services that are pretty hard to run. It was pretty annoying to have that replicated across all the different environments, so we wanted to do something else. At this point we were also moving towards GitOps, storing everything in a Git repository, and we were experimenting with treating our clusters as cattle: being able to kill a cluster, spin up a new one, and recreate the state based on what was in Git, using Git as the source of truth.

But there was a problem with this. Our log system (it was Humio; I'm not sure if you're familiar with it, but that doesn't matter that much) used an EBS volume. The problem we saw was that when we created our failover cluster, we needed to recreate this data. So we were constantly taking snapshots of the EBS volume, but at some point we needed to restore a snapshot and hook it into the log management solution in the failover cluster, and we were losing logs. That wasn't acceptable for us, because all our audit logs, everything, flowed into that system. So we needed to find another way to centralize our log management, because that was what we wanted to do. If you want to know more about the failover cluster setup, my colleague Henrik did a keynote at the last KubeCon, in Valencia, about the failover story and how we do that. I think I have a link down there, so if you want to know more about that, check it out.
But what we really wanted to do was centralize and create this platform type of cluster, where we could run all kinds of platform-related things and make it easy for workload clusters to send log data to a centralized place, send metrics to a centralized place, and so on and so forth. So we started trying this out with a service mesh, connecting clusters across AWS accounts. We tried out both Istio and Linkerd at this point, and for us it took about an hour to get Linkerd up and running, which was pretty amazing: just to see, hey, now we're able to connect clusters across clouds. So we just used that.

Our initial adoption of the service mesh was basically this multi-cluster setup, where we put Fluent Bit into the service mesh along with the log management solution. From the point of view of Fluent Bit, as you can see up in the workload cluster, the communication is basically transparent: it just writes to a normal Kubernetes service in this setup, and the traffic flows through the multi-cluster control plane that I'll introduce in a second. One thing that was very important for us is that we needed one-directional links. We didn't want it to be possible to go into this centralized platform cluster and make requests the other way around, so that you could change things in production from a centralized place. We wanted it to be possible only to ship data in one direction, which was also pretty easy to do.

Another thing we've been focusing a lot on at Lunar is platform engineering: really making it easy for our developers to adopt things, be secure, be compliant, and follow all the best practices. Our platform mission is really to empower our developers; that's more or less what's written here. Build everything in as much as possible. We do that with a couple of tools we've created ourselves. Shuttle, up here, is open source, you can find it; it's a way for us to deal with centralized scripts and templating. It does a lot of different things that make it easy for developers, so that we as a platform team can provide compliance basically out of the box, and they don't have to do that themselves. I'll show you some examples in a second. The other tool I need to mention here is our release manager, which comes with a CLI tool called hamctl. It's not ham as in the meat: Ham was the first chimpanzee in space. It was a really bad name, and I always have to explain it when I talk about it, so it's kind of annoying. But anyway, it's called hamctl, and what it does is basically move artifacts from one place in a Git repository to another place in a Git repository. I'll show you a short picture here.

We start with shuttle. What it basically does for us is create an abstraction: every service at Lunar has a shuttle.yaml file that points to a centralized plan, as we call it, which stores all the different configuration we deliver as a platform team to the teams. We've built a lot of abstractions around that: how to get a database, how to get into the service mesh, all kinds of different things. A minimal sketch of what such a shuttle.yaml might look like follows below. And then, on the other side of the picture, there's hamctl.
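To make that concrete, here's a hypothetical sketch of what a shuttle.yaml could look like. The plan URL and variable names are made up for illustration; the exact schema lives in the open source shuttle repository.

```yaml
# Hypothetical shuttle.yaml for a single service (all names illustrative).
# The file stays tiny: it points at the centralized plan, which is where
# the platform team keeps the scripts, templates, and compliance defaults.
plan: https://github.com/example-org/platform-plan.git  # hypothetical plan repo
vars:
  service: account-service   # hypothetical service name
  database: postgres         # ask the plan to provision a database
  serviceMesh: enabled       # opt in to the mesh abstraction
```

Everything behind those few lines, such as how a database is provisioned or how a service joins the mesh, is owned by the plan, so the platform team can change it centrally for every service at once.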
hamctl is basically, as I mentioned, a way to move things around in this config repository, so that we can move some Kubernetes manifests into a specific environment and then have Flux, which we use for reconciliation, actually apply them.

All right. That's something I just needed to highlight in order to understand how we then implemented this with a service mesh. And how does that really help us as a bank? Well, right now we get mTLS out of the box, which is definitely a nice thing for a bank: a move towards zero-trust networking, the BeyondProd model described by Google research. Having that was really good to get.

Another issue we were starting to see at this point was gRPC load balancing internally. If you try to do gRPC in Kubernetes, you might know that once you connect to a pod, all requests are pinned to that pod, which is quite annoying. So having something in the middle that can help you distribute requests and actually do load balancing was a really nice feature to get as well. It looks more like this: a much more even distribution of requests between the services you have running.

Another cool feature we got from the Linkerd setup is what's called service profiles. They are basically a nice way for our developers to specify idempotency, retryable endpoints and paths, so that the proxy layer can do retries for them. They don't have to implement this in the services; they can specify it using the specifications they are already writing, with Protobuf or OpenAPI as in the example here, and just have the mesh handle it at the network layer, in the proxy, instead of doing it themselves. So if a request fails, it's retried at the proxy level instead of in the service. That's also a nice feature to get.

And the last feature we're really excited about, in Linkerd 2.12, is the path-based routing policies that allow us, for example in this case, to say that Prometheus is the only thing allowed to scrape the metrics endpoints. So we can lock things down and ensure that only services that need to talk to each other are actually able to talk to each other. Sketches of both a service profile and this kind of policy follow at the end of this section.

Cool. So that's some of the stuff we got out of the box by adding a service mesh to our stack. Another thing that was very important for us, especially given the complexity of adding a service mesh to our stack, was incremental adoption. That was also something we built into the shuttle tool, so that our different teams could experiment, try it out, and see how it went. The first thing we did was create an alpha feature in our shuttle setup that allowed developers to say: hey, I want to deploy this into our development environment and see how it works. We let that run for a while and had people experiment and see if they could find any errors. Then we moved on to a more global setting, where all environments were added into the mesh. And lastly, we enabled it by default. So now this is something you as a developer at Lunar get out of the box.
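To make the service-profile idea concrete, here is a minimal sketch of a Linkerd ServiceProfile that marks one route as retryable and caps retries with a budget. The service, namespace, and route are hypothetical.

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # ServiceProfiles are named after the FQDN of the service they describe.
  name: account-service.prod.svc.cluster.local   # hypothetical service
  namespace: prod
spec:
  routes:
  - name: GET /accounts/{id}
    condition:
      method: GET
      pathRegex: /accounts/[^/]+
    isRetryable: true        # the proxy may retry this idempotent route
  retryBudget:               # cap retries so they cannot amplify an outage
    retryRatio: 0.2          # at most 20% extra load from retries
    minRetriesPerSecond: 10
    ttl: 10s
```

In practice developers don't have to write these by hand; Linkerd can generate profiles from the Protobuf or OpenAPI specs they already maintain.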
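And for the Linkerd 2.12 route-based policy, here is a rough sketch of the resources involved: a Server for the metrics port, an HTTPRoute for the /metrics path, and an AuthorizationPolicy that only admits Prometheus' identity. Names, namespaces, and labels are hypothetical.

```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: account-service-metrics   # hypothetical
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: account-service
  port: metrics                   # hypothetical named container port
---
apiVersion: policy.linkerd.io/v1alpha1
kind: HTTPRoute
metadata:
  name: metrics-route
  namespace: prod
spec:
  parentRefs:
  - name: account-service-metrics
    kind: Server
    group: policy.linkerd.io
  rules:
  - matches:
    - path:
        value: /metrics
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: prometheus-only
  namespace: prod
spec:
  targetRef:                      # authorize only this route, not the whole Server
    group: policy.linkerd.io
    kind: HTTPRoute
    name: metrics-route
  requiredAuthenticationRefs:     # only Prometheus' service account may call it
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```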
You don't have to think about this anymore; it's just part of the abstractions and the things we provide as a platform team.

And then, multi-cluster functionality. I'm not sure how familiar you are with how Linkerd works and its multi-cluster aspects, so I just wanted to bring this in as well. It was pretty simple to do. You basically just install the multi-cluster control plane, get that up and running in both of your clusters, and then you use the CLI to create a link. You can do that the other way around as well, which is this example; in that case you get a bi-directional link between your two clusters. What you get is some components: a gateway and a service mirror. I have another example that highlights what these components do. The service mirror basically watches the API server of the opposite cluster, listening for a specific annotation. The default is that on a Kubernetes Service in one cluster you set the annotation mirror.linkerd.io/exported: "true"; you can of course control these and name them as you want. The service mirror then just mirrors that Service over to the other cluster and creates it there, making it super easy and transparent for services on that cluster to call the service, and traffic will just flow through. I haven't drawn the gateway here, but there is a gateway in the middle as well that routes the traffic to service A on the other side. So that's, put very simply, how it works. A small sketch of the link-and-export flow follows at the end of this section.

Then for us, as mentioned, we started out with this log shipping between AWS accounts. But at some point our company and our founder decided to acquire other companies: we acquired a company called Lendify and another company called Paylike. That put us in an interesting position, because now we needed to connect these different environments. In the case of Lendify, they were running in Azure while we were running everything in AWS, and they were doing a lot of things with Azure App Services. So we needed a way to connect Azure App Services with stuff running in a Kubernetes cluster in AWS. At the same time, we were also building things in GCP, because we are moving our data stack there, so we needed connectivity between those accounts as well. And we thought: hey, we have multi-cluster running already, it works super nicely, we've been running it for a year, it's stable. Let's just use that to connect all these different cloud providers.

But then we ran into an issue: we didn't have everything running in a Kubernetes cluster. We had a lot of things running outside Kubernetes, in Azure for example, as App Services. So how do we contact an App Service from a Kubernetes service in AWS, through a Kubernetes cluster and then out to the actual resource in Azure? And the other way around as well, because the Azure App Service might need to request something that runs in a Kubernetes cluster in AWS. We couldn't really find a good solution for that use case in the open source space, so we built something ourselves using Envoy. We called it the Backbone Gateway. It's basically just an abstraction over some Envoy configuration.
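As promised, here is a small sketch of the link-and-export flow, assuming a one-directional link from a workload cluster to the platform cluster. The kubectl contexts and service names are hypothetical.

```yaml
# 1) Create the link once, pointing the workload cluster at the platform
#    cluster, roughly like this:
#
#      linkerd --context=platform multicluster link --cluster-name platform \
#        | kubectl --context=workload apply -f -
#
# 2) Export a service from the platform cluster by annotating it. The
#    service mirror in the workload cluster watches for the annotation and
#    creates a mirrored Service (e.g. "log-ingest-platform") whose traffic
#    flows through the multi-cluster gateway.
apiVersion: v1
kind: Service
metadata:
  name: log-ingest              # hypothetical log-ingestion service
  namespace: logging
  annotations:
    mirror.linkerd.io/exported: "true"
spec:
  selector:
    app: log-ingest
  ports:
  - port: 8080
```

Because we only link workload clusters to the platform cluster, and never the other way around, data can flow in, but nothing in the platform cluster can call back into production.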
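And to give a feel for the Backbone Gateway itself, here is a minimal hand-written sketch of the kind of Envoy configuration such an abstraction could generate for one egress target. The listener, cluster, and Azure hostname are all hypothetical; this is the shape of the config, not our actual setup.

```yaml
static_resources:
  listeners:
  - name: egress-credit-score            # one listener per exposed resource
    address:
      socket_address: { address: 0.0.0.0, port_value: 8443 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: egress_credit_score
          cluster: azure-credit-score
  clusters:
  - name: azure-credit-score
    type: LOGICAL_DNS                    # resolve the external hostname via DNS
    connect_timeout: 5s
    load_assignment:
      cluster_name: azure-credit-score
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: credit-score.azurewebsites.net   # hypothetical App Service
                port_value: 443
    transport_socket:                    # TLS out to the Azure endpoint
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        sni: credit-score.azurewebsites.net
```

A plain Kubernetes Service placed in front of a listener like this, annotated with mirror.linkerd.io/exported, is then all it takes for services in another cluster to reach the external resource.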
But it really provides us with a thing we can deploy in each cluster that makes connectivity in and out of a cluster in each cloud possible. And then we built a nice abstraction on top, using the shuttle tool I mentioned before, so that it is again easy for our developers to say: I want to expose this resource to services running in this cluster.

So in this case we have egress: the case where an AWS service needs to contact an Azure App Service. You of course need to declare this and make it available. What we do here is basically generate some Envoy configuration and then create a Kubernetes Service in front of it, with an annotation saying it needs to be mirrored to the AWS side. Then it's possible for a service in AWS to call through the proxy and out to the Azure resource. The other way around, Envoy really doesn't do much; actually, it doesn't do anything. We just use the same repository to create an abstraction around creating an internal ingress instead, and then allow traffic to flow over to the other cluster. So again: using the tooling we had already built, and creating some nice abstractions, made it pretty easy for developers to get these multi-cloud capabilities in place. The use case is basically: my service needs to be able to request credit scores from a service running in Azure App Services, how do I do that? I just showed you. It looks something like this. Super easy for them: they just need to add this to a repository, and the platform takes care of it for them.

The last thing we built around the Backbone Gateway setup was monitoring of the connectivity between the different clouds: what is the latency? So we built something we call the backbone gateway probe. It's basically just a simple Go service that probes another of these backbone gateway probes in another cluster and registers and logs what the latency was, and then we use our tooling to visualize this. One of the interesting things we saw is the asymmetry: the latency from AWS to GCP was 30 milliseconds, but from GCP to AWS it was more like 60 milliseconds on average, which was an interesting finding. It's nice to have this, and now we can monitor all the links. Building that ourselves was pretty simple; a rough sketch of the idea follows after this section.

Wrapping up, I think a service mesh really increases security, reliability, and scalability for us. But we needed to build a lot of stuff around it, wrap a lot of things around it, because we also want to make this easy. We don't want our developers to write a lot of YAML and things like that; we want to abstract all of that away. Now, from the developer perspective, it's pretty simple. They don't really have to care about all of this; we take care of it for them. With the Backbone Gateway that I talked about and showed you before, we now have this communication backbone between basically the three different clouds we are present in right now, and we can expand that to anywhere we can put a Kubernetes cluster and add a Linkerd control plane. So that's pretty nice.
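The talk doesn't show the probe itself, but as a rough idea, here is a minimal Go-flavored sketch of what such a probe might do. The peer URL, port, and interval are made up.

```go
// Hypothetical sketch of a backbone gateway probe: periodically call the
// peer probe in another cluster and log the observed round-trip latency.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical peer, reachable from this cluster via the mesh/gateway.
	const peer = "http://backbone-gateway-probe-gcp:8080/ping"
	for {
		start := time.Now()
		resp, err := http.Get(peer)
		if err != nil {
			log.Printf("probe to %s failed: %v", peer, err)
		} else {
			resp.Body.Close()
			// These measurements are what we would scrape and visualize per link.
			log.Printf("peer=%s latency=%v status=%d", peer, time.Since(start), resp.StatusCode)
		}
		time.Sleep(10 * time.Second)
	}
}
```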
Yeah, and then this shuttle tool basically allows us, as a platform team, to take in all of these new technologies, experiment, see what works and what doesn't, try out the different service meshes, and then decide and have them adopted.

And the last thing we learned in this, and I haven't talked that much about GitOps: the way we structured our Git repository is that each environment is represented by a directory in this config repository. Having the hamctl tool I talked about earlier basically just allows us to create a new folder for Azure dev or GCP dev, and then we can use the tool to release into it. So we get a uniform way to release stuff into multiple different cloud providers, which is pretty nice. Of course there are some differences; if you're using a load balancer in GCP or in AWS, there are probably some different annotations based on what kind of load balancer you're using. But once that's taken care of for you, you really get this feeling of: we can deploy this anywhere, using the same way of working with the stuff. And I think I actually went a bit fast, so I think that was it for me. So yeah, if we have time for questions... Yeah.

Thank you so much, Casper. Big round of applause, please. And we do have five minutes before the coffee break, so does anybody have any questions? I will run over with the microphone. Okay, running over with the microphone.

What are your concerns with intra- and inter-mesh security, like east-west between pods? Where does that fall into your design and thoughts around the service mesh?

Can you repeat the question?

Do you take any approaches to deal with layer-7 security in the service mesh, like doing an internal WAF between pods, that kind of thing?

No, not really. Right now, all of this is basically done using whitelisting of IP addresses, a very static, old-school kind of way of doing it, I guess. Basically, when you run the Linkerd link command, the control plane generates a configuration: it generates the secret and the API token so it can talk to the API server of the other cluster. And then we just internally allow communication across the different clusters. So that's how it works today.

So you don't have any kind of intrusion detection system or threat detection going on?

We do have a lot of logging and a lot of metrics and alerts that monitor this, and we have a more dedicated security team that monitors it more closely.

Okay, thank you. Sure. Anybody else? Oh, a question over there. Who put your hand up first? You win, because you're closer. I apologize.

Thanks, great talk. I'm really happy that this has worked out so well for you in production, it seems. I'm curious: you mentioned at the very beginning that scale of complexity versus features, and we're sitting on that scale at the moment. Having run this in production for a few years, and having mentioned a lot of things about observability, which is what a lot of people bring up around service meshes: did you find that the service mesh gave you an easier way to troubleshoot production or even QA issues because of that observability?
Or does it cause more issues because it's really complex and hard to understand?

Yeah, it definitely caused some interesting problems: pods not being allowed to talk, or some of the services we run using weird protocols of some sort. We definitely saw some issues. But for us, the features and the things we get outweigh that complexity. We have done a lot of work around monitoring what the proxy is doing and configuring the control plane; at some point, when we added this to the entire fleet of services, we needed to scale the control plane a little bit, and of course we had to spend some time figuring out what was going on when the linkerd-destination component was, you know, crash-looping. Why is that? But we are now also more people, so we have enough resources to dedicate one or two people to really get deeper into what the service mesh actually does. But it does add some complexity, definitely.

Great talk. I just had a question about your adoption phase. You mentioned that it took a few years to adopt a service mesh. Was there any notable point of friction that prolonged that adoption? Was it the complexity of the service meshes at the time, or was it that they were too invasive?

Yeah, I think when we tried out Linkerd 1 back in '17, we had the option to just enable the Linkerd proxy injection on a namespace level, and when we did that, everything just broke. Nothing worked, and we needed to figure out what the hell was going on. So that was kind of why we stopped back then. Then, I think it was in '19, my colleague tried out Linkerd 2; he actually did a presentation at a meetup at some point showing how easy it was to do this incremental adoption of a mesh. But we weren't really ready at that point still; we were still like two people handling everything around the platform, so we didn't really feel we would have enough people to sustain it. But yeah: try it out, do the incremental adoption, try things slowly, let it run for a while, and validate that it is actually working. That was a really nice way to get the confidence to actually run this.

Okay, I think we've got time for one last question.

Good morning. In regards to your industry being highly regulated, and going down the multi-cluster route along with what I assume is sensitive data: did you have any friction or any restrictions in regards to auditing, or in regards to that cross-cluster communication and basically where data is stored, that kind of thing?

As it is right now, we haven't really had any audit issues. We have a really progressive CISO who is very techie, and I think he's really good at explaining what is going on here in a language that they understand. I think that's been very helpful. I'm not that involved in that aspect, but we haven't really gotten anything yet, at least. We'll see.

Thank you. All right, thank you so much, Casper. If anybody has questions, I'm guessing you're going to be hanging around by the coffee pot, so people can definitely find you there after that fantastic talk. Big round of applause again, please.