Welcome. My name is Christian, and together with my colleague Lutz, I will talk about how we solved a compliance challenge by adopting Linkerd2, and what we had to do for day-two operations at Finlib Connect. This talk is organized around the challenges we faced and how we solved them.

So here's the first challenge: how to encrypt all traffic between applications in our cluster in no time and with zero developer overhead. To understand why it had to be "no time" and "zero developer overhead", it helps to know where it all started. It started in 2018, when Lutz and I, together with some friends of ours, joined the company, Finlib Connect, to rebuild the existing internal cloud platform. The previous platform was rather traditional, OpenStack-based, and running at capacity, so it had to be rebuilt and modernized. The new platform was to be based completely on Kubernetes, fully containerized: a private cloud running on bare metal in our own data centers. On the right side, you see roughly the setup: one large logical production cluster spanning three data centers in Frankfurt. The specifics of this, and why it had to be that way, are probably the subject of a different talk. In short, Finlib Connect holds a payment institution license, so we are part of the financial services industry and have to fulfill strict security and compliance requirements, especially, as you may have guessed, regarding the encryption of traffic between services running in our cluster.

To make it worse, the project faced several challenges. There was a very tight timeline of only five to six months to implement the solution, with some planning time beforehand. The cloud team counted around five people, some of whom could only join a bit later: we started with two people and ended up being six. Before going live, the new platform had to pass compliance checks from customers and regulators alike. And last but not least, the development teams in the company had to wrap their heads around a lot of new concepts, because alongside the new platform we were also making the transition to a fully cloud-native setup: some services had to be rewritten, databases changed, Helm charts introduced, these kinds of things. Those were the challenges we faced in implementing this whole solution.

Now, to the particular topic of this talk, the road to a service mesh and the challenge of encrypting all of the traffic in the cluster: we needed this before the go-live. We sat down to discuss the matter, looking at what we already had in the infrastructure and the current application landscape. Option A was, as some services already did it, to build full mTLS encryption into each service. We discussed this, read up a bit on it, and quickly realized it was a no-go: the risk of having to maintain all these different implementations in different languages, with us all being human and therefore prone to implementation errors, was simply too high, and the timeline didn't permit it either. So option A was off the table quickly.
Option B, then, was the idea to use HAProxy or NGINX as a reverse proxy, configured for mTLS based on an internal PKI. Basically the minute we put this on the table and thought about it for another minute, we realized that somebody on the internet had probably already done this, and probably better than what we could achieve in the time left. That left option C: without really knowing it, we were searching the internet for the service mesh landscape.

Doing so back in 2018 quickly brought up Istio, as it had hit version 1.0 in July 2018. A look at the Istio website showed a very rich feature toolkit that was quite compelling to us, including mTLS, and it also appeared to have already been adopted by a good number of organizations, which also spoke for the project. So we went ahead and installed Istio into our cluster. This worked, though it was not as simple as we might have thought in the beginning, and furthermore it required a specific configuration for each service. When we looked at what that actually meant, we thought: okay, this is a bit too tough for developers who already had a lot on their plates; we had to keep the learning curve sustainable for everybody and make it somewhat doable. So we decided, sadly, that we couldn't use that solution.

That's when we looked further and found another project called Conduit. It was supposed to be a super-lightweight and very simple-to-install service mesh. We tried it, and indeed it just worked. Sadly, we quickly discovered it did not yet support mTLS, so we were kind of back to the beginning. At this point in time we said, okay, we need to talk to some people in the community, as this is usually what works best, especially in the awesome cloud-native community. We joined the Slack channel of the Conduit team and quickly discovered an excellent community with excellent support for every question you had, and, especially important for us, mTLS was promised to be one of the next items on the roadmap. So, combined, we said: okay, let's take our bet on the Conduit project to deliver this in time; the timeframe looked good from our perspective. We went on with our other business, like adding new databases, bringing up the Kubernetes cluster, testing things, et cetera.

In the meanwhile, Conduit was rebranded to Linkerd2, as it was in fact the new implementation of the Linkerd team, succeeding Linkerd1. In Q1 2019, mTLS was added to it as an experimental feature, so we could already experiment with it, and we found it to work quite nicely for our use case. As we closed in on our go-live, which was scheduled for Sunday, April 7th, we did some further testing and rolled out our first production environment on the cluster as a test, to see how it would perform. Sadly, we found that the Linkerd control plane's resource requests fell short for our use case, given the number of pods this particular environment had, and this was not configurable by ourselves, as the option simply wasn't available in the Linkerd toolchain at the time. So we reached out to Buoyant, the company behind Linkerd, and asked them for help, and we have to be really grateful that they very, very quickly pushed out a new version, published on the Saturday, to which we updated, and that also went very smoothly.
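In current Linkerd2 releases, by the way, these control plane and proxy resources are configurable via the Helm chart. As a rough sketch, a values file along these lines sets them, using the key names of the 2.9-era linkerd2 chart (verify against the values.yaml of the chart version you run; the numbers here are made up for illustration):

```yaml
# Hypothetical Helm values excerpt for the linkerd2 chart (2.9-era key names).
controllerResources:        # the control plane controller pods
  cpu:
    request: 100m
    limit: "1"
  memory:
    request: 128Mi
    limit: 512Mi
identityResources:          # the identity service that issues proxy certs
  cpu:
    request: 100m
    limit: "1"
  memory:
    request: 64Mi
    limit: 256Mi
global:
  proxy:
    resources:              # defaults for every injected linkerd-proxy sidecar
      cpu:
        request: 100m
        limit: "1"
      memory:
        request: 32Mi
        limit: 256Mi
```

Applied with something like `helm upgrade linkerd linkerd/linkerd2 -f values.yaml`, this is exactly the kind of knob we were missing back then.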
And so we were finally able to go live as planned on April 7th with this new version that had fixed the resource shortage. Shortly after our go-live, mTLS also became generally available with Linkerd 2.3 on April 16th. Long story short: after this intense five-to-six-month period, Linkerd2 had basically helped us solve this large challenge of implementing fully automatic mTLS between all services in the cluster without actually having to involve the development teams. There was one minor case where a team had to set some configuration for a particular service, but that was just to tweak and tune something. So that was really a great success in the end.

We went on migrating our customer environments onto the cluster. Here it's probably important to mention that our cluster is a multi-tenant setup, where a tenant is an internal team or business unit. We are not offering this externally; it is all in the same trust zone, if you want, but we have to separate these environments. The first question we hit while scaling up was: how do you scale Linkerd to 5,000 pods? Is this actually an issue, or does it just work? We can conclude that, for the most part, it just works. Naturally, you have to increase and set some resource requests and limits appropriately, which we did. This is our current configuration, running Linkerd 2.9 with almost 5,000 pods, and that works just nicely, except for one thing that bugged us lately: the Prometheus setup that comes with Linkerd. As you can see on the right, it's quite heavy on both CPU and memory, and it only has a retention period of six hours. These constraints became a challenge, because we naturally wanted to make use of these metrics, and we did, but it was always limited and always a bit hard to work with those values.

So we ended up taking our tenancy concept, and the fact that we had some work to do here anyway, and turned this into a proper monitoring solution, which I'm going to explain now. It's built around the idea of slicing and dicing the Linkerd metrics along our tenant deployments. The first thing we did was disable the Linkerd Prometheus; by now you can do this directly in the Helm chart when deploying Linkerd. So Linkerd itself runs no Prometheus. Our tenancy system then works like this: each tenant has a "-system" namespace, here in the middle, plus the respective workload namespaces belonging to that particular tenant. And since we in the cloud team follow the concept of eating our own dog food, the cloud stack itself is also implemented as a tenant. As you can see here below, the cloud tenant's system namespace contains a Prometheus deployment that scrapes the Linkerd control plane metrics, together with all the other Kubernetes and infrastructure metrics, and then offloads them to a Thanos deployment for long-term metric storage. We of course also use this Prometheus for alerting based on these metrics. Each tenant's system namespace, on the other hand, runs an additional Prometheus with a one-day retention time, also used only for alerting, plus a Thanos sidecar for offloading. It scrapes all of the Linkerd proxy metric endpoints in that tenant's workload namespaces, thereby reducing the overall amount of metrics each per-tenant Prometheus has to hold.
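To give you an idea of what such a per-tenant scrape job can look like, here is a hypothetical Prometheus configuration, modeled on the scrape config that ships with Linkerd's bundled Prometheus, but restricted to one tenant's workload namespaces (the namespace names are made up):

```yaml
# Hypothetical per-tenant scrape job for linkerd-proxy metrics.
scrape_configs:
  - job_name: linkerd-proxy
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:                  # only this tenant's workload namespaces
            - tenant-a-frontend
            - tenant-a-backend
    relabel_configs:
      # Keep only the admin port of the injected linkerd-proxy containers.
      - source_labels:
          - __meta_kubernetes_pod_container_name
          - __meta_kubernetes_pod_container_port_name
        action: keep
        regex: linkerd-proxy;linkerd-admin
      # Carry namespace and pod names over as metric labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The linkerd-proxy sidecar exposes its metrics on the admin port, named linkerd-admin, which is what the keep rule selects; scoping the pod discovery to the tenant's namespaces is what keeps each Prometheus small.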
Optionally, these teams are able to federate with a Prometheus proxy that exposes only selected Linkerd control plane and general Kubernetes control plane metrics to the tenants, if they are interested, thereby further reducing the required amount of metrics and resources per tenant Prometheus. With this admittedly complex system, we are now able to sustain a proper monitoring setup and to automatically deliver the monitoring dashboards, basically the four golden signals that Linkerd as a service mesh provides out of the box, to each tenant, and thus to each team and each service, right when they deploy their service onto our cluster. And with that, I conclude the first part of the talk and hand it over to Lutz, who is going to talk about the ops challenges in part two.

Thank you, Christian. As Christian told you, Linkerd has automatic mTLS out of the box, but that involves setting up quite a number of certificates that have to provide certain characteristics, and we had some problems achieving those. What is great about the setup is that you get it all automatically when you install Linkerd, either via the CLI command `linkerd install` or by deploying the appropriate Helm chart: the certificates and their private keys are stored as Kubernetes secrets, and the actual certificates for the Linkerd proxies, 5,000 in our case, are rotated every 24 hours. And even though I'll show you in a moment how this evolved from an early state, it still works completely automatically out of the box in the current version 2.10. If you don't do scary and hard stuff like we do, you don't have to mess around with the automation the way we had to.

It helps us support our strict GitOps approach, because we run a Helm upgrade for every component in the cluster for every commit to our Git pipeline, which results in roughly 30 to 50 Helm upgrade runs on a given day. And as Christian said, we run large multi-tenant clusters that we also run for a long time, meaning that certificates will eventually become invalid. We chose not to take the easy path of setting the validity to multiple years, but rather to rotate them, ideally in less than the one-year validity period.

But there are certain small details that made this hard for us. For one, the Helm upgrade is not really idempotent: every single Helm upgrade run creates a new set of certificates and private keys for each of the webhooks. And because they are self-signed, you cannot rotate them in a stable manner, "stable" meaning that you create a new pair of private key and certificate, distribute the new root certificate to every possible client, and only once all certificates in the chain are updated do you remove the old root certificate. And because Helm creates a new self-signed certificate for each of the webhooks, you also cannot do proper injection with tools like cert-manager for the webhook certificates. This does work, or did work, in version 2.7 already, and was properly documented for the identity chain, but not for the webhooks.

In Linkerd, you have to distinguish between two basic groups of certificates. One is the actual control plane chain, which generates the proxy certificates for the automatic mTLS between all the pods of your applications. The other covers the, initially four, today only three, webhooks and API services that provide the Kubernetes API integration.
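For the identity chain, the documented cert-manager approach looks roughly like the following sketch. It assumes a CA key pair for the trust anchor has already been placed in a secret named linkerd-trust-anchor in the linkerd namespace; cert-manager then keeps the intermediate, the identity issuer certificate, rotated (resource names follow the Linkerd docs, the durations here are illustrative):

```yaml
# An Issuer backed by the pre-created trust anchor CA secret.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: linkerd-trust-anchor
  namespace: linkerd
spec:
  ca:
    secretName: linkerd-trust-anchor
---
# cert-manager re-issues the identity issuer certificate well before expiry.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h
  renewBefore: 25h
  issuerRef:
    name: linkerd-trust-anchor
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
```

With this in place, the identity component always finds a fresh issuer certificate in its secret; the problem we still had to solve was getting the same stable rotation for the webhook chains.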
So we set out to work with the Linkerd project to allow for a better integration with cert-manager, as well as the ability to rotate all of these, four, respectively five, chains automatically and in a stable manner. Let us quickly revisit how certificates were managed in version 2.7, so it's easier to understand the changes we helped introduce. First, you used either the linkerd CLI tool or the Helm chart to automatically set up the trust anchors. One is actually called the trust anchor; it is used to sign the identity issuer certificate, which forms the top of the control plane TLS certificate chain. The others are further self-signed certificates for the webhooks and API services that you see listed here. The identity issuer certificate, in turn, is used by the identity component of the control plane to issue and automatically rotate the certificates for each of the proxies.

The changes we introduced started with improving the compatibility with cert-manager. Some of these were small things, like changing the formats of certain Kubernetes secrets; others were the support for a separate CA certificate for the webhooks, as well as allowing tools external to the Helm chart to set the certificates within Kubernetes secrets, so that external tools could then manage their rotation. Now, if you choose to do this automatically with cert-manager, you create two trust anchors, one for the control plane chain and one for the webhooks, which cert-manager then uses to generate the intermediate certificates, while the identity issuer still creates the proxy certificates from there.

At Finlib Connect, we go one step further, using a vault operator, not yet released as open source, that pulls secrets from HashiCorp Vault into our cluster, because we believe in not making certain secrets and credentials visible to any human being. Thus, we have automated this last step as well, creating the trust anchors within the cluster, which basically allows us a full sweep of automation for the whole certificate management.

With that, I would conclude, and thank you for your attention. Yeah, thanks for your attention. You can find us on Twitter, here are our handles, and of course we're also on the Linkerd Slack and the Kubernetes Slack. We're looking forward to any questions you might have in the comments.