Hello everybody. Thank you for joining. I want to start by asking: how many of you run Kubernetes on at least one cloud provider? Pretty much everyone. How many of you run Kubernetes on two or more cloud providers? Probably about half the room. So we're here to talk about going multi-cloud. You've got one and you want another. For those of you expanding to multiple providers, I'm going to share some of our experiences on a recent project, and the goal is to give you an idea of what was required and what to watch out for. For those of you already on multiple clouds, I'm curious if you can relate to some of these experiences. If you've run into different issues, or have suggestions, I'd be interested to chat after the talk.

My name's Nico. I'm an SRE on Grafana's cloud platform team. If you're not familiar with Grafana, we're an open source observability platform. On top of the Grafana dashboarding system, we offer a number of cloud products: things like hosted Grafana, Grafana Cloud metrics, logs and traces, incident management, on-call tools, black box monitoring, and more. The platform team that I'm on manages the cloud infrastructure that all these products run on. So we're looking after the Kubernetes clusters, internal monitoring systems, CI/CD and other internal tools.

We're going to start by asking why you might opt for multi-cloud, like why would any company consider doing this in the first place? We'll go through an overview of a recent cloud expansion project at Grafana, walk through some of the hiccups in this project and look for the lessons we can learn. We'll also review what went well, and at the end there'll be time for any questions.

So let's start by asking: why do this in the first place? There are a number of reasons, but I'll just touch on a few. An obvious one is that you gain more regions. If you think of any cloud provider, they have a list of available regions you can deploy to. Between multiple providers there's lots of overlap, but there are a lot of unique locations as well. And we have a use case for this. At Grafana, we have a synthetic monitoring product, which allows customers to deploy health probes at different locations around the world, think things like ICMP or HTTP health checks. To support specific locations, we have to deploy this product across multiple providers.

There's also vendor lock-in. If you run your internal services or customer-facing products across multiple clouds, you now have the freedom to lift and shift where these workloads run based on things like cost, performance and stability. So you can imagine one provider offers you a better discount, one is performing better, or maybe you notice outages are more frequent in one provider, and you can shift away from them.

The last reason I'll mention is customer preference. This was actually new to me, but it was one of the main motivations for our project. Our customers are mostly other businesses who want help with their observability, and they can care about things like latency to their workloads, meaning they want our products in regions close to them; data sovereignty, meaning they want to ensure their data is stored in specific regions or countries; or spend commitments, where they've been offered discounts in exchange for a minimum spend in a certain time period.

So let's talk about the project. Around one year ago, we started to expand our cloud platform to AWS. I was fairly involved, and we'll be using this as the foundation for the rest of the talk.
Until this point, we ran the majority of our services on GCP, on Google Cloud, with a small presence on other providers. This means I'll be sharing the differences between these specific providers; however, I'm trying to keep it as generic as possible, so the lessons learned will apply to whichever provider you're running on or want to expand to. And we were really starting from scratch. We had to create the AWS organizations and accounts; the networking resources, so things like VPCs and VPNs; the IAM users and policies, so that engineering staff had access to the cloud services their products relied on; the actual Kubernetes clusters; and a number of essential workloads. So I'm going to touch on a few of these items.

Regarding VPNs, we connect all the clusters of the same status to each other, so dev clusters with dev clusters and prod clusters with prod clusters. This is done using each provider's managed VPN service, so GCP, Azure, AWS. There are a few reasons we do this. Like many of you, we have Prometheus running in the clusters, scraping all the cluster metrics, and we also have global meta-monitoring Prometheus servers, which scrape all those Prometheus servers to make sure they're actually healthy and alive. We also have a global Alertmanager cluster, which spans multiple continents, and a number of other internal services which communicate cross-cluster. All of those are running on private IP ranges, so we use the VPN to keep that traffic on the private ranges. This will come up again when we talk about allocating IP addresses to our clusters.

Why managed Kubernetes? I won't dig too deeply into this, but a few points: it's simple to get started, it's easier to maintain, and it allows us to lean on the experience of experts. These large cloud providers have been running thousands of clusters for a number of years, and it takes a lot of weight off our shoulders, as there's plenty on our plate as it is. I was reading last year's CNCF survey, and it turns out more than three quarters of companies use managed Kubernetes platforms, so I suspect this is something that many of you can relate to.

Regarding the essential workloads: we want Prometheus installed, scraping system and service metrics internally and sending alerts to that global Alertmanager. We want Grafana Agent scraping those same metrics and remote-writing them to a Grafana Cloud metrics store for long-term storage. We want a Grafana instance preloaded with a number of dashboards. We want Flux, which we use for continuous deployment; it basically syncs objects from a Git repository into our Kubernetes clusters. And we want an external secrets operator, which calls back to a central Vault cluster and syncs secrets from Vault into the Kubernetes namespaces. It's not a complete list, but it covers most of the essentials. There are other things like Argo CD, KEDA, a custom admin app, and a few others. And we really want all of this in place before the infrastructure team opens up the cluster internally and asks product teams to start deploying their products.
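To give a rough feel for that last piece, here's what a single secret sync can look like when expressed in Terraform. This is only a sketch: the store name, namespace and Vault path are made up for illustration, and the exact API version depends on which External Secrets Operator release you run.

```hcl
# Sketch: sync a Vault secret into a Kubernetes namespace via the external secrets operator.
# The store name, namespace and Vault path here are hypothetical.
resource "kubernetes_manifest" "app_db_credentials" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "app-db-credentials"
      namespace = "my-product"
    }
    spec = {
      refreshInterval = "1h"
      # Points at a (Cluster)SecretStore configured to talk to the central Vault.
      secretStoreRef = {
        name = "vault-backend"
        kind = "ClusterSecretStore"
      }
      # The Kubernetes Secret that pods will mount or reference.
      target = { name = "app-db-credentials" }
      data = [{
        secretKey = "password"
        remoteRef = {
          key      = "secret/data/my-product/db"
          property = "password"
        }
      }]
    }
  }
}
```

The operator then keeps the resulting Kubernetes Secret refreshed from Vault on that interval.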
That kind of concludes the overview of the project. Now I'm going to walk you through it and talk about five lessons I learned from the experience.

So, you've been tasked with building cloud infrastructure on a new provider. Where do you start? Before we begin, we need clear requirements. It sounds pretty obvious, but I thought I'd mention it anyway. We have to ask ourselves questions like: what workloads are we going to be creating, and what won't we be? As I mentioned, we've been running on GCP since Grafana started, and we're not going to be taking everything over; we really just care about the Grafana Cloud product in this case. We need to ask: do we avoid inter-provider dependencies? Yes, ideally as much as possible. We really don't want an outage in one provider impacting our services in another. I did mention that external secrets operator, which calls back to a Vault cluster. Right now that Vault cluster lives in GCP. If there was an outage or a network issue, the existing secrets would remain as they are, but we wouldn't be able to create new ones or refresh them. So maybe that's something we could improve in the future. And what's the expected scale? Are we planning on five-node, ten-node clusters, or do we want the capacity to scale to hundreds or even thousands of nodes? It might be difficult to answer some of these questions early on, but it really helps set the scope of the project.

Once we know our requirements, we can start research. We run the majority of our workloads on Kubernetes, so we start there. Start reading about Amazon EKS, Amazon's managed Kubernetes service. Begin by reading documentation, watching conference videos, reading blog posts, trying to learn what's new and what's different. Maybe open up the AWS console, spin up a cluster and just get familiar. Time passes, we start developing a plan, and we're sitting there thinking, great, we're ready to get started. We can build our first POC or development cluster. And almost immediately we run into our first issues. As I mentioned earlier, before we can create a cluster, we have to create the virtual networks. So let's take a look at that.

When we compare VPCs between GCP and AWS, it turns out that in GCP, VPCs are global resources (a lot of acronyms, I know). What do I mean by this? I mean that you can have multiple subnets from multiple regions in a single VPC. And it turns out we actually do this: we have some use cases where we have a shared VPC with many clusters across many regions. When we look at AWS, VPCs are regional resources. You can have multiple subnets, but they must all be from the same region. So we're left to make a decision: are we going to create a VPC per region or a VPC per cluster? No matter what we decide, there's going to be a slight drift between how we configure our cloud providers now.

With the VPC in place, we need to assign it IP addresses. In GCP, you can have a mix of private ranges. What do I mean by private ranges? I'm referring to 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16, those three private ranges. You can actually have a mix of them in a GCP VPC, which we do. We create subnets from the 10.0.0.0/8 range for things like nodes, pods and services, and we create a subnet from the 172.16.0.0/12 range for the GKE managed control plane. When we look over at AWS, you can have multiple subnets, but they must all be from the same private range. So we're just going to use 10.0.0.0/8 for everything. Again, it's a small difference, but it means we're creating a bit more drift between our providers.

And finally, on the networking side, regarding subnets: GCP supports up to a /8 as the largest subnet size, whereas in AWS the largest is a /16. It shouldn't be a problem, that's over 65,000 addresses, but we've been deploying our pods in /14s on GCP, so when we go to AWS, it'll be /16s. All these small differences meant that we had to refactor our IP reservation plan. Basically, as I mentioned, we connect our clusters by VPN, so we need to ensure that there's no private range overlap. And we really didn't consider this when we first created that plan, when we ran solely on GCP.
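As a concrete sketch of the AWS side of this, say we go with a VPC per cluster. The region, name and CIDRs below are invented for illustration, but it shows the shape of a regional VPC carved out of a single 10.x /16.

```hcl
# Sketch: a regional AWS VPC, with everything allocated from one 10.x/16
# (the largest CIDR AWS allows). Region, name and CIDRs are hypothetical.
resource "aws_vpc" "cluster" {
  cidr_block = "10.40.0.0/16"
  tags       = { Name = "eks-dev-eu-west-1" }
}

# Subnets must all come from the same region as the VPC, typically one per availability zone.
resource "aws_subnet" "private" {
  for_each          = { a = "10.40.0.0/18", b = "10.40.64.0/18", c = "10.40.128.0/18" }
  vpc_id            = aws_vpc.cluster.id
  availability_zone = "eu-west-1${each.key}"
  cidr_block        = each.value
}
```

Every CIDR here has to be checked against that IP reservation plan, since the VPN mesh means nothing can overlap across providers.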
There are a few other things we can compare. If we look at load balancers, meaning the managed load balancers provided by each cloud provider: GCP offers global load balancers, which aren't tied to a region, whereas in AWS the load balancers are all regional resources. If we look at volumes, so network-attached volumes, persistent volumes: the prices and performance differ slightly between providers. And then finally, object storage. At Grafana Cloud, we have three products, Mimir, Loki and Tempo, that use object storage as their primary data store, which means we read and write a lot to the GCS and S3 buckets. It turns out that these cloud providers actually rate limit access to object storage, but based on different conditions. So the product teams had to update their software to behave differently depending on which cloud provider it was running on, which is not something I expected.

So the first lesson, really for myself, is that cloud providers' services are similar, but they're not the same. It was a mistake thinking that we only had to examine the differences between the managed Kubernetes services. We really had to take a step back and review each and every service that we relied on. These are just a few of the examples we ran into, but I think it illustrates that no matter what provider you're on or where you're expanding to, you'll need to do the same. You'll need to review each and every service you use and accept that whatever plans you had, you'll likely have to make adjustments.

So what does this all mean? It means that we have a lot of reading to do. Here's a fun phrase: tutorial hell. Many of us have probably heard this term before. It describes someone who's first learning a programming language and is stuck going through tutorial after tutorial after tutorial, never really feeling confident enough to build something of their own. I'd like to propose a similar term: documentation hell. When you're working on a large project that involves numerous services and systems across multiple providers, there's so much documentation to read that it can genuinely feel overwhelming. You can find yourself stuck reading and planning, but never actually getting started. And this isn't entirely new. Those of us working with software spend a lot of time reading documentation. I personally struggle to remember all the bits of information I read in a day, so I take a lot of notes. But to be honest, it's really only when I start to use a new tool or new system that I feel familiar with it. And I'm fairly confident there are others who feel the same.

So with all this in mind and the problems we ran into, let's get started again. But this time, I think we need to accept that our first plans are going to fail, and we'll need to iterate, probably a few times. I have a couple of recommendations around this. First, use infrastructure as code and version control from very early on in the project. We use Terraform and Git; there are many options in this area. Initially, it's slower progress, right? You're going to have to actually figure out the Terraform resources and then write the code to create them.
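To give an idea of what "writing the code" means here, this is a bare-bones sketch of the cluster resource itself, not our actual modules, and the role and subnet variables are placeholders.

```hcl
# Sketch: a minimal EKS control plane. Real configuration adds node groups, add-ons,
# logging, encryption and so on; the variables are placeholders for resources defined elsewhere.
variable "cluster_role_arn"   { type = string }       # IAM role the EKS control plane assumes
variable "private_subnet_ids" { type = list(string) } # private subnets from the VPC

resource "aws_eks_cluster" "dev" {
  name     = "dev-eu-west-1"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```

Even a small sketch like this goes through a pull request, which is where the next set of benefits comes from.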
But it's worth the trade-off. You'll get benefits like peer review and suggestions, not only from the team working on the project, but from others in the company who are familiar with these tools and services. It also allows you to share the live progress of the project: as everything is submitted as pull requests, other teams can see where you're at and what issues you're running into. And what I find to be a really valuable benefit of version control is that it documents a history. We can now go back and see when something was created, when it changed, and why it changed. We more or less get these benefits for free just by using these two tools. It's also important to remember that no matter how much time we spend in the planning stage, at some point we're going to hit an issue we didn't expect, some bug, and that's okay; that can be expected.

So the second lesson is: if you find yourself stuck in documentation hell, just get started. Often the best way to learn is by doing, and if you use an infrastructure-as-code tool, it's really easy to tear down, iterate, and rebuild. And the earlier you hit these limitations, the earlier you can plan around them.

I mentioned in the beginning that we were starting from scratch. That's not entirely true. We've been running on GCP for a number of years, so it's really tempting to look over there and copy and paste everything into the new provider. And in most cases, that's what we'll do. But we should still look for possible improvements as we go, and I have one example to illustrate this.

On GCP, we had a problem to solve. Some applications, pods, needed access to cloud services, like a database or a GCS bucket. The way we solved this was using GCP service accounts, which have a policy granting access to that cloud resource, say a GCS bucket. We generate credentials for the service account and save them into Vault; the external secrets operator reads that secret and creates a Kubernetes secret that the pods can use. And this works. In summary, we're generating, saving, and rotating credentials. But it means they need to be managed, and it means they can be leaked. With hundreds or thousands of credentials, you can see how this becomes cumbersome. Before I move on, I should mention that GCP does offer an alternative to this, which is managed identities. However, when we first created these GKE clusters, there was a limitation with them, so we opted to use the service account credentials directly.

When we jump over to AWS, we face the same problem: we have applications, pods, that need access to cloud resources, like an S3 bucket. Well, AWS offers something very similar to GCP's managed identities. We need to create three types of resources: an OpenID Connect provider, an AWS IAM role, and a Kubernetes service account. And all they have to do is reference each other. The IAM role has a policy for that S3 bucket, and a condition that says only this Kubernetes service account can assume me. And you go to the service account, and it has an annotation that references the role. With these three things, we solve the same problem: we can permit applications access to the cloud services they need, but we do so without generating and storing credentials, by utilizing identity-based access.
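Here's a rough Terraform sketch of those three pieces wired together, building on the earlier cluster example. The namespace, service account, role and bucket names are made up, but it shows the shape: the role trusts exactly one Kubernetes service account, and the service account's annotation points back at the role.

```hcl
# 1. An IAM OpenID Connect provider for the cluster's OIDC issuer.
data "tls_certificate" "eks" {
  url = aws_eks_cluster.dev.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  url             = aws_eks_cluster.dev.identity[0].oidc[0].issuer
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
}

# 2. An IAM role whose trust policy only allows one specific Kubernetes
#    service account to assume it, plus a policy for the S3 bucket it needs.
locals {
  oidc_issuer = replace(aws_eks_cluster.dev.identity[0].oidc[0].issuer, "https://", "")
}

resource "aws_iam_role" "app" {
  name = "my-product-app"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.eks.arn }
      Condition = {
        StringEquals = {
          "${local.oidc_issuer}:sub" = "system:serviceaccount:my-product:app"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "app_s3" {
  role = aws_iam_role.app.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = ["arn:aws:s3:::my-product-bucket", "arn:aws:s3:::my-product-bucket/*"]
    }]
  })
}

# 3. A Kubernetes service account annotated with the role it should assume.
resource "kubernetes_service_account" "app" {
  metadata {
    name      = "app"
    namespace = "my-product"
    annotations = {
      "eks.amazonaws.com/role-arn" = aws_iam_role.app.arn
    }
  }
}
```

Nothing in there is a long-lived credential; the pod exchanges a projected service account token for temporary AWS credentials at runtime.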
So the third lesson is: don't blindly do what was done before. It's easy, and beneficial, to want to keep things as similar as possible. The less drift between providers, the easier it is for product teams to onboard and start deploying their applications, and the easier it is for the infrastructure team to manage. However, that doesn't mean we shouldn't look for possible improvements. If our end goal is to use managed identities across all of our providers, which it is, then we can opt for the better solution and save ourselves some work in the future by doing it now.

So jumping forward, we're all set up. We have that first EKS cluster up and running. We've installed a number of add-ons, things like the cluster autoscaler, the load balancer controller, and custom networking, which weren't provided by default. And all those essential workloads I mentioned earlier are deployed and running. We're really at an exciting stage in the project: it's time to invite internal product teams to deploy their products to AWS. So the first product team deploys their product, they get everything up and running, and they run load tests. The results are not great. We discovered a couple of problems, including poor performance of certain workloads. So what happened?

The first problem we discovered was poor disk performance. It turns out that we were using the default EKS storage classes. What do I mean by this? I mean that we were using the previous generation volumes, the io1 and gp2 volumes, and with io1 volumes you can set your IOPS between roughly 100 and 5,000, and all the volumes, regardless of their size, were set to 100 IOPS by default. You can imagine how this could be a problem for any workloads with heavy reads or writes to persistent volumes. Personally, I'm a bit embarrassed by this one, but it's something we simply overlooked. It's fairly simple to fix: we install the EBS CSI driver, we update the storage classes to use the new generation volumes, and we set the IOPS to scale based on the size of the volume.

Another issue we discovered with the load tests was Docker Hub rate limits. As you can imagine, when we start running these tests, we're scaling from tens of pods to hundreds of pods to handle that artificial load. And in our AWS infrastructure, all the traffic egresses through a NAT gateway, which has a single IP address. So we found that pods were failing to start due to Docker Hub rate limits. Why didn't we see this before? Well, it turns out GCP was providing us an image cache by default, and we just weren't aware of it. Again, it was fairly simple to fix: we installed a Docker registry mirror internally, acting as a pull-through cache, and we updated all the node pools to use that internal mirror. This resolves the rate limit issue, and by caching images locally, pods start faster and it reduces bandwidth out of the cluster.

So the fourth lesson is: load test your applications. Grafana acquired k6 around one year ago, and we began adopting it internally, but there are other options in this area. It's really important to stretch your applications to the limit and discover issues before your customers do. Had we not run these load tests, it's very possible we'd be running applications with degraded performance, and we'd face scaling issues when it mattered most, when we had real scale due to customer demand. You get the added benefit that you can benchmark the performance between different providers. In our example, the new provider didn't perform as well initially, but perhaps it would actually have performed better, and that would have given us a reason to go back and examine how we're configuring things on the previous provider.
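Circling back to the disk fix for a moment, here's roughly what the updated storage class looks like when expressed in Terraform. This is a sketch, not our exact configuration: I'm assuming gp3 as the new generation volume type, and the IOPS-per-GB value is just an illustration of scaling IOPS with volume size.

```hcl
# Sketch: a storage class backed by the EBS CSI driver, using new generation (gp3)
# volumes with IOPS that scale with volume size. Values are illustrative.
resource "kubernetes_storage_class" "gp3" {
  metadata {
    name = "gp3"
    annotations = {
      # Optionally replace the old default storage class.
      "storageclass.kubernetes.io/is-default-class" = "true"
    }
  }
  storage_provisioner    = "ebs.csi.aws.com"
  reclaim_policy         = "Delete"
  volume_binding_mode    = "WaitForFirstConsumer"
  allow_volume_expansion = true
  parameters = {
    type                       = "gp3"
    iopsPerGB                  = "50"   # scale IOPS with the size of the volume
    allowAutoIOPSPerGBIncrease = "true" # bump small volumes up to the supported minimum
  }
}
```

With iopsPerGB, larger volumes get proportionally more IOPS instead of everything sitting at the minimum.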
So what else happens when we run load tests? The autoscaler needs to scale up nodes to support all these pods. Not a problem, right? You need more nodes, you open up the AWS console and click away. Or so I would have thought. Unfortunately, it's not that simple. We need to talk about quotas. When you're building on a new cloud provider and creating new accounts, you start with very small default limits. And it's important to remember that not all regions are the same. These regions are backed by actual physical data centers somewhere in the world, and as consumers of public cloud services we're basically blind to the capacity of those regions. I have no idea if region A or region B has a larger capacity behind the scenes. It's usually not a problem: you submit a support ticket, ask for more quota, and the majority of the time this is resolved within minutes. However, sometimes it takes days, and sometimes, believe it or not, they might just say no. If you've recently created an account on a new provider and are expecting to scale through hundreds or thousands of nodes right before December, you might discover that's not going to be possible and you may have to delay your plans.

So the fifth lesson is: pay attention to quotas. Know which quotas you'll need to increase and submit those increase requests early. Understand that you're at the mercy of a physical data center's capacity, and monitor and alert on quotas. You don't want to find out that nodes are failing to start because you've hit a quota and nobody's opened a support ticket yet.

So to recap the five lessons. Provider services are similar, but they're not the same; understand the additional complexity you're opting to take on when you go multi-cloud. Just get started; don't get stuck in documentation hell. The earlier you get started, the earlier you discover problems and limitations. Don't blindly do what was done before; look for possible improvements throughout the project, not only in your new environments, but in the old environments as well. Load test your applications; discover limits and issues before your customers do. And know your quotas; save yourself some headache, pay attention to quotas, and understand that the cloud is not infinite.

That covers most of the issues I'm familiar with. So what went well? Somehow I've talked for 30 minutes at a KubeCon and barely talked about Kubernetes itself. There's a reason for that: it just worked. Once we had all the clusters up and running, we were able to use our standard tooling to deploy the majority of workloads without modifications. And I guess that's exactly what we're aiming for. This is the perk of using a standard interface for deploying and managing our applications. The majority of the issues I've discussed were related to the underlying cloud infrastructure, not Kubernetes itself. There were a few exceptions to this. I mentioned managed load balancers: things like Ingresses or Services, which actually rely on those load balancers, needed some adjustments. And any configuration that referred to object storage, again, going from GCS to S3, we had to tweak.
But I think this shows that when you're expanding to a new cloud provider, the benefit of using Kubernetes to deploy and manage applications is that once everything is set up, you're just able to do so. So thank you for your time and for the opportunity to speak here. I should also mention we're hiring for pretty much every team, so please check our careers page or come chat with me after, and I'm happy to answer any questions.