Hi, thank you for coming to my talk. I'm Mike Splain, and we're going to be talking about building, managing, and automating clusters at scale with Prow.

A little about me real quick: I can be found online at mikesplain. I work as a cloud infrastructure lead at Sonos; previously I worked at Barclays and PayPal, where my container and Kubernetes work first began. I've been using Kubernetes since before the 1.0 days in various forms, such as the good old Kubernetes-Mesos project and multiple homegrown solutions. Eventually that led to working with kOps and becoming a kOps maintainer. I also founded the Boston Kubernetes meetup, and I hope to continue that when this is all over.

At Sonos, I want to talk a little bit about our journey into Kubernetes, so let's see how we got here. Application teams at Sonos build, manage, and deploy their services in production, and these teams often deploy multiple times per day. With over 29 million products around the world, that leads to some very interesting cloud infrastructure. Unfortunately, over the years our teams have had to duplicate each other's code, which we really want to fix, and we want to speed up deploy times as well. That's where Kubernetes comes in.

I often get this question: "So I start by building a cluster, right?" Not quite. Let me explain. There are quite a few things to consider when you're building your Kubernetes clusters. Let's start with CIDR blocks and networking to other resources: you may need to define them based on your company's setup and culture, so you may need to make a lot of adjustments and changes there. DNS is pretty critical for any service that wants to be externally accessible, as is security configuration for users and automation access. You may want to enable alpha and beta Kubernetes API features as well; that requires certain flags that you'll need to understand and version. Many times a successful cluster also requires things like cluster monitoring, logging, and alerting. Autoscaling and node removal are pretty critical in my mind for Kubernetes, so that my team doesn't have to manage individual instances and can focus on managing the clusters and the automation behind them.

The problems I've heard people run into when they build clusters: first, clusters become their source of truth. Rather than putting everything in code like a regular developer would, we end up putting everything in the cluster, and that doesn't quite work. Cloud providers also sometimes become the source of truth when you use tags or automation that relies on their history, and then you can't repeat that through the rest of your environments. All of this relies on humans to decouple and manage the infrastructure, which generally doesn't scale very well.

So let's get back to basics. Here are the requirements we set when we went to build our Kubernetes clusters. First of all, we wanted everything to live in Git: all the configuration can be checked in and managed via source control and pull requests. We wanted everything to be idempotent: you can run it as many times as you need to make sure everything is up to date. And almost every cluster should be as identical as possible.
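To make that "everything lives in Git" requirement concrete, here is a minimal sketch of the kind of checked-in cluster definition a tool like kOps (the provisioning tool discussed next) consumes. The cluster name, CIDR range, versions, and feature gate are hypothetical placeholders, not Sonos's actual configuration.

```yaml
# cluster.yaml -- a hypothetical kOps Cluster spec kept in source control.
# Names, CIDRs, and versions are illustrative only.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: test-cluster.example.com
spec:
  kubernetesVersion: 1.21.5
  networkCIDR: 10.10.0.0/16        # CIDR planning done up front, in code
  topology:
    dns:
      type: Public                 # DNS so the API endpoint is externally reachable
  networking:
    calico: {}                     # networking provider can differ per cluster
  kubeAPIServer:
    featureGates:
      EphemeralContainers: "true"  # alpha/beta features enabled via flags, versioned with the spec
```

Because a spec like this lives in a pull request rather than in the cluster or in cloud-provider tags, it can be applied idempotently and replayed into every environment.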
There will obviously be differences between some clusters here and there, which allows us to try out things like different networking providers or other cluster resources. However, they need to be configured and checked in the exact same way.

That leads me to configuration. We really want to be able to check in and manage individual clusters' configurations as we need to. In general, we should be able to abstract this across multiple clusters of the same type, say production clusters versus test clusters. However, there are many times when you're going to want to automate and test specific configurations on specific clusters over time.

We really wanted to embrace fast iteration. Much like modern application teams, DevOps needs that as well, and our DevOps teams here at Sonos really focus on building Kubernetes in a way that lets us move quickly. First, we wanted our scripts to work the same locally as they work in CI, so we can always come back locally and work through a problem we may run into in the cloud. All of our tools are versioned, and all of our changes are deployed with code. Even API changes that may be difficult to manage, we try to handle by writing scripts and code that roll the upgrades out across our environments.

So let's talk about our tools. First we have kOps. We did consider many other options at the time, but kOps has a large community and is multi-cloud, so it was a pretty easy pick for us. We then went on to Terraform to build the rest of our cloud resources: things like Route 53 zones, S3 buckets, IAM roles for things like Loki and Thanos that you may want in your cluster, complex networking changes, and other custom tooling we might want to add on the fly. Maybe we want to schedule actions on our Auto Scaling groups so that they all spin down on the weekends; things like that we don't have to over-engineer, because we know how to do them in Terraform and can add them whenever we'd like.

The next pieces we added are helmfile and Helm, which drive a lot of our cluster components and how they run. Helm is used to deploy to the clusters, while helmfile describes how we deploy to those clusters. This gives us the flexibility to decouple in the future and deploy individual components as we see fit, and we can quickly implement workarounds for things like API changes or breaking Helm chart versions, which would be difficult without this automation in place. We also use yq and a bunch of custom scripts to help manage the different values that drive our clusters; these values select specific bootstrap components, which lets us figure out which components we should run in which cluster.

On top of that, all of these tools are versioned, which allows us to upgrade individual components on individual clusters as we want. Tools are pulled down and executed dynamically within the repo, similar to tools like asdf. More on that in a little bit.
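As a rough illustration of the split just described, here is a minimal helmfile sketch in which Helm does the deploying and helmfile describes what gets deployed per environment. The release names, chart versions, values paths, and environment names are hypothetical, not the actual Sonos layout.

```yaml
# helmfile.yaml -- hypothetical sketch of per-cluster component management.
repositories:
  - name: metrics-server
    url: https://kubernetes-sigs.github.io/metrics-server/
  - name: autoscaler
    url: https://kubernetes.github.io/autoscaler

environments:
  production:
    values:
      - environments/production/values.yaml   # hypothetical values file managed with yq/scripts
  test:
    values:
      - environments/test/values.yaml

releases:
  - name: metrics-server
    namespace: kube-system
    chart: metrics-server/metrics-server
    version: 3.8.2                             # pinned; bumped via pull request
  - name: cluster-autoscaler
    namespace: kube-system
    chart: autoscaler/cluster-autoscaler
    version: 9.21.0
    installed: {{ eq .Environment.Name "production" }}   # component selected per cluster type
```

Because every chart version is pinned in a file like this, a breaking chart release can be held back or worked around in a single reviewed change rather than by hand-editing clusters.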
So let's talk about bringing up a cluster. This is what bringing up a cluster at Sonos looks like. On the left you see where we run the source command; this sets up our environment, AWS credentials, and secrets. On the right you see the actual workflow involved in bringing up and managing our clusters. First we run make cluster, which you could break down into smaller pieces like "make cluster up" or "make helmfile sync," but the one command will also suffice. It first templates and manages the configuration, then spins up the cluster using kops create or kops update, depending on the type of workload. Then we run terraform apply to add our additional changes to the cluster. Once the cluster is up and running, we apply our helmfile changes using helmfile sync.

Now, the cool thing with all this is that if a user doesn't have a copy of a specific version of a tool, we download that tool and cache it locally when it's needed. This way we ensure every human or CI user has the exact tool versions they need, which is very useful with tools like kOps or Terraform that can contain significant changes between versions. This can also easily be cached in CI between runs.

This is what it looks like in code. You simply declare which cluster you'd like to operate on, you run the source command so we get our credentials and everything, and then you run the make cluster command. And this is what that same command looks like when we run it in CI. We'll get back to this in a second if you're not familiar, but you may notice it looks much like a Kubernetes pod spec; if you look in the center there, you'll see the command make cluster.

Now you may say, what about day 2? What happens after that? You have to worry about updates and upgrades: things like kOps, Kubernetes itself, Terraform, Helm, Helm charts, kubectl. That's a lot of things to manage, right? What about adding new features, things like metrics, logging, and security? How are we going to test all this and make sure it works? Are you feeling overwhelmed? Because we were too.

The best thing we figured we could do was add as much structure to this as humanly possible. That's where GitOps comes in. It allows us to drive automation through pull requests and easily manage many small changes over time, and we can run thousands of tests to make sure our code works to the best of our ability. That's where Prow comes in.

Prow is the Kubernetes CI system built and maintained by the Kubernetes SIG Testing team, built on similar principles to Kubernetes itself. This is a screenshot of what Prow looks like. This is a simple cluster I spun up locally, but you can also see it at prow.k8s.io. Up top you see the jobs that have run recently and how long they've taken to run, and down below you can click into specific jobs if you're not familiar.

Let's dive a little bit into Prow itself. Prow has job execution very similar to Kubernetes: it allows you to define pod specs that it adds specific components to, which makes it much easier to automate. It allows you to have plugins and slash bot commands in PRs, which makes it really easy to add additional tooling whenever you see fit. Everything for Prow lives in source control, so you can automate your Prow updates as well as the jobs themselves via changes in source control. We also really like that you can have jobs defined inside the central Prow repo configuration itself as well as in the application repos; that allows us to configure jobs that make sure the jobs living outside the Prow repo are properly defined. We also find that Prow is very fast and scalable, which allows us to easily upgrade Prow whenever we see fit, and it runs in high availability, which is very useful in the cloud these days, as everyone would know. We also really appreciate the merge management feature, which allows multiple pull requests to be approved at the same time, tested together, and merged all at once. Below you can see a couple of links to Prow itself and the GitHub link to the Prow section of the test-infra repo.
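To tie this back to the earlier CI invocation of make cluster: a Prow job wraps an ordinary Kubernetes pod spec, so a job that runs the cluster build could look roughly like the sketch below. The org/repo, job name, container image, and resource values are hypothetical placeholders, not the actual Sonos job definition.

```yaml
# Hypothetical Prow presubmit wrapping the `make cluster` workflow.
presubmits:
  example-org/kubernetes-automation:
    - name: pull-cluster-build-test
      always_run: true
      decorate: true                 # Prow injects the clone/log-upload containers around the pod
      spec:                          # a plain Kubernetes pod spec
        containers:
          - image: example.com/cluster-tools:v1.2.3   # placeholder image with kOps/Terraform/helmfile baked in
            command: ["make"]
            args: ["cluster"]
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
```

Because the job is just a pod spec running the same make target engineers use locally, CI and local runs stay identical, which is exactly the requirement from earlier.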
Now let's talk about the types of jobs you can run in Prow. The main types are presubmits, postsubmits, and periodics. Presubmits run on a pull request: when that pull request is opened, those presubmits get kicked off, and you can define which presubmits are required before the PR can merge and which are not. Postsubmits run on a branch, after a merge; in general you define the branch you want those postsubmits to run on, and they run as soon as code is pushed to those branches in GitHub. Periodics are just what they sound like: periodic jobs that run on a cron, much like cron jobs.

So let's take a look at a presubmit example. This is an example we took from the test-infra repo, but we use it to test our YAML as well. You can see here that it is simply set to always run, and it pulls down a yamllint tool. Then we define a config for yamllint and pass in a couple of values files as well as folders for it to check.

This is another example of a presubmit job; this one's a little more complex. In this case we have a script that we use to build and test our cluster configuration in a kind cluster. If you're not familiar with kind, I would check it out. It's great, it stands for Kubernetes in Docker, and we use it very heavily to test our cluster configurations. This allows us to quickly build and deploy clusters and their configurations and get test results without waiting for a full EC2 cluster to spin up.

Other presubmit jobs that we run: we lint all of our helmfiles, we lint all of our config files themselves, and we also run a helmfile upgrade test, which at the end outputs a diff of the difference between before and after the change, so we can easily see what the changes are. We also have Terraform validation, and we can add jobs very quickly whenever we want. Pull requests are tested in about 5 to 10 minutes, and once they're approved they're batch merged using Tide, which is a Prow component. We're not going to get into all of the components, but I'll give you some links at the very end of my talk pointing to some great talks from previous KubeCons.

Now let's take a look at a postsubmit job. You can see it has a max concurrency of one, which means it'll wait for the first job to finish. You can also see that there's a run_if_changed; this lets us watch a given folder and only run in the cases where a specific file has changed. You can see it also runs on our develop branch and runs the exact same commands as before. This is an example of a job that we run whenever we merge code into our Kubernetes automation repo, from any branch into develop.
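A postsubmit with those two knobs could look roughly like the sketch below. The repo, image, and path regex are hypothetical; the point is only the max_concurrency and run_if_changed behavior just described.

```yaml
# Hypothetical postsubmit: runs after merges to develop, one at a time,
# and only when cluster-related files actually changed.
postsubmits:
  example-org/kubernetes-automation:
    - name: post-cluster-deploy
      max_concurrency: 1                                    # wait for the previous run to finish
      branches:
        - ^develop$
      run_if_changed: '^(clusters/|helmfile\.d/|Makefile)'  # gate on relevant paths (placeholder regex)
      decorate: true
      spec:
        containers:
          - image: example.com/cluster-tools:v1.2.3
            command: ["make"]
            args: ["cluster"]
```

run_if_changed is matched against the files touched by the merge, so unrelated changes in the repo don't trigger a cluster deploy.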
Now you're probably saying, that's great and all, but how does that scale? Do you really want to upgrade all clusters when you're merging to a branch? Wouldn't you want to gate it in some way?

Let's talk about a real-world example. This pattern led us down the path of separating our configuration from our automation code, so let me explain. First of all, we have a repo for the automation and tooling of our scripts. This follows GitFlow and has releases, with a develop branch for development and a main branch representing our major releases. We also have a repo for our cluster configs that can be managed on a separate cadence; its main branch represents what is running in our clusters. Those cluster configurations now contain a Git SHA that represents the code from our automation repo that is running in the cluster. This allows us to manually deploy specific code and specific branches for testing, and it also allows us to hold risky changes back from production and specific environments until our teams are ready.

Similar to earlier, this is what one of our postsubmit Prow jobs looks like. First, you can see that we have a run_if_changed here for specific files under a specific cluster; this allows us to trigger jobs only on values changes and SHA file changes. Over here you see the branch, our main branch, that this runs on. Of course, we'll also need somewhere in Slack to send any errors that may occur. Down here you'll see our magic deploy hash. This file path is passed into our configuration job; the job reads that file, takes the hash stored in it, and checks out our automation repo at that commit. Then, as normal, we run make cluster, and we also rolling-update the cluster if it needs it.

So that sounds great and all, but there's one more piece to this puzzle. Let's talk about the autobumper. The autobumper takes care of the pesky updates to the Git SHAs in our config repo and opens PRs to update the cluster from time to time. Let's take a look. If you're familiar with the test-infra repo under the Kubernetes org, you may have seen jobs like this; it runs an autobumper similar to the one we run. The autobumper runs to update specific values in specific files, and we extended it to look for successful Prow job deploys to our clusters. Once we have a successful deploy, we open a PR with the new Git SHA if it has changed.

These are the jobs associated with our config repo's pull requests. The first three jobs run helm lint on our configuration, validate our Terraform, and yamllint our values to make sure everything is set before we upgrade a cluster. These are all defined in what's called in-repo config, in a .prow.yaml file. In the future we hope to add additional jobs that will run kops update without applying changes and Helm upgrades as a dry run, so we can ensure those changes are prepped and ready to go as well. The next job is defined in our centralized Prow repo, and it confirms that the .prow.yaml file in this config repo is formatted and configured properly. Sometimes, if you misconfigure a job, this may be the only job that successfully runs, which gives you a heads-up that you may have a typo and broken something.

Finally, we have Tide, which is a Prow component. Tide will wait for LGTM and approved labels from your teammates. This can be useful to let anyone in your org LGTM a release if they feel it benefits them, while still requiring approval by a defined approver from the OWNERS file in order to auto-merge. If changes happen in the upstream branch between the time the branch was created and merged, Tide will ensure your tests are rerun against the post-merge state prior to actually merging.
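For reference, an in-repo .prow.yaml carrying presubmits like those three lint jobs might look roughly like this. The job names, image, and make targets are placeholders; the point is that the config repo carries its own presubmit definitions alongside the values it gates.

```yaml
# .prow.yaml -- hypothetical in-repo presubmits for the cluster config repo.
presubmits:
  - name: pull-config-helm-lint
    always_run: true
    decorate: true
    spec:
      containers:
        - image: example.com/cluster-tools:v1.2.3   # placeholder image with helm/terraform/yamllint installed
          command: ["make"]
          args: ["helm-lint"]                        # hypothetical make target
  - name: pull-config-terraform-validate
    always_run: true
    decorate: true
    spec:
      containers:
        - image: example.com/cluster-tools:v1.2.3
          command: ["make"]
          args: ["terraform-validate"]
  - name: pull-config-yamllint
    always_run: true
    decorate: true
    spec:
      containers:
        - image: example.com/cluster-tools:v1.2.3
          command: ["make"]
          args: ["yamllint"]
```

Presubmits like these become the checks Tide waits on, alongside the LGTM and approved labels, before the merge that ultimately triggers a cluster deploy.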
These autobumper jobs run as periodics a few times a day. The jobs look for successful Kubernetes cluster deploys; once they find one, PRs are opened and the Git SHAs are updated automatically. Humans then go take a look at those pull requests, review the diff of the Kubernetes automation code that changed between the version updates, and approve them. The merge occurs, and at that time the cluster is updated automatically. This workflow has worked out really well for our teams at Sonos.

So let's talk about some of our future goals. First, we want to improve our autobumper changes and contribute them upstream into test-infra; they apply to a wide variety of use cases, and we think other teams could benefit. We also want to use TestGrid to view our test results and deploy configurations across our clusters. We want additional Slack integrations, such as announcing before deploys are going to occur. We also want to set up a job template so we can automatically generate all the Prow jobs we need for different use cases, and we want better contents in our pull requests so it's really easy to review the changes that automation drives.

Now, here are a few resources that I think could be useful for you. The first is a link to the publicly accessible Prow dashboard for Kubernetes. Next is the test-infra repo, where you'll find the actual Prow code. Third, I've open-sourced a tool that makes it really easy to spin up a Prow cluster locally using kind. Finally, there are some really great talks on this from previous KubeCons, and I encourage you to take a look.

Thank you for your time. At Sonos specifically, I'd like to thank our team: Scott Michala, David Muckel, Dan Miller-Medson, and Chris Solevy. Thank you for having me and for coming to my talk. Now let's take some questions.