I'm Jesse Newland, the Principal Site Reliability Engineer at GitHub, and I'm really excited to be able to share a glimpse into GitHub's Kubernetes infrastructure with you today. I'd like to start with a quick story, a story about a software project that never got off the ground. Around four years ago, in the fall of 2013, the GitHub Ops team all met in San Francisco. We had just finished a big data center migration, which had bought us a ton of time in our race to keep up with traffic, with user growth, and with DDoS attacks. As a team, we were really excited about this newfound time. We were full of ideas about moving up a layer of abstraction. So at this team meeting, we started talking seriously about building a platform. A handful of engineers on the team had experience building and running platform-as-a-service offerings and had a design in mind that used containers instead of VMs. We sketched out some of the rough concepts and split up some follow-up tasks, and the team all went back to their home offices and started researching the problem space. We picked a name and started to get more serious about the work.

But as we started to learn more about the state of container technology at the time, we began to get a better sense of the amount of work ahead of us. There was a lot to do, and we started to realize it was going to take us quite a bit of time to deliver something that solved the problems in our sights. As we started comparing notes with others in the organization, we got feedback from a few senior engineers that was initially surprising. They encouraged us to wait, to do nothing. Let someone else build this, they said. The design is not bad. In fact, it's good, but it's worthless without execution, and we're not in the right spot to ship something like this at the moment. This project needs a team that has more experience with containerized workloads and the resources necessary to push the tech a bit further. This project would be better served by an organization with experience managing long-running projects. We want to see this happen, they said, but we're not the right group of folks to do it, especially not right now. Instead, we need to work on leveling up our organization and our technology for storing Git repositories.

We took that feedback to heart and agreed to call the project off. I was disappointed personally, and so were a lot of others on the team. But in retrospect, I think this was some of the best technical advice I've ever had the privilege to receive. There were a lot of other things that GitHub needed to work on. We really weren't in the right place to execute on a project like this at the time. I'm glad that we focused our time and attention where we did. We really needed to grow in other ways.

Now, I didn't tell you this story to make any sort of false claim to the idea of a platform or anything of the sort. In fact, quite the opposite. I told you this story to set up a sincere expression of gratitude to everyone who was making different decisions just a few months later, deciding to dedicate the time, the experience, the resources, and the emotional labor necessary to kick off the Kubernetes project. We're incredibly grateful for Kubernetes at GitHub and want to extend that gratitude to every single member of this incredible community. GitHub is standardizing on Kubernetes as our service runtime across our entire engineering organization, and we hope to complete that transition over the next few years. We're making decent progress towards this goal.
Right now, 20% of the services we run are deployed on Kubernetes clusters inside our data centers, including the service that serves github.com and api.github.com, the service you use when you open an issue or review a pull request on the Kubernetes repo. I'm excited to tell you more about the Kubernetes infrastructure that powers this one small part of the Kubernetes community today.

The Ruby on Rails application that serves GitHub's web UI is also developed on GitHub and lives at github.com/github/github. That's a bit of a mouthful, so it's become known inside the organization colloquially as "github.com, the website," which I find pretty amusing. But to our Kubernetes clusters, it has a different name: it's the github-production namespace. When we started the process of migrating github.com, the website, to Kubernetes, we hadn't settled on the design for isolating workloads running on the same cluster, and RBAC was still in alpha, so we reserved a few clusters for the exclusive purpose of running this workload and a few other supporting services. We're getting more confident with our isolation design over time and are excited to revisit this soon to more densely pack our workloads.

We also didn't have a lot of experience operating clusters at the beginning of this project. We did know how to break one, though. We had recently broken a cluster pretty badly while trying to roll out an API server config change, and broken another one when we tried an upgrade. To ensure we could hit our targets for the reliability of github.com, we decided to run its workload on multiple clusters in each site. We didn't end up using any of the existing federation tools to do this, but instead rely on our deployment tooling's support for deploying to multiple partitions in parallel. This multi-cluster approach has been pretty useful recently, as it's allowed us to replace entire clusters without interrupting the flow of traffic or the flow of work.

The three clusters running the .com workload in our primary site provide around 1,500 CPUs and five terabytes of RAM each. We're currently running about seven different types of nodes in these clusters, so they vary pretty widely in terms of node count. Running the same workload on a wide variety of node types has helped our data center teams evaluate the relative performance and efficiency of a bunch of different nodes. We've recently been preferring these tiny little machines in a configuration that gives us 40 CPU threads, 128 gigs of RAM, and two local SSDs per node. We packed tons of these into each rack and used node labels to ensure that the Kubernetes scheduler automatically distributes work across those racks.

Over the past few years, GitHub has invested a ton of resources into our data center management practices and processes. This investment has paid off pretty well. We've built out some incredible facilities and have shipped provisioning APIs and automation that resemble what you might expect from a public cloud provider's VM control plane. Engineers can provision Kubernetes nodes from chat, but instead of getting back a VM, our metal cloud deals out one of these machines, hooked up to our high-performance network. Node configuration is managed by our existing Puppet infrastructure, which we use to build the images used in the provisioning process. Configuration changes and security updates can be applied with a pull request to our Puppet repo and deployed to an individual node, a cluster, or an entire fleet.
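As a quick aside on that rack spreading: here's a minimal sketch of one way to express it with Kubernetes primitives, assuming a hypothetical rack label on each node and preferred pod anti-affinity keyed on that label. The label key, node name, deployment name, and image are illustrative, not our actual configuration.

```yaml
# Hypothetical node label recording which rack a node lives in.
# In practice a label like this would be applied by provisioning automation.
apiVersion: v1
kind: Node
metadata:
  name: kube-node-0123
  labels:
    failure-domain.github.example/rack: rack-42
---
# A deployment that asks the scheduler to avoid stacking its replicas
# in a single rack, using preferred pod anti-affinity on the rack label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: example-web
              topologyKey: failure-domain.github.example/rack
      containers:
      - name: web
        image: nginx:1.25
```

With a preference like this in place, the scheduler spreads replicas across racks whenever it can, which limits the blast radius of losing any single rack.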
So inside the github-production namespace on each cluster, there are three primary deployments: Unicorn, Unicorn API, and a thing called Consul Service Router. Each pod in the Unicorn deployment contains three containers: first, an NGINX container that accepts and buffers incoming requests; then a container that runs Unicorn, the Ruby on Rails web server, which processes requests sent over a domain socket from NGINX, makes requests to backing services like MySQL or Spokes, our Git storage tier, and then renders HTML; and finally, a proxy called Failbot that accepts, decorates, and buffers exceptions reported by the Rails app before sending them to an exception tracking system.

The Unicorn API deployment powers api.github.com and is practically identical to the Unicorn deployment. We deploy it separately due to some differences in seasonality. Computers wake up and go to sleep more frequently than we do, it turns out. The third deployment running in the github-production namespace is Consul Service Router, which routes traffic from application processes to services not running in our Kubernetes clusters, like our search clusters or our webhook delivery service. It uses a combination of HAProxy and consul-template. This service and this approach are basically a direct re-implementation of the way we solved this problem before Kubernetes. We're hoping to replace this hand-rolled service mesh with Envoy sometime soon, as part of a project that aims to improve the observability of all service-to-service traffic.

To accompany our metal cloud, we built GLB, a horizontally scalable load balancer service that can route external or internal traffic to a group of nodes. We use GLB, along with NodePort services, to get traffic into our clusters, currently using every node in the cluster as a backend. Most of the NodePort services configured in GLB route to an ingress deployment, but a few skip that step for various reasons.

In addition to configuring pretty standard logging, monitoring, and metrics stacks on our clusters, we built a handful of tools to support ongoing cluster operations. There's Kube Test Lib, which continuously runs a suite of conformance tests to make sure our clusters meet our specification. Kube Health Proxy, which adjusts the weight of incoming traffic at the load balancer level and allows us to turn off an entire cluster. Kube Namespace Default, which creates default resources like LimitRanges in each new namespace and configures image pull secrets. Kube Pod Control, which detects and deletes stuck pods and will set node conditions if it notices that an individual node has a bunch of pods that aren't able to start for some reason. And then Node Problem Healer, which detects node conditions that are set either by Kube Pod Control or by Node Problem Detector and heals those nodes by rebooting them.

A few of our teams have been building systems on Kubernetes for around a year, and it's been a joy to observe how their workflows have changed over that time. All of the tools I just mentioned are real software projects, with tests and everything. They each have their own repo, which contains the software, a description of how it should be packaged, and a blueprint for running it in production. A year ago, we might have reached for a bash script to solve some of these problems, and to distribute that script to a class of nodes, we might have checked it into our Puppet repo.
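Before I get into how those workflows have changed, here's a minimal, hypothetical sketch of the Unicorn deployment I described a moment ago: NGINX and the Rails app sharing a Unix domain socket over an emptyDir volume, with Failbot alongside. The image names, socket path, and exception-tracker address are illustrative, not our actual manifests.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: unicorn
  namespace: github-production
spec:
  replicas: 3
  selector:
    matchLabels:
      service: unicorn
  template:
    metadata:
      labels:
        service: unicorn
    spec:
      volumes:
      # Shared emptyDir holding the Unix domain socket that NGINX uses
      # to talk to the Rails app (path is illustrative).
      - name: sockets
        emptyDir: {}
      containers:
      - name: nginx
        image: example/nginx:latest            # accepts and buffers incoming requests
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: sockets
          mountPath: /run/unicorn
      - name: unicorn
        image: example/github-unicorn:latest   # Rails app served by Unicorn over the socket
        volumeMounts:
        - name: sockets
          mountPath: /run/unicorn
      - name: failbot
        image: example/failbot:latest          # accepts, decorates, and buffers exceptions
        env:
        - name: EXCEPTION_TRACKER_URL          # hypothetical setting
          value: "https://exceptions.example.internal"
```

Again, the specifics are made up; the point is the three-container shape: NGINX in front, the Rails app behind a domain socket, and Failbot riding along in the same pod.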
Now, I really, really like bash, but that script-in-the-Puppet-repo workflow left a lot to be desired. Instead, we've ended up building projects like Node Problem Healer that use Kubernetes' incredible APIs to solve problems. Node Problem Healer in particular has been a huge win for our team, as it fixes a class of problems that sometimes happen in the middle of the night. But to be honest, I think it's also been a win because the NPH abbreviation makes us all think of Neil Patrick Harris more often, which, I mean, cheers me up every time I see it.

We've also built a tiny wrapper around kubectl get that we can run in chat. It's been incredibly useful as we help a ton of engineers learn how to build systems using Kubernetes primitives. But these chat ops are read-only. We generally encourage engineers to use our deployment tools to declaratively manage the Kubernetes resources that describe their infrastructure as much as possible.

As opposed to a traditional versioned release strategy, GitHub has historically used a continuous delivery approach called the GitHub flow, where we deploy approved but open pull requests and then merge them once the behavior has been verified. To map this sort of workflow to Kubernetes, we settled on a basic set of conventions, which I'll try to describe with some pseudocode. First, if a repo contains a Dockerfile, we'll build that image on each push, tag it with the name of the service and the current Git commit, and push it to our internal registry. Second, each service and deploy-target combination gets its own namespace, like github-production, for example. Next, we expect our deployment tooling to load and process Kubernetes YAML from config/kubernetes/<environment> in each service's repo. This step can modify resources, filter them, or inject additional ones. It also updates image fields to use the image built for the commit being deployed, and performs some other actions that are configured with annotations. And then finally, we expect our deployment tooling to declaratively apply the modified resource config using the same semantics as kubectl apply.

What's really rad about this set of conventions is that, combined, they support changes to either the application or its Kubernetes configuration using the exact same workflow. No matter what sort of change you're looking to make, the workflow looks the same: you create a branch, you add some commits, you push your changes up to GitHub, and you open a pull request. On that PR page, you'll see that a CI job has built a container image. Next up is everyone's favorite part of the pull request process: code review. During the course of reviewing a pull request, it's often pretty helpful for both feature authors and reviewers to be able to preview the effect of a change, especially if that change targets UI elements. To help with this, we support deploying to a review lab target. Each review lab gets a unique URL, named after the branch being reviewed, that any GitHub engineer can log into. During a review lab deployment, our deployment tools load the YAML in config/kubernetes/review-lab and make a few modifications: the image field is updated, a secret is injected from our internal encrypted secret store, and an ingress is injected to configure the domain. Our deployment tooling then creates a new namespace named after the branch you're deploying and applies the modified config into that namespace.
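To give a rough idea of what a review lab deploy produces, here's a hedged sketch of the per-branch namespace plus the kind of Ingress our tooling might inject to configure the domain. The branch name, hostname, and service details are hypothetical, and the injected secret isn't shown.

```yaml
# Namespace created by the deployment tooling, named after the branch
# being reviewed (branch name is illustrative).
apiVersion: v1
kind: Namespace
metadata:
  name: review-lab-my-feature-branch
---
# Injected Ingress exposing this lab at a branch-specific URL
# (domain and backend service are hypothetical).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: unicorn
  namespace: review-lab-my-feature-branch
spec:
  rules:
  - host: my-feature-branch.review-lab.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: unicorn
            port:
              number: 8080
```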
This workflow brings up an entirely isolated front-end tier for each branch that needs to be reviewed, which is automatically cleaned up after 24 hours. Once your PR has been reviewed and approved, the next step is to deploy it to production. But deploying a change to a large system like GitHub is inherently risky, and we're always looking for ways to reduce that risk. We know that GitHub going down can have a significant impact on the software industry, right? Workflows are interrupted, release schedules are affected, all sorts of productivity is lost. So after looking really closely at all the factors that have contributed to outages in the past, we added another step to our production pipeline a few years back.

The first step of a production deploy is now a canary deploy, which exposes your change to a small percentage of traffic to help you confirm that it doesn't materially regress the performance or increase the error rate of requests flowing through the system. To support this canary workflow, our deployment tooling looks for deployments with a couple of special annotations: one declaring a deployment as canary-able, and another selecting a label to be modified. During a canary deployment, our deployment tooling clones any deployment marked as canary-able, sets the value of the label you've chosen to canary, and then deploys just that cloned deployment. So in a normal state of operation, the flow of traffic to our Unicorn deployment looks something like this: a unicorn service sends traffic to all pods labeled with service=unicorn. But during a canary deployment, a single-pod unicorn-canary deployment is created from the new version of the code, with an identical service label but a different role label. Since our service resource doesn't include the role label in its selector, traffic goes to the canary pod at the same rate as any other pod. I'll show a rough sketch of this label scheme in a moment.

Once the deploying engineer has verified that the error rate and latency of the canary deployment haven't regressed, they then deploy their change to production. This fans out their change to a bunch of different targets. Some of this app's workloads still run on metal, so those are updated in parallel, with our deployment tooling essentially applying the configuration from config/kubernetes/production to each Kubernetes target.

Now, this canary deployment model isn't unique to GitHub, and to be honest, our implementation of it is pretty naive at the moment. But it's important to point out that previously, this deployment workflow was only available when deploying the github/github service, not any of our other services. This is really important for us: now that we've implemented the canary workflow using Kubernetes primitives, all of the other services deployed to our Kubernetes clusters can use it too. Network effects like this are by far the biggest benefit we've seen as a result of adopting Kubernetes. Adopting Kubernetes as a standard platform has made it easier for GitHub to build features that apply to all of our services, not just github/github. We're really happy about this and are excited to keep working to improve the experience our engineers have deploying and running any service at GitHub. That's really important to us, as it supports another goal we're working towards: by providing a first-class experience for newer and maybe smaller services, we're helping to encourage the decomposition of our Ruby on Rails monolith, which is another important goal for our engineering organization.
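Here's that rough sketch of the canary label scheme: a minimal, hypothetical example, with made-up annotation names standing in for the canary-able and label-to-modify markers our deployment tooling actually looks for, and placeholder image tags where the tooling would substitute the built image.

```yaml
# Normal state: the service selects only on the service label.
apiVersion: v1
kind: Service
metadata:
  name: unicorn
  namespace: github-production
spec:
  selector:
    service: unicorn          # no "role" key, so canary pods also receive traffic
  ports:
  - port: 80
    targetPort: 8080
---
# The regular deployment, labeled role: production. The annotations below
# are hypothetical stand-ins for the markers described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: unicorn
  namespace: github-production
  annotations:
    deploy.example.com/canary-able: "true"
    deploy.example.com/canary-label: "role"
spec:
  replicas: 20
  selector:
    matchLabels:
      service: unicorn
      role: production
  template:
    metadata:
      labels:
        service: unicorn
        role: production
    spec:
      containers:
      - name: unicorn
        image: "registry.example.com/unicorn:current-sha"   # placeholder tag
---
# During a canary deploy, the tooling clones the deployment above, sets the
# chosen label to "canary", points it at the newly built image, and deploys
# just this single-pod clone.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: unicorn-canary
  namespace: github-production
spec:
  replicas: 1
  selector:
    matchLabels:
      service: unicorn
      role: canary
  template:
    metadata:
      labels:
        service: unicorn
        role: canary
    spec:
      containers:
      - name: unicorn
        image: "registry.example.com/unicorn:new-sha"        # placeholder tag
```

Because the service selects on service=unicorn alone, the single canary pod receives roughly its proportional share of traffic, which is exactly what lets a deploying engineer watch error rates and latency before fanning the change out everywhere.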
So I'll leave you today with a quick word about GitHub's plans for Kubernetes in the new year. In the coming year, we plan to focus a lot of our energy on enablement with Kubernetes internally. That is, we're going to be continuously evaluating the needs of our engineering teams and iterating on our container orchestration platform to meet those needs. One request we've heard from a handful of engineers is for persistent volume support in our on-prem clusters. State is pretty important for GitHub, right? Storing your data is basically the core of our product, but right now our Kubernetes clusters can only run stateless workloads. A solution to this problem is slightly complicated for us by our metal cloud's lack of a service like Google Cloud Persistent Disk or EBS. But even if we had such a service, network storage wouldn't give us the performance we need for some of our systems, especially Git. Many of our systems are designed to use the SSDs available on each node to locally store a small part of a replicated, distributed system. The local storage management proposal includes a use case covering persistent local storage that strongly matches some of the systems we're running today. Spokes, our distributed system that stores and replicates Git repositories, looks like a great fit for this use case, as does part of our MySQL infrastructure, especially the fleet of replicas that we use to scale out read traffic. We're excited to begin experimenting with local storage soon.

We're also excited to break ourselves out of some bad open-source habits in the new year. None of the projects I've described in this talk are open source yet, and to be honest, I'm incredibly embarrassed to be up here and tell you that. I'm super excited that we're going to be able to commit some time both to open-sourcing these existing tools and to paving a path that will enable us to build future tools like this in the open by default. It's the least we can do to give back to the most kind, welcoming, and talented community I've ever had the privilege of considering myself a part of.

So thank you so much for letting me share a bit of GitHub's story with you this morning. And if you're interested in chatting more about anything that I just mentioned, please do reach out or find me at the GitHub booth later today. Thanks again.