All right. Hello everyone. Bonjour, good morning. Really nice to see everyone here. I believe this is the first breakout right after the keynote, so I hope you're all excited for the rest of the conference as well. First of all, thanks for joining. Before we get started with the session, a quick introduction about me: I'm Karan, and I lead international developer relations at GitHub, heading a team of DevRel professionals who support ecosystems in India, Latin America, and other parts of the world. I'm very much still a developer at heart, passionate about DevOps, infrastructure, AI, and a whole lot of other things, and that passion has led me to build and maintain a couple of services at GitHub that many internal GitHub employees use.

Now, we'll be covering quite a few things here, but the first thing I want to talk about, very briefly, is: what is GitHub? Many of you won't need a lengthy explanation, because you most probably use it as part of your work or for your personal projects. But for those of you who aren't familiar with where GitHub is right now and what GitHub is doing, a quick introduction is summed up in this graphic: GitHub today is a complete AI-powered developer platform, used by more than a hundred million developers to collaboratively build, test, and deploy software and to manage the entire software development workflow across hundreds of millions of repositories, with a whole suite of products that can help you across your software lifecycle.

That's the scale of GitHub for its users. Now, how many of you here really work on the ops or administration side of Kubernetes? Okay, quite a few. How many of you are just interested in knowing what happens behind the scenes at GitHub? Quite a few. I'm assuming the rest of you are a mix of both. Some of you might be interested in the scale at which GitHub itself runs behind the scenes, so let me give you a small overview of what that scale looks like, especially on our CI/CD side of things.

If we talk about GitHub as a platform and its codebase, you can find a lot more about GitHub's architecture on the GitHub blog. But essentially, there is a monolith, and GitHub's engineers, for the monolith alone, setting aside a whole lot of microservices, perform more than 20 deployments a day. Now you might be wondering: thousands of engineers, 20-plus deployments a day, how does that work?
We explain a lot more about this on one of our blog posts as well, because these 20-plus deployments are for the main GitHub monolith, and each deployment includes multiple PRs that are deployed and merged using GitHub merge queues. What this means is that these deployment pipelines run almost 15,000 CI jobs in an hour, and I'm being very conservative here because, as I mentioned, this is mainly for the monolith and leaves out a whole lot of other microservices. Those CI jobs consume a staggering 125,000 build minutes in a single hour, which is almost 150,000 cores of compute. As a quick back-of-the-envelope check on those numbers, 125,000 build minutes per hour means roughly 2,000 jobs executing at any given instant, and 125,000 minutes across 15,000 jobs works out to an average job length of around eight minutes. So along with knowing the scale of GitHub as a platform, this one CI/CD metric gives you the scale of GitHub on the development side.

Of course, this is massive. GitHub is a huge codebase with a whole lot of microservices, multiple products and features, and thousands of engineers working together, and it's this scale that makes GitHub what it is today. Which brings me to the question: how is some of this possible? What does it look like for a typical engineer, and what does it look like on the platform side for us within GitHub?

For GitHub engineers to efficiently code, build, and ship software, we provide them with what we call a paved path. You might have heard similar terminology from other organizations, but at GitHub, the paved path is a comprehensive suite of tools, applications, processes, runtime platforms, and so on that help with deployment, hosting, and a whole lot more, and that GitHub engineers use to run microservices, both for the GitHub.com platform itself and for many internal tools.

Let me give you a brief overview of what this paved path really looks like. GitHub's main paved path covers everything needed for running software: creating, deploying, scaling, and debugging applications and microservices. It's an ecosystem of tools, including Kubernetes, Docker, load balancers, and a lot of custom apps, so that a more cohesive experience can be provided for engineers. Don't be mistaken when I say paved path: it's not just infrastructure, and it's not just Kubernetes. Kubernetes is, of course, our base layer, and the paved path is really a mix of conventions, tools, configuration settings, and processes built on top of it. Kubernetes itself runs in a multi-cluster, multi-region topology; a little more on that later. Now, why this? Why a paved path? How does it help us, beyond the benefits of using Kubernetes itself?
With our paved path based on Kubernetes and some of the other runtime apps, we're able to plan capacity centrally across all of these services, since smaller and larger workloads exist on the same machines; capacity planning happens centrally for us. And because of that central capacity planning, we're able to scale rapidly as and when there's more demand. We're also able to consistently provide service owners with insights into app and deployment performance, almost down to what a specific pod or a specific container is doing. And because Kubernetes is our base layer, it becomes easy to manage all of the configuration and deployments in a central control plane, without having to jump across different places. The kinds of services we typically run on the paved path include small web applications, computation pipelines, batch processors, monitoring systems, and so on.

Now, behind most of the really awesome stuff, there's always a hero, a rock star, who helps make these things happen. At this point, I want to introduce one of our rock stars at GitHub. He's a very formidable force, and when you say his name, almost every GitHub employee knows who you mean, because he's a lot of fun too: Hubot.

Hubot is an open source framework for writing chatbots, and a standardized way to share scripts between everyone's robots for automating all kinds of things. We at GitHub wrote the first version of Hubot, and of course made it open source, to automate our company chat rooms. Hubot knew, and still knows, how to do a lot of things, like deployments and tests. But he led a very private, introverted, and messy life, so we rewrote him. Even today we use Hubot to automate a lot of tasks in our development processes, from kicking off CI builds to running entire deployments, very easily through ChatOps commands.

As an example, we use Hubot to deploy a specific branch from a specific repository into a certain environment. We use a methodology called branch deployments, rather than deploying main; you can find more about why we do that and the rationale on the GitHub blog. To run a deployment of a branch with Hubot, we just go into a specific ChatOps room and say "hubot deploy <repo>/<branch> to <environment>", and things kick off from there.
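To make that concrete, here is what such a ChatOps invocation might look like in a chat room. The repository, branch, and environment names below are hypothetical, and the exact syntax GitHub uses may differ:

```
hubot deploy my-service/fix-timeout-bug to staging
```

Part of the appeal of ChatOps here is that the deployment happens in a shared room, so everyone sees what's shipping where, and the chat history doubles as an audit trail.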
I'm not going to dive a lot deeper into this part, but I wanted to use it to set the context for how all of this really helps. You can read about it in much more detail in one of the blog posts I co-authored on the GitHub blog; just go to gh.io/paved-path to learn how this all works: what containerization looks like, what a deployment pipeline looks like, and so on.

Now, I gave you some examples: we use the paved path for web apps, computation pipelines, batch processors, and so on. But we also use it to run a very key component of one of our products, and I'm sure you can guess which one: GitHub Copilot. For those of you who might not be aware, GitHub Copilot is our AI coding assistant, widely adopted by developers worldwide, with multiple features that help developers and organizations be a lot more productive. One of those features is a chat-based interface that offers AI-based coding assistance: you give it a prompt and generate code in multiple different scenarios. You can go to gh.io/copilot to learn a lot more about it.

One of the key components we run on our paved path for Copilot serves the client-side extensions. The client-side extensions of Copilot communicate with a server-side API to deliver all of those AI-based suggestions that developers use while coding, and we run that API on our paved path. As you can guess, this API gets a massive number of requests and needs to handle a lot of load, because millions of developers are using GitHub Copilot daily for their everyday work. This is just one example of the kind of scale and load that our paved path, and ultimately our underlying Kubernetes infrastructure, has to support, and together with a lot of our other services, it makes a very high degree of reliability and availability absolutely necessary.

What it comes down to is that delivering this reliability and availability is only possible when we're able to operate our Kubernetes clusters efficiently. Think about it: an end user of Copilot or one of our other services expects reliability and availability; that expectation falls on the paved path, which in turn falls on our Kubernetes infrastructure, which means that being able to operate a large multi-region, multi-cluster topology becomes much more important. Of course, you might have heard of a lot of different ways to administer this, from automating on top of the native API constructs to third-party tools.
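Just to make the shape of that workload concrete, here's a minimal Go sketch of the kind of stateless suggestion endpoint this client-server split implies. It's purely illustrative: the route, payload fields, and stub behavior are my assumptions, not Copilot's actual API.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// completionRequest/completionResponse model the client-server split:
// the editor extension sends context, the server returns a suggestion.
// Field names are illustrative assumptions, not Copilot's actual API.
type completionRequest struct {
	Prompt string `json:"prompt"`
}

type completionResponse struct {
	Suggestion string `json:"suggestion"`
}

func main() {
	http.HandleFunc("/v1/completions", func(w http.ResponseWriter, r *http.Request) {
		var req completionRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// A real service would call a model backend here; we return a stub.
		resp := completionResponse{Suggestion: "// generated code for: " + req.Prompt}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A stateless API like this is a natural fit for a Kubernetes-based paved path, since it can scale horizontally behind a load balancer as request volume grows.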
We chose to do that a little differently, mainly because of the scale at which we need to support our paved path, and partly because of some of our engineering and security practices, so we implemented a custom solution to operate our Kubernetes clusters more efficiently. One of the things we do at GitHub is run as much as possible through code and codify a lot of our tasks, be it something as simple as managing identity and access for employees or something more specialized like infrastructure as code. For managing our own Kubernetes infrastructure, we decided to adopt the same approach of managing it through code, and to be honest, we took some inspiration from how Kubernetes orchestration itself works. Let me give you an idea of what this really looks like for us.

What we've done is create an internal platform versioning spec that helps us create our own configurations, bundle common components into a well-known config, and produce a deployable artifact. This helps us make sure that a set of common components works together as a platform at the Kubernetes layer. It also lets us treat our infrastructure platform almost like a state machine, but at a higher abstraction level than what some other tools might offer, while ensuring the flexibility to override configuration on a per-cluster level whenever needed.

As an example, the spec looks something like the sketch I'll show in a moment. You'll see it's very similar to a CRD, but not exactly; we wanted it that way. It encompasses things like your Kubernetes version, etcd version, network plugins, secrets, and a whole lot more, all versioned together in a single platform spec.

These specs are maintained in a separate internal repository, organized by cluster, since we operate in a multi-cluster topology, and grouped by platform version. These spec configurations are then hydrated, expanded into a full suite of configuration along with the necessary files, systemd units, and everything else required to run an entire Kubernetes cluster, and stored in an artifact repository, ready to be deployed. I'll show you how this happens: how does the artifact actually make its way into a cluster, starting from a platform spec like this?

Let's look at a scenario. Whenever a new configuration has to be deployed to the cluster, say to perform an etcd upgrade or anything else that's needed, there is first an intervention by one of the operators, a Kubernetes admin. They go into the repository where all of the platform version specs live and modify the relevant platform spec, either through a version bump or a per-cluster configuration override, whatever needs to be done.
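Here is a minimal, hypothetical sketch of what such a platform version spec could look like, reconstructed only from the fields mentioned above (Kubernetes version, etcd version, network plugins, secrets, per-cluster overrides). It is my illustration, not GitHub's actual internal format:

```yaml
# Hypothetical platform version spec -- illustrative only,
# not GitHub's actual internal format.
apiVersion: platform.internal/v1
kind: PlatformVersion
metadata:
  name: platform-v42
spec:
  kubernetes:
    version: "1.29.4"
  etcd:
    version: "3.5.12"
  networkPlugin:
    name: cilium            # assumed; the talk only mentions "network plugins"
    version: "1.15.3"
  secrets:
    providerRef: vault-prod # assumed reference to a secrets backend
  overrides:                # optional per-cluster overrides
    - cluster: prod-eu-1
      kubernetes:
        version: "1.29.3"
```

Versioning all of these components together means a cluster's entire platform can be described, diffed, and rolled forward or back as one unit, which is what makes the state-machine treatment possible.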
They then create a pull request with all of those changes, because all of these platform version specs live in a repository. When the pull request is created, we automatically kick off CI, which creates a bundled artifact with all of the changes. So, like I mentioned, the platform version spec defines the desired platform, but when CI gets kicked off, it does all of the hydration and expansion and creates the deployable artifact as a single bundle, which is stored in the artifact repository. When the pull request has to be deployed, that bundle, completely expanded and hydrated, along with any other configuration in the repository, gets deployed to every node in our underlying Kubernetes infrastructure.

Now, this is at the CI level, or rather the CD level: creating the bundled config and deploying it to every node. But what does the node-level rollout process look like? We came all the way from the paved path, which really helps us be successful with a whole lot of our services, and we saw how we work with the platform version spec. But what really happens at the node level? What does the lifecycle look like for the nodes, and what are the orchestration pieces? Let's take a quick look at that.

Before that, let's look at the components of this design at the node level in our Kubernetes infrastructure. This is a simplified version, but here's a control plane node: you can see the usual standard components, like your container runtime, kubelet, etcd, your workload pods, and so on. And here's an even more simplified worker node, with its kubelet, container runtime, and workload pods. So that's a standard control plane node and a worker node.

Apart from these, we also deploy a few other components on our nodes as part of the provisioning process itself. On all of the nodes, irrespective of whether it's a control plane node or a worker node, we deploy a custom node agent. What is this node agent and why is it important? I'll come to that. Specifically on the control plane nodes, we also deploy a coordinator agent. So there is a node agent on every node, plus a coordinator agent on the control plane. And apart from all of this, we also have an event bus for all of the communication between the node agent, the coordinator agent, and so on. Like I mentioned, it's a little bit of what works for us, for our scale and our practices, and a little bit of inspiration from Kubernetes orchestration.

You might recollect this from the previous slides, but just to reemphasize: whenever there is a PR change to the platform cluster version, the bundled configuration created in CI and deployed through the pipelines ends up on the file system of all of the nodes.
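To picture how the coordinator and node agents might talk over the event bus, here's a minimal Go sketch of the kinds of messages such a design implies. The type names and fields are assumptions for illustration, not GitHub's actual internal schema:

```go
package events

// Hypothetical event types for the coordinator <-> node agent bus.
// Names and fields are illustrative assumptions.

// UpdateRequested is published by the coordinator to ask one node
// to converge to a desired platform version.
type UpdateRequested struct {
	Node            string // node the update is addressed to
	PlatformVersion string // desired platform version, e.g. "platform-v42"
	BundlePath      string // where the CD pipeline placed the bundle on disk
}

// UpdateResult is published by the node agent when convergence
// finishes, successfully or not; the coordinator consumes it to
// decide whether to proceed to the next node.
type UpdateResult struct {
	Node            string
	PlatformVersion string
	Succeeded       bool
	Error           string // populated on failure
}
```

An event bus decouples the two agents: the coordinator doesn't need to reach into nodes directly, and a node that is mid-reboot can pick up its pending event when it comes back.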
Of course, at the point when the deploy happens, the CD pipeline has deployed the bundled configuration and extracted it on all of the nodes. But it's still just an artifact; it doesn't really do anything by itself. It needs to be applied for the desired changes to come into effect, and that's really where the coordinator comes into the picture.

The coordinator is a daemon agent that runs on each control plane node, and it's responsible for orchestrating all of the cluster-wide configuration changes that need to be done. How does it do this? I'll share a few examples, in no particular order. It writes the desired cluster state into a ConfigMap on the cluster, so it's available to be referenced through the Kubernetes API. It writes the desired node platform version onto the node resources, so there can be a separate platform version for worker nodes and for control plane nodes. It handles cordoning nodes and putting them into a maintenance mode; maintenance can mean more than just being cordoned. It ensures that only a safe budget of cluster resources is unavailable at any time. It publishes events to the event bus to trigger a node to update; that's the means of communication. And it consumes the success and failure events generated by the node agents.

Now, you might be thinking: isn't this something we could do using some off-the-shelf tool, or by automating on top of the Kubernetes API itself? I'll come to that in a bit. To perform all of these operations, there are a few systems the coordinator really needs to communicate with. First, of course, the Kubernetes API itself: for example, before putting a node into our custom maintenance mode, it needs to do a simple cordon and drain, plus any tainting that might be needed, so it works extensively with the Kubernetes API. It also communicates with a lock service: because the coordinator is a daemon agent running on every control plane node, it wouldn't be feasible to run it without a lock service ensuring that only one coordinator process is orchestrating at a time. And of course, it communicates with the event bus, to talk to the node agents and the rest of the system.

So what does the coordinator do at the end of the day? It looks at the cluster-wide configuration; it doesn't look at any of the node-specific configuration. It keeps iterating until all of the nodes are running the desired platform version; remember, it publishes the desired node platform version to all of the node resources. It keeps iterating until that happens, or until enough nodes become unavailable that the cluster budget is exceeded, in which case operator intervention might be needed to figure out what's really happening.
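Here's a heavily simplified Go sketch of what one coordinator step could look like, using client-go to check a safety budget and cordon a node before publishing an update event. The budget constant, package layout, and publish callback are my assumptions for illustration:

```go
package coordinator

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maxUnavailable is a hypothetical safety budget: how many nodes
// may be out of service at once.
const maxUnavailable = 2

// withinDisruptionBudget counts cordoned nodes against the budget.
// Simplified: a real check would also consider NotReady nodes, node
// pools, and so on.
func withinDisruptionBudget(ctx context.Context, cs kubernetes.Interface) (bool, error) {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	unavailable := 0
	for _, n := range nodes.Items {
		if n.Spec.Unschedulable {
			unavailable++
		}
	}
	return unavailable < maxUnavailable, nil
}

// reconcileNode sketches one coordinator step: check the budget,
// cordon the node, then signal its node agent (via the event bus
// publish callback) to converge to the desired platform version.
func reconcileNode(ctx context.Context, cs kubernetes.Interface, nodeName, desiredVersion string, publish func(node, version string) error) error {
	ok, err := withinDisruptionBudget(ctx, cs)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("disruption budget exhausted; pausing rollout")
	}

	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true // cordon; a real rollout would drain too
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	return publish(nodeName, desiredVersion)
}
```

The key property this sketch tries to capture is that the coordinator only orchestrates: it decides which node goes next and keeps the cluster within budget, while the actual change happens on the node itself.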
Now, like I mentioned, all of this is about cluster-wide configuration. But something still needs to actually perform the platform version updates on the node, and that's where the node agent comes in. The node agent is also a daemon that runs on all of the nodes, workers as well as control planes. The platform version spec can differ a little depending on whether it's a worker node or a control plane node, but at the end of the day, the node agent is responsible for ensuring that the current platform version on that specific node converges to the desired platform version. And if you recollect, I mentioned that as part of our CD pipeline, the bundled config has already been deployed to all of the nodes; the node agent is what's responsible for making sure that bundled config is actually applied and running, because that's the node's job, not the coordinator's.

So what does it do? It hydrates any node-specific templates: most of it will already have been hydrated and expanded in CI, but there may be node-specific templates that it re-hydrates if they weren't pre-hydrated in CI. It clears out the container runtime and kubelet data directories as needed to move to the desired version. It moves files into place, like any newer binaries or systemd units, and does the reloading and reconfiguring at the node level. It runs any necessary system configuration through a config management tool. And it does things like restarting services, whatever is needed.

The node agent begins converging the node to the desired platform version after consuming an event published by the coordinator on the event bus. If it converges successfully, it publishes an update on the event bus, which the coordinator consumes; otherwise, it publishes a failure event, and the coordinator determines what really needs to be done.

Like I mentioned earlier, the way for an operator to perform some of these operations, if you recollect the diagram, is to start off with a pull request that changes the platform version spec. We've gone a step further and created internal CLI tools to make some of these changes and processes much easier. For example, we have a CLI tool that can perform some of the common tasks: version updates and patch updates, running some of the CI jobs locally, creating a local development cluster just to test out platform version changes, building images, and so on, so that it's much easier for operators to interact with the coordinator as well.
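To tie the node agent's responsibilities together, here's a compact Go sketch of what a converge step might look like. The step order, paths, and service names are assumptions for illustration, not GitHub's actual agent:

```go
package nodeagent

import (
	"fmt"
	"os/exec"
)

// converge sketches the node agent's job: apply the bundle that the
// CD pipeline already placed on disk so the node reaches the desired
// platform version. Paths, commands, and ordering are illustrative.
func converge(desiredVersion, bundleDir string) error {
	steps := [][]string{
		// Re-hydrate any node-specific templates not expanded in CI
		// (hypothetical helper shipped inside the bundle).
		{bundleDir + "/bin/hydrate-templates", "--node-local"},
		// Move new binaries and systemd units into place.
		{bundleDir + "/bin/install-artifacts", "--version", desiredVersion},
		// Pick up new or changed systemd units, then restart the kubelet.
		{"systemctl", "daemon-reload"},
		{"systemctl", "restart", "kubelet"},
	}
	for _, s := range steps {
		if out, err := exec.Command(s[0], s[1:]...).CombinedOutput(); err != nil {
			// On failure, the agent would publish a failure event to the
			// event bus; here we just surface the error.
			return fmt.Errorf("step %v failed: %v: %s", s, err, out)
		}
	}
	// On success, the agent would publish a success event for the
	// coordinator to consume before it moves on to the next node.
	return nil
}
```

Because the bundle is fully hydrated ahead of time, the agent's work is mostly mechanical: move files, reload, restart, and report back, which keeps the per-node failure modes simple to reason about.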
We also have another CLI tool with which the operator can interact directly with the coordinator, whichever instance is the coordinator leader, instead of going through a platform version change: to see, of course, the current state of the coordinator and the agents, who the leader is, and so on. If needed, it can also be used to manually change the state of the cluster, but that usually only happens in scenarios where we have to debug something; normally, everything goes through the pull request workflow I mentioned.

So, quickly summarizing: we looked at how some of the services all of you might be using depend on the paved path we've created for developers, how that paved path is responsible for serving such massive requests at scale, how it depends on the resilience of the Kubernetes platform, and how that, in turn, depends on how efficiently we're able to operate it. And that's where this agent-based lifecycle management design I spoke about really helps us reliably and efficiently operate our Kubernetes clusters at a very large scale. It works very well when coupled with some of our other tools, be it configuration management tools, infrastructure as code tools, CI/CD pipelines, Kubernetes, and of course the GitHub platform itself. Adding to that, things like the CLI tools further speed up these changes and make it very easy to quickly ship changes to our infrastructure. At the end of the day, the bottom line I was trying to get at is that, for us at least, how reliably we operate our Kubernetes infrastructure is a very key deciding factor in the dependability of our internal paved path and the services it runs. As an example, think back to the first example I gave: Copilot using an internal API running on the paved path to make all of this happen.

So that's a quick summary of this agent-based design that really helps us do all of these things. That's about it. Thank you so much. If this has been useful for you, please share your feedback on Sched (sched.com), or drop us a note on GitHub. I'm happy to take the conversation forward, here or at the GitHub booth at D3, or you can reach out to me as mvkaran on Twitter, LinkedIn, and so on. Thanks a lot, thanks for your patience, and I hope you have a great rest of the conference. Thank you so much. Merci beaucoup. Bonjour.