Good morning, everyone. Welcome to my session. I'm Shingmar from eBay, and it's such an honor to be back in Shanghai again to talk about our cool Kubernetes projects. Thank you all for attending. Today I'm going to talk about how we build and manage Kubernetes with Kubernetes: basically, eBay's fleet management system and the way we run our private cloud with Kubernetes and GitOps.

So this is the agenda. First, I'm going to give a very quick introduction to eBay's Kubernetes deployments. After that, I'm going to introduce our fleet management system based on Kubernetes, which builds and runs our private cloud. As a deeper dive into that system, I'm going to walk through the way we build and manage IaaS with Kubernetes (basically, how we build compute providers based on Kubernetes) and how we build eBay's Kubernetes clusters with our Salt operators and controllers. In the end, I'm going to share our GitOps practices for running our fleet, and a lot of our cool features inspired by Kubernetes.

First things first: eBay is one of the earlier adopters of Kubernetes. We started in 2015, and we run an internal distribution called Tess.io. Within the past few years, we've moved massive production workloads into containers running on Kubernetes. As of now, we have 50-plus production clusters. We run multiple VPCs for different environments, and we have both a flat network and an overlay network based on OVN. We have multiple clusters of 2,000-plus nodes with heavy production workloads. For now, we have roughly 160,000 pods running on about 30,000 hosts. Most of them are bare metal, because we only run bare metal for production.

We have many different kinds of workloads running on Kubernetes now. We have our web services, for sure. We have databases, search engines, Hadoop, and AI and machine learning as well. At the past European KubeCon, I gave a talk about our practices for running high-performance workloads and our performance tuning. If you are interested, you can always find the slides and the presentation online.

We are also running on the edge. We have around 20, actually 15, edge clusters across the world, where we run our Envoy edge proxies as well as software load balancers on top of Kubernetes, and where we do a lot of SSL termination and web caching to accelerate our remote eBay users' experience. We are rapidly growing our footprint on the edge as well. So our ultimate goal is to unify our fleet with Kubernetes, and we are trying our best to get there.

Now, after getting some idea of eBay's Kubernetes footprint, you might wonder: how do we get there? The short answer is that we use Kubernetes to make it all happen. That is our fleet management system, which builds and manages eBay's private cloud, the one that runs Kubernetes, with Kubernetes itself. As most of you know, Kubernetes is not just a container platform; it is a portable system that can be used to do many different things in many different ways. We use it to manage and build our private cloud. That is our fleet management system.

Like many other Kubernetes services, our system runs with an API server and an etcd cluster. At eBay, we automated this process so that you can spin up a few pods and create an API server and etcd with the etcd operator. You can run them in a single cluster, or run multiple clusters as an HA setup. Once you have the API server set up, we create a lot of CRDs, which is the fun part: the modeling. We model almost everything and anything.
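To make that concrete, here is a minimal, hypothetical sketch of what one of those models might look like as a Go type for a custom resource. The object and field names (a Rack with switches and subnets) are illustrative assumptions, not eBay's actual schema.

```go
package v1

// A minimal, hypothetical sketch of how a data center object such as a rack
// could be modeled as a Kubernetes custom resource. Real CRD types would
// embed metav1.TypeMeta and metav1.ObjectMeta from k8s.io/apimachinery;
// plain structs are used here to stay self-contained.

// RackSpec is the desired state: what we know about the rack.
type RackSpec struct {
	DataCenter string   `json:"dataCenter"`         // e.g. an availability zone
	Switches   []string `json:"switches,omitempty"` // top-of-rack switches
	Subnets    []string `json:"subnets,omitempty"`  // CIDRs available from this rack
}

// RackStatus is the observed state, filled in by controllers.
type RackStatus struct {
	Phase  string `json:"phase,omitempty"`  // e.g. Discovered, Bootstrapped
	Assets int    `json:"assets,omitempty"` // number of compute assets found
}

// Rack ties spec and status together, just like built-in Kubernetes objects.
type Rack struct {
	Name   string     `json:"name"`
	Spec   RackSpec   `json:"spec"`
	Status RackStatus `json:"status,omitempty"`
}
```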
On the hardware side, we model our entire data center, from racks to switches to VPCs, subnets, routes, and eventually the compute assets that we can provision hosts from. We also cover the software stack: we start from the operating system, we model compute nodes, and since we build Kubernetes with Salt, we model the whole Salt stack as well. So it pretty much covers everything from infrastructure to application, and from hardware to software.

Once you have all those CRDs and models, you write a lot of controllers. We write provisioning controllers that provision compute nodes just like Kubernetes creates pods, and the node pool controller creates a set of compute nodes with the exact same set of config. We support multiple providers, and we have built a homegrown provisioning system that I'm going to talk about later. Then we have a very powerful Salt operator that can build Kubernetes with Salt: we can install a Salt master on a compute node from a Git commit ID, and then you can build a cluster out of it. We have Salt deployments and a few other controllers to manage a set of Salt minions, so that you can run multiple Kubernetes masters and Kubernetes nodes. We have many other controllers as well: for example, rack controllers, which take care of bootstrapping; the scheduler; IPAM, which does IP allocation; and of course DNS controllers, plus remediation controllers that take care of the compute nodes' lifecycle management. With all those controllers and models, we have many functions inspired by Kubernetes, for example transactions, rolling updates, disruption budgets, and all kinds of stuff. So this is typical model-driven automation, and this is the way we think we can unify eBay's fleet with Kubernetes.

OK, let's get to the architecture view of a single availability zone's fleet management system deployment, together with its models. Here is the API server and etcd, and then you have Tess Master, which is a set of many, many controllers running on top of this API server. What's inside that API server? On the left side, you can see a lot of IaaS bits, for example hardware and racks; on the right side, you can see a lot of Kubernetes bits, for example Salt and Kubernetes nodes. The fun part is that the whole fleet management system is also running inside a Kubernetes cluster that it builds. You can build a new availability zone from an external cluster that manages the new availability zone and builds the new clusters, and after that, you can migrate the whole thing inside that availability zone so that it's self-contained. It's funny that you can have the building blocks running inside a cluster that they built.

Next, I'm going to take a deeper dive into both the IaaS part and the Kubernetes part to see how we manage them with the same fleet management system. First, let's zoom in on the IaaS layer to see how we build and manage it. Kubernetes needs compute from providers. We have different providers, for example OpenStack as a private cloud provider and GCP as a public cloud provider, and our fleet management system needs to define interfaces for our controllers to work with those providers. A typical example: a provisioning controller would define the interfaces to create, delete, and reimage computes, and then different providers' clients implement those interfaces. We support OpenStack and GCP for now.
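As a rough illustration of that provider abstraction, the controller-facing contract could look something like the sketch below. The interface and method names are hypothetical assumptions for illustration, not eBay's actual code.

```go
package provider

import "context"

// ComputeProvider is a hypothetical sketch of the interface a provisioning
// controller might define so that it stays agnostic of the backing cloud.
// Each provider (OpenStack, GCP, or a homegrown bare metal system) supplies
// its own implementation.
type ComputeProvider interface {
	// Create provisions a new compute instance from an OS image and flavor,
	// returning a provider-specific instance ID.
	Create(ctx context.Context, name, osImage, flavor string) (string, error)

	// Delete tears the instance down and releases its resources.
	Delete(ctx context.Context, instanceID string) error

	// Reimage reinstalls the operating system while keeping the same asset.
	Reimage(ctx context.Context, instanceID, osImage string) error
}

// A controller can then be wired to any provider at startup, for example:
//
//	var p ComputeProvider = newBareMetalProvider(cfg) // or OpenStack, GCP...
//	id, err := p.Create(ctx, "node-001", "centos-atomic", "generic-3disk")
```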
As many of you know, eBay used to be a huge OpenStack shop, and our first few Kubernetes clusters ran on OpenStack as well. But once we were running more and more production workloads, we tried to run most of our stuff on bare metal, and we started to think: why not build a bare metal compute provider ourselves, so that we can get rid of all the complexity of OpenStack and don't need to deal with so many different components, especially RabbitMQ. That thought drove us to build a homegrown bare metal compute system that does simple stuff: Kickstart and preseeding. It's a Kubernetes system, and it's a provider for our fleet management system, so that the system can build bare metal within itself. It's self-contained and highly scalable.

Once we did that, we started to create CRDs for all the data center bits, and that inspired us to go further and manage and orchestrate the fleet, which means we model the entire data center. That empowered us to do many other things, for example managing the compute nodes' lifecycle: you can do remediation, you can do clean-ups, and it can serve as a CMDB as well, because many of your objects are already in Kubernetes. Eventually, after we had bare metal clusters built out, we still missed VMs, because not everything is in containers for us; we still have Windows workloads. How do you deal with that? That's how we started to think: why not build a VM provider with Kubernetes as well, so that we can get VMs from those Kubernetes clusters?

So this is the overview of our IaaS layer as managed by Kubernetes. I'll mostly focus on the compute bits, so let's zoom in and start from the get-go. This is the lifecycle of a typical eBay production server. In our private data center, once a new rack is landed and powered on, all the assets find the DHCP server through the switch's DHCP helper and get an IP, and DHCP points the next-server at TFTP. We have a default iPXE setup so that each machine gets a PXE boot image, and from that PXE boot image, we have a startup job that pulls a Git repo to do a lot of bootstrapping jobs, for example setting up the BMCs and doing discovery, so that we know all the bare metal specs: serial numbers, asset tags, racks, et cetera. All of this is sent as a payload to Foreman. Foreman is a server lifecycle management system that connects the provisioning system with TFTP and DHCP, and it provides the PXE and HTTP infrastructure for your installation. Once the payload is sent to Foreman, Foreman sends it to our registration system, which populates the CMDB, and then our fleet management system can kick in and model the rack.

Our data center modeling is based on the rack as a unit. From a rack, we can see all the subnets of that rack, and then the routes, the L2 domains, the IP addresses, and eventually the compute assets. This (well, it's a little bit small) is the compute asset object. You can see there are BMC IPs, eth0 MAC addresses, manufacturer serial numbers, all kinds of stuff that you can use later to provision bare metal hosts. So once the whole thing is modeled, we have the assets and subnets, right? They are CRDs and objects in the API server, and we can provision hosts.
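Since the compute asset slide is hard to read, here is a hypothetical sketch of what such an object might look like. The field names are made-up illustrations based on the description above, not the actual schema.

```go
package v1

// ComputeAssetSpec is a hypothetical sketch of the discovered facts about a
// bare metal server, populated by the bootstrapping jobs described above.
type ComputeAssetSpec struct {
	AssetTag     string `json:"assetTag"`     // inventory asset tag
	SerialNumber string `json:"serialNumber"` // manufacturer serial number
	Manufacturer string `json:"manufacturer"` // server vendor
	BMCIP        string `json:"bmcIP"`        // out-of-band management address
	Eth0MAC      string `json:"eth0MAC"`      // MAC address used for PXE booting
	Rack         string `json:"rack"`         // reference to the Rack object
}

// ComputeAssetStatus tracks whether the asset is free or allocated, so that
// the scheduler can pick free assets when a compute node is requested.
type ComputeAssetStatus struct {
	Allocated bool   `json:"allocated"`
	Node      string `json:"node,omitempty"` // compute node bound to this asset
}

type ComputeAsset struct {
	Name   string             `json:"name"`
	Spec   ComputeAssetSpec   `json:"spec"`
	Status ComputeAssetStatus `json:"status,omitempty"`
}
```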
In the previous slide, we modeled Foreman. Foreman also serves here as our TFTP and HTTP service, so unlike a traditional provisioning system, where you send a massive payload to an API, you just create a compute node object in the API server. It's pretty much all on this slide. You create the compute node with an OS image (we use CentOS Atomic), and then you specify a flavor. A flavor is an object we created that tells controllers to automatically generate a Kickstart snippet to partition the disks. And then you have the specs for the asset selector, based on the hardware or specific functions.

Once you create the compute node, the scheduler starts to look for free assets, and it fills in the status field, as well as an annotation and specs, to allocate a specific asset. After that, IP allocation starts working on it. It subscribes to the specific asset tag on the object and finds its network zone, which is basically the VPC; you can see here that my network zone is production. It finds the asset, finds its subnet from the rack and the switch, picks the production subnet, and then finds an IP from that specific subnet. Then it annotates the object and puts the IP into the compute node's spec, so that the next controller can start the actual provisioning. That is the main provisioning controller, which takes the specific compute node and creates a Foreman host. In the middle of that, it also automatically generates the partitioning bits, so that it can tell how to partition the disks: for example, I have three disks, I need to set up LVM, I need to set up mdadm. The whole thing you can model in the CRD, and our controller will automatically generate and populate it.

So basically, the controllers work in order: they subscribe to different fields of the compute node object. They also put finalizers onto the compute node object, so that when you delete the compute node, the DNS controller, for example, will see "OK, I put a specific finalizer there for DNS, so I'm going to remove the DNS records as well," IP allocation will release the IP, et cetera, et cetera. The whole provisioning system works like this. It's completely declarative, and you can track each stage from the status of the compute node object.
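Here is a rough, hypothetical sketch of what such a compute node object might look like as a Go type. The fields (flavor, asset selector, network zone) follow the description above, but the names are illustrative assumptions, not eBay's actual schema.

```go
package v1

// ComputeNodeSpec is a hypothetical sketch of the declarative request for a
// bare metal host. Controllers subscribe to different fields and fill in
// what they own (scheduler: asset binding; IP allocation: addresses; ...).
type ComputeNodeSpec struct {
	OSImage       string            `json:"osImage"`                 // e.g. a CentOS Atomic tree
	Flavor        string            `json:"flavor"`                  // drives Kickstart partitioning (LVM, mdadm)
	NetworkZone   string            `json:"networkZone"`             // VPC, e.g. "production"
	AssetSelector map[string]string `json:"assetSelector,omitempty"` // hardware constraints
	AssetTag      string            `json:"assetTag,omitempty"`      // bound by the scheduler
	IP            string            `json:"ip,omitempty"`            // filled in by IP allocation
}

// ComputeNodeStatus lets you track each provisioning stage.
type ComputeNodeStatus struct {
	Phase      string   `json:"phase,omitempty"`      // Scheduled, IPAllocated, Provisioning, Ready...
	Conditions []string `json:"conditions,omitempty"` // simplified; real objects use typed conditions
}

type ComputeNode struct {
	Name string `json:"name"`
	// Finalizers mirror the deletion flow described above: each controller
	// (DNS, IP allocation, ...) adds its own and removes it after cleanup.
	Finalizers []string          `json:"finalizers,omitempty"`
	Spec       ComputeNodeSpec   `json:"spec"`
	Status     ComputeNodeStatus `json:"status,omitempty"`
}
```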
So that pretty much covers the provisioning bits, and you can provision bare metal hosts. Next, we started to think about how we run virtual machines, right? Let's assume you have bare metal provisioned and you have a bare-metal-only Kubernetes cluster. We're jumping ahead a little bit, but we were wondering if we could get VMs for, for example, our Windows workloads, and let Windows run on virtual machines running on top of Kubernetes.

Basically, VMs on Kubernetes come in two major technologies. One is Kata Containers, which is super popular; I'm pretty sure most of you have already heard of it. It's pretty much about running secure workloads natively in Kubernetes. And then there are pure virtual machine solutions, like Mirantis' Virtlet and Red Hat's KubeVirt. We pretty much do both. We use Kata as a runtime for our secure workloads, so that you get a very quick virtual machine spin-up for the specific secure pod you want to run, and the Kata agent inside the VM creates the containers and pods for you. But Kata is not so easy, at least to me, to use as a VM provider to get VM computes. That's why we explored some other solutions, and we eventually picked Virtlet.

So Virtlet is our compute provider. You create a pod in a specific Kubernetes cluster, and you get a virtual machine; basically, the VM is created as a pod. That's only for our non-container workloads like Windows, as I've mentioned many times. We can deploy any QCOW2 image that we used to run in OpenStack, so it's great for the non-container stuff we have: you can just throw it in and run it in our fleet management system as well. By embedding Virtlet as our VM provider, we can build a Kubernetes cluster on top of a Kubernetes cluster, which is the fun part, right? You have the VM as your compute provider, and then you can build a VM Kubernetes cluster on top of a bare metal Kubernetes cluster. That's super useful for us, because we want it for a lot of CI/CD; we do a lot of end-to-end testing, because we run our own distribution of Kubernetes.

And this is pretty much a compute node; actually, it's a node pool spec for Virtlet. The only difference you can see between this one and a bare metal compute node is that it's way shorter. It doesn't have all the flavor bits; all it has is the VM flavor from our Kubernetes clusters, plus provider equals Virtlet, together with other bits like SSH keys and such. It creates a pod in the Kubernetes cluster like this, and that pod gives us a virtual machine. We worked through a few areas like config drive and cloud-init, but eventually we can get virtual machines from Kubernetes.

So now we have all those computes, and we're going to build Kubernetes clusters. We build Kubernetes with Salt. Since we are a Kubernetes system, we start with models and CRDs, right? So we model the Salt stack. We have Salt masters and Salt minions as objects, both managed by Salt deployments, which create and orchestrate both kinds of objects. A Salt master takes a commit ID from a Salt Git repo and sets up a Salt master that's later going to be used to deploy Kubernetes. Then we have different other Salt deployments: for example, a masters Salt deployment to run Salt highstates with the master role, and kube-nodes Salt deployments to set up the Kubernetes nodes, so that you can run kubelet and kube-proxy.

Then we have other features, as I mentioned, like Salt transactions, which take care of rolling updates. Because we run huge clusters, 3,000 nodes, you cannot run the updates in one shot everywhere; you have to stage it, right? So we support rolling updates. We created this feature, inspired by Kubernetes, called transactions. It breaks the node pool's compute nodes into different buckets based on the strategy you specify. We can do node by node, rack by rack, and many others; you can even plug in your own strategy. So we put the compute nodes into buckets and then run them stage by stage. That's our transaction.

After that, we have controllers; we have a lot of Salt controllers, and this is the breakdown. Basically, a Salt deployment is a one-to-one mapping to a node pool. If you remember, a node pool is a set of compute nodes that we provisioned with the same spec of operating system and host config. A Salt master deployment takes a specific commit ID from Git and deploys a specific Salt master that's going to deploy our Kubernetes.
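As a hypothetical sketch of that Salt deployment model (again, the field names are illustrative assumptions), the Git-pinned, transaction-aware spec could look roughly like this:

```go
package v1

// SaltDeploymentSpec is a hypothetical sketch: it pins a node pool to a Salt
// code base at a specific Git commit and describes how to roll changes out.
type SaltDeploymentSpec struct {
	NodePool    string      `json:"nodePool"`  // one-to-one mapping to a node pool
	GitRepo     string      `json:"gitRepo"`   // the Salt states repo
	GitCommit   string      `json:"gitCommit"` // exact code base being rolled out
	Role        string      `json:"role"`      // e.g. "master" or "kube-node", set via grains/pillars
	Transaction Transaction `json:"transaction"`
}

// Transaction describes the staged rollout, inspired by Kubernetes rolling
// updates: compute nodes are split into buckets and rolled stage by stage.
type Transaction struct {
	Strategy   string `json:"strategy"`   // "node-by-node", "rack-by-rack", or a custom plugin
	AutoResume bool   `json:"autoResume"` // advance bucket by bucket when health probes pass
}

type SaltDeployment struct {
	Name string             `json:"name"`
	Spec SaltDeploymentSpec `json:"spec"`
}
```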
That is the first one here, the Salt master. Then we have the role-equals-master Salt deployment, which deploys the Kubernetes masters. Basically, it sets the role in grains and pillars on the specific nodes and then runs Salt highstate, so that they install the API server and etcd one by one. We fixed the dependency ordering for the etcd cluster in the Salt states, and it also has the Salt transactions here. Once a Salt transaction is defined, the Salt deployment creates the Salt minion objects one by one, because for master nodes we mostly go one by one, for both setup and upgrades. Then you see Kubernetes nodes for different purposes (we have dedications, we have different workloads), so that you can set up the remaining Kubernetes nodes.

On the right side, you see there are a lot of controllers. We have different operators that do different things. For example, the Salt deployment controller creates the Salt transactions based on the rollout strategy, and we have the auto-resume controller, which proceeds bucket by bucket automatically based on the very important health monitor framework, which we also took from Kubernetes. Each bucket checks the probes and conditions to make sure it is ready, and then moves on to the next one. By doing that, you can take not one but a set of Salt deployment objects and roll the whole cluster, for both setup and upgrades.

This is an example of the Salt deployment object we have. Well, it's a little bit too small, sorry. You pretty much have the Git repo here, there's a Git commit ID, and then you have the Salt master setup bits and the Salt minion setup bits, so that you know specifically, from the Git commit ID, which Kubernetes version you're running and what your code base is. So this is, again, very declarative, and that is how we run our Kubernetes clusters.

OK, now we have pretty much gone through the functional bits of our fleet management system. If you've followed along, there are a lot of objects and a lot of CRDs, and you can almost tell that every bit can be put into Git, so we can run GitOps practically out of the box. Let's take a wider view from the bottom up. We build and run our own operating system using OSTree: the operating system is built from a JSON spec, it builds a tree, and you get a commit ID that represents the whole operating system's footprint. Then we have the provisioning system, which creates a lot of CRDs, with a lot of controllers to provision the bare metal hosts. Then you have Salt to install and configure Kubernetes. If you check all those CRDs into Git, you can run GitOps. And of course I missed one part here, which is compute and cluster lifecycle management; we manage the remediation policies from the compute node as well.

So pretty much you can have everything checked into Git, so that you have the whole fleet in Git. Part of that is the CRDs; the other part is your state, your code. You can connect things together with two or three Git repos and make everything live in Git. So our way of doing things is GitOps for DevOps: PR-driven operations, with everything declarative. That's the way we build and scale our Kubernetes footprint at eBay, so that you can build consistent and highly scalable clusters across the world. That's pretty much the GitOps part.
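To illustrate the PR-driven idea in the simplest possible terms, here is a deliberately minimal GitOps loop: pull the repo that holds the declarative objects, then apply them. This is a sketch under assumptions (the repo path and interval are made up), not our actual tooling, which has far more machinery.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// run executes a command and logs its combined output.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	log.Printf("%s %v: %s", name, args, out)
	return err
}

func main() {
	for {
		// Fetch whatever was merged via pull requests (PR-driven operations).
		if err := run("git", "-C", "/srv/fleet-config", "pull", "--ff-only"); err != nil {
			log.Println("git pull failed:", err)
		} else if err := run("kubectl", "apply", "-f", "/srv/fleet-config/manifests/"); err != nil {
			// Converge the cluster to the declared state (CRDs included).
			log.Println("kubectl apply failed:", err)
		}
		time.Sleep(5 * time.Minute)
	}
}
```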
Lastly, I'm going to talk about our features inspired by Kubernetes. We have a lot of them; we don't have much time left, but I'm going to quickly go through them. Our scheduler is similar to the kube-scheduler: it picks free assets based on your selector and your affinity and anti-affinity selectors. We've talked many times about the rolling updates, with the transactions, buckets, and strategies, which are pluggable. Then we have disruption budgets, so that you can limit, for a specific namespace or a cluster, how many nodes can be deleted or reimaged within a day or within an hour. That's something we took from the pod disruption budget, and we use it for our hosts: for upgrades, host releases, and OS upgrades. Then we have the health monitor framework, which is also inspired by Kubernetes. We put a lot of probes on our compute nodes, so there are a lot of conditions, and the compute node's liveness probe and reimage probe are the major driving points for our remediation controller, with which we manage our Kubernetes clusters' lifecycle. We have many other things we've done in Tess, and thanks to Kubernetes, it's possible for us to manage our fleet with a declarative system like this and make our lives easier.

Last is the summary, and pretty much everything is covered. I just want to say that with the scale and the kind of footprint we are running with Kubernetes, we need a lot of help. We have a great China team here in Shanghai, so if you are interested, please directly engage with our folks here; I have the QR code here. Thank you so much. Let's do Q&A.

No questions? [Audience question about open-sourcing the system.] So, a few years ago, I think at the Boston OpenStack Summit, an open source version was mentioned, but later we built too much eBay-specific logic into it, so there's no plan to open source it yet. The major driver for this talk is to go through the ideas of how you can use a Kubernetes system to manage large clusters like the ones eBay runs.

[Audience: Even for a small piece of the code, some small components like the remediation controller or something?]

Not that I'm aware of as of now. We have some other stuff that we are planning to open source, like the visualization bits: we have a plugin that can visualize all the CRDs and their data, and we are trying to work with the CNCF to open source that, but I don't have anything specific yet. If you're interested, please connect with me directly; I can answer any questions. OK, that's all for today. Thank you.