My name is Karthik Ghazala. I work in the Cloud Engineering team at eBay. Our team is responsible for all infrastructure services and platform-as-a-service at eBay.

My name is Xiao Gang. I come from Shanghai, China. I'm currently a Cloud Engineer at eBay, and I've recently been working on the next-generation containerized platform based on Kubernetes at eBay.

Today we're going to walk you through our journey in building a continuous integration platform on top of Kubernetes. We'll cover, in a nutshell, an introduction to our infrastructure and our applications and what they do at eBay, and walk you through our journey in building this platform. Xiao Gang will go more into the technology details. We'll cover what we've learned, what worked, what did not work, how we fixed things, and where we're looking to take this platform next.

I hope all of you know eBay. We've been around for some time, and I hope most of you have had an experience either buying or selling on eBay. We're a global company and one of the largest online marketplaces. We have 168 million buyers around the world. Quarterly, we do roughly $22 billion in gross merchandise volume; that's the total amount of transactions that happen in the marketplace. About half of it comes through mobile devices, and more than half is international, so we have quite a few customers across the globe. 88% of our sales are actually fixed price, and more than 80% are new items that are bought or sold on eBay. At any given point in time, you're looking at more than a billion items for sale on eBay.

A business of this size requires a very large infrastructure as well. We started our journey in building a private cloud about six years ago. We have a private cloud that does everything: containers, virtual machines, bare metal. All aspects of the data center are fairly automated.
We manage a fleet of about 80,000 bare-metal machines, more than 200,000 virtual machines, and roughly five petabytes of managed storage today that our cloud platform manages. What you see on eBay.com is a composition of about 4,000 different applications serving more than 100 billion requests a day. Our internal cloud platform hosts more than 95% of the traffic that comes into eBay.com. The applications written at eBay are very diverse in nature. Our development teams have a good choice of programming languages: traditionally Java, but a lot more on Node and Scala, and more recently Go. These development teams are continuously changing their code and releasing their changes to production, and continuous integration is a very critical piece of that. The platform that we're going to talk about in more detail today does roughly 10,000 builds a day, over 300,000 a month.

A few years ago, the continuous integration story at eBay was that there was a self-service way to get virtual machines, and every team would set up their own Jenkins instance on a virtual machine. They would find their own way to get the right set of plugins, configurations, and all that, right? That only went on for some time before we realized things were getting out of control. So a few years ago we built a Mesos-based continuous integration platform. We had two separate Mesos clusters for masters and slaves, and a PaaS layer that provisioned a Jenkins instance on Mesos, which users would get self-service. We deployed a shared file system so the configurations could be persisted. About a year ago, we made a strategic shift towards Kubernetes, and that's when we started the journey of rewriting the system to run on Kubernetes. Thanks to Carlos for his Jenkins Mesos plugin; I know he's sitting right here. I attended your talk this morning, very good.
Today we run our continuous integration completely on Kubernetes with Jenkins. We run two separate pools of nodes for the masters and the builder pods, with separate images for both the masters and the slaves. We use Services, we use Ingress, and we use Traefik as the ingress controller implementation inside Kubernetes. We use persistent volume claims, the Kubernetes construct, to persist the configurations for the masters. That in turn is backed by Ceph, one of our managed storage systems at eBay; we have petabytes of it forming the backend for the PVCs. It's been highly scalable and resilient for us. Now let's go more into the technical details of the architecture.

In the next few minutes, I will show you in depth how we built our CI-as-a-service platform on Kubernetes. Let's take a look at the architecture first. Before going deep into the architecture, please note that the big blue rectangle is the Kubernetes cluster and the light blue boxes are Kubernetes pods. The red lines in the diagram show the CI-as-a-service platform's control plane, while the blue lines on the right represent the data plane.

Let's start from the control plane. If a user wants a CI, he or she can easily post the third-party resource, a CIConfig specification, either through the PaaS platform at eBay or against the Kube API server directly, to tell the Kubernetes cluster: hey, I want a CI, and here is my definition. There is a CI controller running in the cluster which watches the CIConfig objects and converts the user's definition of a CI into a real CI instance. Here is the work the CI controller does. First, it creates a persistent volume which will be used as the Jenkins configuration storage. Then the CI controller spins up a Jenkins master pod mounted with this volume, and leverages a replication controller to provide high availability for the Jenkins instance.
Besides that, the CI controller also creates a Service for this Jenkins master, as well as the Ingress object which routes requests to this Service.

Now for the blue lines. When a build happens, the build request first reaches the ingress endpoint of the CI-as-a-service platform. There's actually no ingress controller provided by Kubernetes itself, but we found that the Traefik ingress controller is amazing and meets our requirements very well. The Traefik ingress controller routes the build request to the specific Service and eventually to the Jenkins master pod. The Jenkins master then leverages the Jenkins Kubernetes plugin — thank you so much — and talks to the Kube API server to spin up a Jenkins slave pod to do the actual build. This is the whole architecture of our system.

Before going on, let's think about a question: why do we need a third-party resource, CIConfig, with a controller loop to spin up the Jenkins master? Some people who are familiar with Kubernetes would say that's because it's model-driven. But is model-driven the real reason we use it? I would say no. Model-driven is only a methodology, not a solution. We should always think from the user-requirement perspective. So let's think about it: what does the user really want? In most cases, what the user wants is a CI, or a build, not a heavyweight Jenkins. A CI is nothing but a process from source code to manifest. It's not equal to Jenkins; Jenkins is only one of the build engines.

So let's look at the CIConfig model in detail: how we model the process from source code to manifest. We use a third-party resource in Kubernetes. We register a new kind; we call it CIConfig. The most important part is the specification. If users are using Git as their source repository, they need to specify the Git URI in the source section. Strategy means what kind of build engine you use. If you're using Jenkins, you need to specify the Jenkins master image and the volume size.
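To make the shape of such an object concrete, here is a minimal sketch of what a CIConfig third-party resource might look like, expressed as a plain Python dict. The API group, field names, and values are illustrative assumptions for this write-up, not eBay's actual schema; only the sections named in the talk (source, strategy, builders, hibernate) are modeled.

```python
# Illustrative CIConfig object as a plain dict. The apiVersion,
# field names, and values are assumptions, not eBay's real schema.
ci_config = {
    "apiVersion": "ebay.com/v1",          # hypothetical TPR API group
    "kind": "CIConfig",
    "metadata": {"name": "my-ci"},
    "spec": {
        # source section: where the code comes from
        "source": {"git": "https://github.com/example/app.git"},
        # strategy section: which build engine, and its settings
        "strategy": {
            "type": "jenkins",
            "masterImage": "jenkins-master:lts",  # Jenkins master image
            "volumeSize": "20Gi",                 # persistent volume size
        },
        # builders section: standard (platform image) or generic (user image)
        "builders": [{"type": "standard", "stack": "maven"}],
        # hibernate flag: controller tears down pod/service when True
        "hibernate": False,
    },
}

def validate(cfg):
    """Minimal sanity check of the illustrative spec."""
    spec = cfg["spec"]
    assert cfg["kind"] == "CIConfig"
    assert "source" in spec and "strategy" in spec
    return True
```

A user would post an object like this to the API server (or through the PaaS layer) and let the controller do the rest.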
There can be more and more other build engines besides Jenkins. In the builders section, we currently support two major build types: standard build and generic build. As Karthik mentioned a few minutes ago, there are four major tech stacks at eBay: Java, Node.js, Scala, and Python. As a platform service, what we provide to our users covers the most popular cases, Java and Node.js, which are almost 80% of the cases at eBay. We call this the standard build. Users do not need to provide their own builder image; they just declare their stack, for example Maven. For others, such as Python and Scala, and some very specific build requirements, we delegate the responsibility to our users so that they can build their own builder image and declare it in the CI configuration.

On the CI controller side, after the user posts the CIConfig to the Kube API server, the controller does the real provisioning of the CI instance: the PVC, pod, Service, and Ingress I just mentioned. We also support hibernation. Why? Because in our experience, over 60% of CIs are not actively used. Some CIs are created on day one but never actually used, for different reasons. As a platform, we want to make high utilization of our compute nodes. What we do is hibernation. When the hibernate flag is set in the CIConfig specification, the CI controller will tear down the master pod while leaving the persistent volume, because there is data in the persistent volume. The Service is also deleted, but the Ingress is left in place. What's the reason? I will talk more about hibernation in the following slides.

This is how the build really happens. When a build request is initiated, either by GitHub or a human being, the Jenkins master uses the Kubernetes plugin to spin up a Jenkins slave pod to do the build. The Jenkins Kubernetes plugin is nothing but a plugin which manages the Jenkins slave pod lifecycle.
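The controller behavior described above can be sketched as a single reconcile step: provision PVC, pod, Service, and Ingress for an active CI; on hibernation, delete only the pod and Service while keeping the volume and the ingress rule. This is a toy model under stated assumptions — plain dicts stand in for the Kubernetes API, and the `reactivation-service` name is hypothetical.

```python
def reconcile(ci, cluster):
    """Sketch of the CI controller loop from the talk. `cluster` is a
    dict of dicts standing in for the Kubernetes API: keys are
    'pvcs', 'pods', 'services', 'ingress'."""
    name = ci["metadata"]["name"]
    if ci["spec"].get("hibernate"):
        # Tear down compute, but keep the PVC (Jenkins data) and the
        # ingress rule, repointed at a reactivation service so that a
        # later visit can wake the CI up.
        cluster["pods"].pop(name, None)
        cluster["services"].pop(name, None)
        cluster["ingress"][name] = "reactivation-service"
    else:
        # Full provisioning: PVC, master pod, Service, ingress route.
        cluster["pvcs"].setdefault(name, ci["spec"]["strategy"]["volumeSize"])
        cluster["pods"][name] = {"image": ci["spec"]["strategy"]["masterImage"]}
        cluster["services"][name] = name
        cluster["ingress"][name] = name   # L7 path routes to this CI's service
    return cluster
```

Running it once with `hibernate: False` and again with `hibernate: True` shows the asymmetry: the pod and Service disappear, but the PVC and ingress entry survive.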
The plugin composes a Jenkins slave pod specification according to the pod template configuration and posts it to the API server. Then the Jenkins slave pod comes up and does the build. There are a bunch of specification details you can configure in the plugin, but here I just want to list the three we think are most important.

The first is the Docker image — the builder image. We recommend that our users compose all the build dependencies, such as libraries and packages, into the same builder image. A builder image should be self-contained, so that it can be used in CI or shared across different development environments.

The second is taints and tolerations, if you want your builder to be tolerated onto specific nodes.

The third is volume mounts, such as secrets. For example, a user might want to push their build manifest to a protected repository, which needs a credential. I'm sure they don't want to share the credential with you, right? They just need to store it in a Kubernetes Secret and mount the Secret into their builder pod.

Then there's the host path mount. Why a host path mount? I just mentioned that it is best practice to compose all the build dependencies into the Docker builder image. But if your project has too many dependencies, your builder image is going to be very large, maybe 10 gigabytes, so it will take a long time for Kubernetes to bring up your slave pod because it needs to pull the image. Some people would say that dependencies can be downloaded at build time or run time, as with Maven, but that still takes time. How could we solve this problem? We mount a specific host path from the Kubernetes node into the slave builder as the Maven cache, which means that on each node, only the first build needs to download the Maven dependencies; subsequent builds can reuse them. In this way, we save a lot of time. Okay, Ingress. Actually, without Ingress, there are other ways to expose your Jenkins master service to users.
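A minimal sketch of what such a slave pod specification might contain, built as a Python dict: the builder image, a hostPath mount for the node-local Maven cache, and an optional Secret mount for push credentials. The mount paths and names are assumptions for illustration, not the plugin's actual defaults.

```python
def slave_pod_spec(builder_image, node_cache="/var/cache/maven",
                   secret_name=None):
    """Illustrative Jenkins slave pod template: builder image,
    hostPath Maven cache shared by all builds on the node, and an
    optional Secret mount for registry credentials. Paths and names
    are hypothetical."""
    volumes = [{"name": "m2-cache", "hostPath": {"path": node_cache}}]
    mounts = [{"name": "m2-cache", "mountPath": "/root/.m2"}]
    if secret_name:
        # Credential stored as a Kubernetes Secret, never baked into
        # the image or shared with the platform team.
        volumes.append({"name": "creds",
                        "secret": {"secretName": secret_name}})
        mounts.append({"name": "creds", "mountPath": "/secrets",
                       "readOnly": True})
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "containers": [{"name": "builder",
                            "image": builder_image,
                            "volumeMounts": mounts}],
            "volumes": volumes,
            # tolerations for the builder node pool would also go here
        },
    }
```

Because `/root/.m2` is backed by a hostPath, only the first build on a node pays the dependency-download cost; later builds on that node hit the warm cache.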
For example, a load balancer VIP or a NodePort. But we ran into the following two major problems when we designed the CI-as-a-service platform.

The first: every CI instance needs a durable access point for the user. If we used a load balancer, that would be a huge waste of VIPs. As you know, load balancers are really expensive, and there would be only one pod, the Jenkins master pod, behind each VIP. That's a huge waste. By leveraging Ingress, we can use L7 routing so that we only need one VIP for the ingress endpoint.

The second: at eBay, it is not allowed to use a wildcard SSL certificate; it's not considered secure enough. And it's really hard to issue a specific SSL certificate for each CI on short notice. By enabling Ingress, we only need one certificate for the ingress endpoint.

Here is an example. cis.corp.ebay.com is our ingress endpoint for the CI-as-a-service platform, and the CI instance name is the L7 path. When a user visits their CI, the request first goes to the load balancer and then reaches the Traefik ingress controller. Here, TLS is terminated, and the ingress controller does the L7 routing according to the mapping from the CI name to the specific Service we created for that Jenkins master, and then routes the request to that Jenkins master in the cluster using the cluster IP.

Hibernation. As I mentioned, around 60% of the CIs are not actively used — for example, not used in the last 30 days. So what we do is hibernation. The CI controller deletes the Service and the pod to release the resources, but leaves the PV and the Ingress there. It's very easy to understand why we leave the PV: the Jenkins master data and configuration live in the persistent volume. But why do we leave the Ingress? Hibernating a CI does not mean we delete that CI; maybe the user wants to reuse it after 30 days. So we reconfigure the Ingress, changing the L7 routing from the existing Service to the reactivation service. That means if the user revisits their CI, the request will be handled by the reactivation service.
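The L7 routing decision described here is simple enough to sketch: one shared hostname (so one VIP and one TLS certificate), with the first path segment naming the CI and selecting its Service. The hostname and table are illustrative assumptions, not Traefik's actual implementation.

```python
def route(host, path, routing_table,
          ingress_host="cis.corp.ebay.com"):
    """Toy model of the shared-ingress L7 routing: all CIs live under
    one hostname; the first path segment is the CI name, which the
    routing table maps to that CI's Service. Returns None for hosts
    the ingress does not serve, or unknown CIs."""
    if host != ingress_host:
        return None
    ci_name = path.strip("/").split("/")[0]
    return routing_table.get(ci_name)
```

During hibernation the table entry for a CI would simply point at the reactivation service instead of the CI's own Service, which is why the ingress object must survive the teardown.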
The reactivation service does a very simple job: it just changes the hibernate flag in the CIConfig specification. Then the CI controller wakes up the CI by recreating the Jenkins master pod in seconds, mounting the existing volume, recreating the Service, and reconfiguring the Ingress routing from the reactivation service back to the new Service pointing at the real Jenkins master pod. In this way, within several seconds, users can get their CI back without any additional clicks.

Okay, that's all the details of our system, the CI-as-a-service platform. Next, I will hand over to Karthik for the learnings and the future roadmap. Thank you.

So, a few things, just to reiterate what we covered in detail. The CIConfig abstraction is a very helpful thing for us: if users want to use something other than Jenkins, they can use the same abstraction from a platform perspective. We run two separate node pools for masters and builders. The builders are where we use the host path mount to leverage the Maven caches, so the builds happen very quickly. One of the things we take very seriously at eBay is developer productivity, so we want to make sure that builds run as fast as they can, and whatever tuning we need to do to make them faster, we invest in that. The pod template is a very powerful concept; I think we saw much more detail today in the CloudBees session about how to use pod templates.

We've also built an internal monitoring system for this. We have roughly 2,000 Jenkins masters at any given time on our platform, and we've noticed some issues; sometimes Traefik does not catch the creation of new CIs. So we continuously monitor all our CIs to make sure they're available, and if they are not, we look into why. The last thing: we've been using persistent volumes for the CIs, and the volume attachment and detachment has gone through some issues in terms of reliability.
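The wake-up flow can be captured end to end in a few lines: a request for a hibernated CI hits the reactivation service, which flips the hibernate flag, after which the controller restores the pod and Service from the kept volume and repoints the ingress. This is a self-contained toy simulation; all structures are plain dicts and all names are illustrative.

```python
def handle_request(ci_name, cluster, configs):
    """Sketch of the hibernation wake-up described above. If the
    ingress entry for this CI points at the reactivation service,
    flip the hibernate flag (the reactivation service's only job),
    then simulate the controller restoring pod, Service, and the
    ingress route using the persistent volume that was never deleted.
    Returns the service the request ultimately reaches."""
    if cluster["ingress"].get(ci_name) == "reactivation-service":
        configs[ci_name]["spec"]["hibernate"] = False
        # Controller wake-up: recreate the master pod mounted on the
        # kept volume, recreate the Service, repoint the ingress.
        cluster["pods"][ci_name] = {"volume": cluster["pvcs"][ci_name]}
        cluster["services"][ci_name] = ci_name
        cluster["ingress"][ci_name] = ci_name
    return cluster["ingress"][ci_name]
```

A single visit is enough to bring the CI back, which matches the "no extra clicks" behavior from the talk.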
We've introduced fixes into the Kubernetes layer, as well as into the cloud layer under Kubernetes, to make that much more robust.

Some of the things we are actively looking into right now: today we run all of our CI platform on one Kubernetes cluster in one availability zone in our private cloud. We're finding ways to see if we can span multiple clusters, so that based on where resources are available, we can schedule the workloads across multiple clusters. The challenges we've noticed with PVCs keep bothering us, so we are looking into whether an object store is a good semantic we can use for persistence. We're experimenting with Swift; we have a huge Swift deployment in our private cloud today, a few petabytes of it. The other thing is declarative pipelines. We don't use much of the Jenkins Pipeline features that we saw today, but we're looking more into them. And one of the things we are also looking to do is leverage public clouds. We did a prototype this year to see if we can do our builds on public clouds seamlessly; that's an interesting area for us to look into next year.

That's all we have, and we are actively hiring. If you are interested in working on a platform of our scale, please reach out to me. My email address is right there. Thank you.