Hi everyone, my name is Mani Vanan and I have my colleague Satya here. We are both part of the developer experience team at PayPal, where we build the end-to-end software development experience for all the developers at eBay and PayPal. One of our responsibilities is to provide a continuous integration platform for all the software engineers at eBay. Predominantly, we use Jenkins, and my talk is going to be about how we optimized the environment where we provide Jenkins to all the developers at eBay. So how many of you use Jenkins here? Almost everyone. So I should be careful with what I'm saying.

At eBay, we take CI very seriously. Every line of code that we write goes through Jenkins before it goes live. We run automation every day: unit tests, and all our QA tests are also automated. So everything runs in Jenkins before it goes live.

Before we talk about the solution we implemented, I just wanted to give some background about how we used to operate in the legacy setup. Initially, if you were an eBay developer and you onboarded to eBay, you would get a VM with Jenkins installed on top of it to do all your continuous integration work. The VM was self-contained: the Jenkins installed on it had two executors on the master itself to perform all the builds. This worked very well for the users, that is, for the software engineers at eBay. But as the developer experience team, we saw that resources were poorly utilized. At one point, around 2,500 VMs had been given out. Out of that, only 10% were really used; the others were sitting idle all the time. Even within that 10%, a VM was used for hardly 15 to 20 minutes a day. The rest of the time, the resources were sitting idle. If you think about the cost of maintaining these resources, it runs into millions of dollars. So we wanted to optimize how the resources were utilized, and that is when we moved to a cluster-based model.
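To put those legacy numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in the talk (2,500 VMs, 10% active, 15 to 20 minutes of use a day); the function and constant names are illustrative, and it assumes the active VMs are used evenly, which is a simplification.

```python
MINUTES_PER_DAY = 24 * 60

TOTAL_VMS = 2500        # VMs given out, as quoted in the talk
ACTIVE_PERCENT = 10     # share of VMs that were really used


def effective_utilization(active_percent, busy_minutes_per_day):
    """Fraction of total fleet time spent building (assumes even usage)."""
    return (active_percent / 100) * (busy_minutes_per_day / MINUTES_PER_DAY)


# Integer arithmetic keeps the VM count exact.
active_vms = TOTAL_VMS * ACTIVE_PERCENT // 100
print(f"Active VMs: {active_vms} of {TOTAL_VMS}")

for minutes in (15, 20):
    share = effective_utilization(ACTIVE_PERCENT, minutes)
    print(f"{minutes} min/day on active VMs: {share:.2%} of fleet time in use")
```

Under these assumptions, well under 1% of the fleet's time was spent doing builds, which is the gap the cluster-based model is meant to close.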
Before I get into the architecture: this solution is basically based on Mesos. So how many of you have used Apache Mesos? I'm lucky now. Apache Mesos is cluster management software. It is easy to configure and, of course, it is open source. When you have a bunch of machines, you run the Mesos master daemon on one and the Mesos slave daemon on a couple of others, and you build a Mesos cluster out of them. Mesos supports high availability with an active-passive model: at any given time, one master is the leading master and the other masters are passive. For leader election, it uses ZooKeeper.

With that, you have a Mesos cluster, and to utilize the power of Mesos you need something called a scheduler. The scheduler is what runs tasks on top of Mesos. The Mesos slave's job is to tell the master that it has resources: say, I have five CPUs and 10 gigs of RAM, and if you want to use them, use them. This is called an offer in Mesos terms. So the slave advertises that it has resources, and the Mesos master sends the offer to the scheduler that is registered. Marathon is one of the schedulers, and it is used to run long-running jobs.

At the end of the day, Jenkins is just a Java process. When you have Java and the Jenkins WAR, you can run Jenkins anywhere you have Java installed. And when you have Jenkins and you want to build, you want your Jenkins to talk to Mesos. That is where the Jenkins Mesos plugin comes into the picture. I am one of the committers on the Jenkins Mesos plugin, and Satya has also contributed a lot to it. Initially, it was not stable. But right now there are a lot of active contributors, and the plugin is very stable. With this plugin, you can configure a Mesos cluster to talk to, and Jenkins will spawn a build slave only when there is demand. When I say demand, I mean a build in the queue. So you go to Jenkins and click on Build Now.
Only when the build goes into the queue does Jenkins actually spawn a build slave, based on the configuration you have made in the Jenkins Mesos plugin, and it gives you a slave. There is an idle timeout. The slave does the build, and after the build is complete, the resources are given back to the Mesos cluster. So this is how we do the build part. In the legacy model, the executors were self-contained: each VM, I mean the master itself, had the executors. But in this model, we use a build farm in Mesos so that the build resources are taken only when there is demand. Thank you.

So this is the whole architecture of what we have, the current solution. The top layer is CIaaS, continuous integration as a service. This is the entry point: the people who want CI call the REST API, and it hands the request to the Aurora scheduler. The Aurora scheduler is for long-running jobs. I spoke about the build part in the previous slide, but we also provision the Jenkins master itself in the Mesos cluster. It is lightweight, and it doesn't have executors on itself. We spawn it with far fewer resources: 0.2 CPUs and 256 MB of RAM. Its job is just to run Jenkins, and the build part is done dynamically when there is a build.

Aurora is used to spawn the Jenkins master because it is a long-running job, in the sense that it has to be alive all the time. When the slave that is running the Jenkins master goes down for some reason, the master has to be spawned immediately on some other slave that is available in Mesos. That is the job of Aurora. It makes sure the process is running all the time, so the Jenkins master is up and running all the time. From Aurora, the request goes to the Mesos master. Mesos sends resource offers to Aurora, and when Aurora finds an offer that matches a pending request, it immediately spawns a Jenkins master. And we use an NFS mount to store all the Jenkins state.
Jenkins uses the file system to store its state; it doesn't use a database. So we use the NFS mount. Whenever Jenkins goes down for some reason and has to be spawned on another slave, that slave also has the mount, with the Jenkins home directory in it. So Jenkins is able to pick up everything the user has modified. Say the user has installed a new plugin or upgraded Jenkins, whatever: everything will be available, because the state is persisted in NFS storage.

So these are the results that we have got. The main goal of moving to this model was resource utilization efficiency, and we have got that. Instead of running one Jenkins in one VM, we moved to a model with lightweight Jenkins where we spawn 14 Jenkins instances in one virtual machine, with shared Mesos slaves and build slaves on demand, which I explained in a previous slide. All the technologies that we used for building this solution were open source, and we were also able to contribute back to open source.

Then there is the time to provision. When we had to provision a virtual machine itself, it took minutes, around 10 to 15 minutes. Right now, provisioning takes seconds. And it is highly available, because Mesos is highly available and the Aurora scheduler also runs in highly available mode. There, too, one is active and the others are passive. Not passive, actually: they proxy the request to the leading master. When the leading master goes down, one of the others takes up the leader role.

This also allows elastic expansion. When I see that there is a need for more Jenkins instances, I just need to throw some more virtual machines into the Mesos cluster. I mean, I have to configure some more Mesos slaves in the Mesos cluster.

And we have a piece of software, CI dash. This is for monitoring performance and fixing issues proactively. Say there is an issue: for some reason, the builds are in the queue for a very long time.
We get an alert, so we get to know about it before the user does. So this is for proactively monitoring and fixing issues. That is pretty much all I had. So if you have any questions.

I have a basic question. What's the difference between Aurora and Marathon? Schedulers on Mesos?

Yeah, actually both are schedulers. Both are for running long-running jobs. Aurora was actually started by Twitter, and right now both are open source. Marathon is managed by Mesosphere, and Aurora is a scheduler that is still managed by Twitter.

Yeah, but apart from the...

Yeah, apart from that, initially we started with Marathon because it had a very good REST interface to talk to. When you have to communicate with Marathon, there is a very good REST interface to create an app, manage or kill your apps, or do whatever you want. Aurora lacked that REST interface. It had only a command-line interface. But Aurora was more stable than Marathon. Marathon was growing fast. For example, they included Docker support immediately when Docker became available. At the same time, it was bleeding edge, actually. When new features like that were added, Marathon was crashing a lot of the time. So that's why we started with Marathon but later moved to Aurora.

Any more questions? Okay, thank you, Mani.

Thank you. If you are using Jenkins, I would suggest you try out the Jenkins Mesos plugin. And if you are interested, just join the community. If you want some feature, or if you would like to contribute, just jump in and you are welcome. Thank you.
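As a footnote to the talk, the resource-offer idea described earlier (a slave advertises what it has, and the scheduler launches a task when an offer covers a pending request) can be sketched as a toy model in Python. This is not the Mesos or Aurora API; the `Resources` class and both functions are hypothetical names used only for illustration. The offer size (5 CPUs, 10 GB) and the lightweight master size (0.2 CPUs, 256 MB) are the examples quoted in the talk, and CPUs are expressed in thousandths so the arithmetic stays in integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """A bundle of resources; hypothetical type, not a Mesos class."""
    milli_cpus: int  # 1000 = one CPU
    mem_mb: int


def fits(offer: Resources, request: Resources) -> bool:
    """True if the offer covers the request on every dimension."""
    return (offer.milli_cpus >= request.milli_cpus
            and offer.mem_mb >= request.mem_mb)


def max_instances(offer: Resources, request: Resources) -> int:
    """How many identical requests one offer could hold at most."""
    return min(offer.milli_cpus // request.milli_cpus,
               offer.mem_mb // request.mem_mb)


# Example offer from the talk: five CPUs and 10 gigs of RAM.
offer = Resources(milli_cpus=5000, mem_mb=10 * 1024)
# Lightweight Jenkins master from the talk: 0.2 CPUs and 256 MB.
jenkins_master = Resources(milli_cpus=200, mem_mb=256)

if fits(offer, jenkins_master):
    print("Offer accepted; upper bound on masters:",
          max_instances(offer, jenkins_master))
else:
    print("Offer declined; wait for a larger one")
```

The sketch is not meant to reproduce the figure of 14 Jenkins instances per VM mentioned in the talk, since the actual VM size was not stated; it only shows the matching step.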