So, hello, everyone. Am I audible? Okay. Thanks for coming. I am Manivannan, and I have my colleague Soma Shekar with me. We are both part of the developer experience team at PayPal, where we build the end-to-end experience for all the developers at PayPal, and one of our responsibilities is to provide a continuous integration environment for them. In this talk I'm going to discuss how we started our continuous integration journey, how we used to operate, how we evolved over time, and how our efficiency improved using Mesos and Docker containers.

At PayPal, as you see here, we take continuous integration very seriously. Every line of code written at PayPal goes through the continuous integration process: the code is built, unit tests are run, automation tests are run on top of that, and only then does it go to production. Not a single line of code reaches production without going through continuous integration.

Before I go on, how many of you use Jenkins? Okay, that's good. And how many of you have tried Mesos? Okay. Thank you. So I should be careful about what I say; I have a lot of experts here.

Jenkins has been our continuous integration tool. We were one of the early adopters, and we have been using it for a couple of years now. Initially, we used a single Jenkins instance for the whole company. Developers would come there and create their jobs for integration or automation tests. This one Jenkins instance had around 40,000 runnable jobs and ran 30,000 builds every day, and we had different stacks at PayPal: Java, Node, Python, C++ and whatnot. This one Jenkins master had hundreds of slaves connected to it, and we had a custom build system at PayPal that could build 50 million lines of code in less than one minute.

This single Jenkins instance had a lot of limitations. It worked to an extent, but it was a single point of failure: when something went wrong, the entire company was affected; whoever was running builds would see them aborted and have to start over. It wasn't scalable, and change management was very difficult: to upgrade Jenkins or add a plugin, we had to restart the Jenkins instance, which affected all the developers. And there was no freedom for users, which is the main point: developers couldn't upgrade Jenkins or their plugins themselves. As for the nodes, I mentioned the hundreds of nodes connected to the Jenkins master on the previous slide. Jenkins did resource management across those nodes to an extent, but even when there were no builds running, the nodes stayed connected to the Jenkins instance and their resources sat idle, when they could have been used for other workloads. That wasn't possible with the single Jenkins instance. As for plugins, Jenkins is very good in that it supports plugins: if you want to customize your Jenkins instance, you can write your own plugin and put it on top. But most of the plugins available for Jenkins couldn't scale to the level at which we were using them.
We were one of the biggest Jenkins installations, with 40,000 jobs, so most plugins didn't scale. For example, we had an issue with the Cobertura plugin: it would run out of memory often, and we had to remove it.

So, due to the limitations of one Jenkins instance, we came up with another model where we had a dedicated virtual machine for each Jenkins instance. If you were a developer onboarding to PayPal and you wanted to create your own application, you would get a Jenkins instance on a virtual machine. That virtual machine had two executors; there were no distributed slaves, the master behaved as a slave with two executors and ran your builds. Users actually loved it, because they had the freedom to upgrade their plugins or their Jenkins instance. But as DevOps, what we saw was that resources were not properly utilized. At one point we saw that 2,500 virtual machines had been created like that. Of those 2,500, only 10% were really used, and even those ran builds only once or twice a day, which means the rest of the time the resources were idle. And if you think about 2,500 virtual machines, that's millions of dollars invested in hardware. So although this model solved the problem of freedom for users and removed the single point of failure, we still had the resource management issue: we weren't using resources optimally.

That is when we explored Mesos. With Mesos, what we did was create a lightweight Jenkins master with just 0.2 CPUs and 256 megabytes of RAM. It doesn't build anything by itself; it is just the Jenkins master, and the Jenkins slaves are also provisioned out of Mesos, using the Jenkins Mesos plugin, and only when there is demand. When a user clicks build on his job, or the SCM triggers the job, the job goes into the Jenkins queue. That is when the Jenkins Mesos plugin requests resources from Mesos, Mesos sends back resource offers, and if Jenkins gets an offer that matches the slave requirements, it creates a slave and runs the build. After the build is complete there is an idle timeout, and after the idle timeout the resources are given back to Mesos.

We used the Marathon scheduler for spawning the Jenkins master. The Jenkins master itself was a long-running task; it would be up all the time, and whenever it went down, Marathon would respawn it on some other virtual machine so it was always available. So with Mesos we were able to utilize resources optimally: the build slaves, which require more compute, are only provisioned on demand when a build goes into the queue. This gave us optimized resource utilization, and our operating cost went down by 10x. With 2,500 virtual machines we were using approximately 24,000 CPUs; with Mesos we were able to solve the same problem with 2,400 CPUs. We went from 180 terabytes of disk before Mesos to 18 terabytes after, and RAM also went down 10x. And if you look at the image, these guys are trying to recover from a tire burst; they are trying to change the tire. That is how we used to operate without Mesos.
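To make that Marathon setup concrete, here is a minimal sketch of registering such a lightweight Jenkins master as a long-running Marathon app through Marathon's REST API. The 0.2 CPU and 256 MB figures are the ones from the talk; the Marathon URL, app id, and command are illustrative assumptions, not PayPal's actual configuration.

```python
# Hedged sketch: register a lightweight Jenkins master as a long-running
# Marathon app. Marathon restarts the app on another VM if its host dies.
import requests

MARATHON = "http://marathon.example.com:8080"   # assumed Marathon endpoint

app = {
    "id": "/ci/jenkins-master-myapp",           # hypothetical app id
    "cmd": "java -jar /opt/jenkins/jenkins.war --httpPort=$PORT0",
    "cpus": 0.2,                                 # figures quoted in the talk
    "mem": 256,
    "instances": 1,
    "ports": [0],                                # let Marathon assign a port
}

resp = requests.post(f"{MARATHON}/v2/apps", json=app)
resp.raise_for_status()
print(resp.json()["id"], "submitted")
```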
Before Mesos, when the virtual machine hosting a Jenkins instance went down, people would raise a ticket, we would go in and provision a virtual machine for them, which takes a couple of minutes, then take the backup from the file system, put it on the new machine, and recover it, which actually took hours. With Mesos, recovery was like this: essentially instant. When the virtual machine hosting a Jenkins instance goes down, Marathon automatically respawns it on another virtual machine.

That worked well for us, but the problem we had was that the CI workload was too much for Marathon. The way we were using Marathon was one application per Jenkins instance. Up to 500 apps it was perfectly fine; it handled it. But beyond 500 apps Marathon used to go down. Even though it had high availability, we had issues keeping it up all the time, and we had to intervene manually and bring the Marathon quorum back up. That is when we went for Aurora. In our testing, Aurora was able to handle more than 5,000 jobs and stay stable, and when we replaced Marathon with Aurora in our cluster, we did it while the cluster was alive, with no user impact; we just replaced it on the fly.

So, this is the whole architecture. Sorry, what was the question? Yeah. Using either Marathon or Aurora you can schedule jobs; they can be long-lived, or there can be multiple instances of a job, including periodic jobs. You can use both Marathon and Aurora for the same purpose, but Aurora is richer in what you can specify, for instance what to do when a job dies. In that sense Aurora is richer; for our purpose, functionally, both Marathon and Aurora work. Just adding to that point: the Marathon community is very good, very active, and they add features quickly. For example, Docker support landed in Marathon well before Aurora. But for stability, for our specific use case, Aurora worked better than Marathon.

Okay, so now I'm just going through the architecture of how we do our continuous integration with Mesos. CI API is the application that acts as the REST interface for our customer-facing tool, which serves as the user interface. There a user can create an application, create a CI, and associate it with his application. Basically, when a user requests a CI through the user interface, the request comes to CI API, and CI API in turn requests a Jenkins instance from Aurora. Aurora is a scheduler that receives resource offers from the Mesos master. When Aurora finds a resource offer good enough to create a CI, it accepts that offer and asks Mesos to spawn the Jenkins instance; that goes down to the Mesos slave, and your Jenkins master is provisioned. We use ZooKeeper for high availability and Mesos master election; Aurora also supports high availability and uses ZooKeeper for leader election. Whenever the Jenkins master comes up, it talks back to CI API and says, hey, I am up on this host name and port, and CI API puts a mapping in nginx as a reverse proxy URL, so that users don't have to remember the host name and port.
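As a rough illustration of the Aurora side of this flow, here is what a minimal Jenkins master job could look like in Aurora's configuration DSL. Process, Task, Resources, Service, and the MB/GB constants are injected by the Aurora client when it loads a .aurora file; the cluster, role, names, and disk size below are placeholders, not the actual CI API template.

```python
# Hedged .aurora-style sketch of a long-running Jenkins master job.
# Aurora keeps the service running and reschedules it if its host fails.
jenkins_master = Process(
    name='jenkins_master',
    cmdline='java -jar /opt/jenkins/jenkins.war --httpPort={{thermos.ports[http]}}')

jenkins_task = Task(
    name='jenkins_master',
    processes=[jenkins_master],
    resources=Resources(cpu=0.2, ram=256 * MB, disk=2 * GB))

jobs = [Service(
    cluster='ci-cluster',         # placeholder cluster name
    role='ci',
    environment='prod',
    name='jenkins_master_myapp',  # one job per application CI
    task=jenkins_task,
    instances=1)]
```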
For them it will be something like ci.paypalcorp.com/<my-ci-name>. We use MongoDB for persistent storage, and shared storage to back up our Jenkins masters. If a VM goes down and the Jenkins master instance is spawned on another VM, it uses the backup from shared storage, which is taken periodically, to spawn the new Jenkins instance. For Jenkins, all the state is stored on the file system, so we just need to back up the Jenkins home directory to shared storage for it to recover later if there is a failover (a minimal sketch of such a backup step is shown a little further below). For the Jenkins slave use case: whenever a user triggers a build, the Jenkins Mesos plugin provisions a slave; if there are resource offers from the Mesos master matching the requirement, it chooses an offer and provisions the slave. Once the build is done, the slave's resources are given back to the Mesos master. That is pretty much how we operated with Mesos and Jenkins. Soma will cover the challenges we still had after Mesos and how we operated efficiently with Docker. Thank you.

Thanks, Mani. Good evening, everyone. Using Mesos we were able to attain efficiency; we could run the whole CI cluster with fewer resources in an optimal fashion, but there were still a few challenges. Let me go through them.

When you look at the picture on the top left, this is the most difficult problem every DevOps team faces. The developer comes in and says, my build works on my laptop but it's failing on the build system, and we have to figure out how to solve it. We either have to go to the developer's workstation and look at what's different, and in most cases what we found was that their environment differed from what we have on the build system. So that's one issue we had to handle. Even when we were using Mesos, there were certain resources that had to be shared among multiple tasks. For instance, when a couple of builds run on the same VM, they share the same Maven settings.xml and the same Maven repo, so they can run into each other, and we have seen cases where builds did run into each other. That was another issue. And PayPal being such a huge company, we have a very big developer base and a lot of technologies used internally, as Mani mentioned: Java, Node, Scala, C++, all of those stacks, and there are pockets of developers who also develop in Go. We have to provide the build system for all of those teams, but different technology stacks have their own requirements and the toolsets they want to use. There were cases where the toolset used by one technology wasn't compatible with another, so we had to resolve that, and in certain cases, when we pushed a particular tool to production, it actually broke the builds of a different technology stack. That's another thing we wanted to solve. The other issue was, as Mani mentioned, we were able to use the Mesos cluster efficiently, but we wanted Ubuntu 14.04, that was our requirement, with our own toolset, whereas some teams, like staging, have their own requirements and need their own set of tools.
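Since the failover story hinges on that periodic Jenkins home backup, here is a minimal sketch of what such a backup step could look like; the paths and the shared-storage mount point are assumptions for illustration, as the talk does not describe the actual script.

```python
# Minimal sketch of periodically archiving JENKINS_HOME onto shared storage,
# so a respawned master can restore it. Paths are illustrative assumptions.
import datetime
import tarfile
from pathlib import Path

JENKINS_HOME = Path("/var/jenkins_home")           # assumed Jenkins home
BACKUP_DIR = Path("/mnt/shared/ci-backups/myapp")  # assumed shared-storage mount

def backup_jenkins_home() -> Path:
    """Tar up JENKINS_HOME (jobs, configs, plugins) into the shared backup dir."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    archive = BACKUP_DIR / f"jenkins_home-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(JENKINS_HOME, arcname="jenkins_home")
    return archive

if __name__ == "__main__":
    print("backup written to", backup_jenkins_home())
```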
Because of that, we couldn't really run the whole workload on the same cluster. We tried: we could have a single Mesos master, ZooKeeper, and Aurora quorum, but the VMs where the workloads run had to be different. We tried to address this using constraints, but that was not really efficient, because we were still fragmenting the resources. The other thing was, as I mentioned, there are different technology stacks across PayPal, and if some technology stack wanted to push a new tool, it was a slow process: they had to open a ticket and ask us to deploy and test the tool. We would go ahead, test it, and make sure the functionality worked as expected, but at the same time make sure no other technology stack's build was breaking. That was a slow process, and we really wanted to be out of those decisions, because that's really something the framework teams should take care of. And the other issue was inconsistent build environments. With Mesos, in order to maintain the state of the VMs, whenever we had to make a change we used Ansible for change management. For most cases it worked great, but there were instances where some VMs were missing certain tools because the VM was probably down when the change was rolled out. So when there were issues, we had to go to the VMs and look at what was missing. There were cases where the latest set of changes hadn't been pushed, and in other, more extreme and rare cases, even older changes were missing. That was pretty distressing, and we wanted to fix all of these issues.

That's when we started looking at Docker and containerization, and after studying it, we decided Docker was the right tool for most of these problems. Not for everything, but for most of them. So what does Docker provide; why did we want to use it? One is task isolation. Once you run your build inside a Docker container, the build is constrained to the container itself; it doesn't have access outside the container, we restrict it, so it can't interfere with anyone else. That gave us task isolation. The other thing was, whenever we were doing builds, there was always some doubt about whether the VM was up to date. With Docker, we say explicitly, this is the particular Docker image we want to use for our build, and that's what we are using. If a team comes and says the build is not working, we can right away see what image is being used and whether it's the latest one or an older one, so we have a good understanding of what's wrong. In that sense Docker was very helpful. It also eliminates the dependency on the host, because you define your environment within the Docker image itself; you don't rely on the host at all, you define your own environment. It also helped with reproducibility. As I mentioned earlier, there were cases where the user comes and says, my build is not working on CI but it's perfectly fine on my laptop. With Docker we can reproduce the issue: the developer builds in a container from the Docker image with the right set of inputs, and if they can reproduce the issue, we can take the same container image and the same set of inputs and reproduce it on our CI system.
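To show what that reproducibility looks like in practice, here is a hedged sketch of running the same Maven build in the same image both on a laptop and on CI, using the Docker SDK for Python; the image name and workspace path are hypothetical.

```python
# Hedged sketch: run a Maven build inside the stack's Docker image so the
# environment is identical on a developer laptop and on a CI slave.
# Image name and host workspace path are placeholders.
import docker

client = docker.from_env()
logs = client.containers.run(
    image="paypal-ci/java-build:latest",          # hypothetical stack image
    command="mvn -B clean verify",
    working_dir="/workspace",
    volumes={"/home/dev/myapp": {"bind": "/workspace", "mode": "rw"}},
    remove=True,                                  # throwaway container per build
)
print(logs.decode())
```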
That's very helpful. The next one is portability. Since we use Docker images and containers, we don't really care what the host OS is; we can run the containers on any host as long as it supports Docker. The other good feature of Docker is immutability. Earlier we pushed changes through Ansible to the VMs, but that was not really foolproof. With Docker, if there are changes to push, I just create a new Docker image and configure the jobs or the CI to use it, and that's it: either I'm using the new Docker image or the old one, there's no possibility of missing something in between. That also gives me the opportunity to set up a homogeneous cluster with a single OS across the board and run different workloads on it. It could be the CI workload, the staging workload, or something else; as long as they define their Docker image, they are good. That was very easy for us to manage and helped us solve most of the issues we had before Docker.

Once we decided Docker was the way to go, we started dockerizing most of our applications, almost all of them. We dockerized CI API, which is our orchestration engine; it takes the CI provisioning request, creates the CI, and hands it to the user. We dockerized the Jenkins master and the Jenkins slave, so everything in Jenkins runs in Docker, we don't rely on the host at all, and it's very easy from a maintenance perspective. The Docker image we use for the Jenkins master is immutable; if we have to roll out a change, we just reprovision the CI with the new version of the Docker image, and that's it, we are sure all the changes are in there. It also eliminates the need to go into the VM and install tools; your Docker image takes care of that. It provides task isolation, so there is no way for one user to interfere with another. And last but not least, we can accommodate other workloads in the same cluster without worrying about compatibility.

Coming to dockerizing the Jenkins slaves, we were able to support multiple toolsets and multiple stacks, with the users and the framework teams providing their own Docker images. That's very helpful. It was easy to roll out changes, because I don't have to worry that a new tool for one stack will impact any other stack. And as I mentioned, with VMs, rolling out a particular tool was a very slow process, whereas with Docker the users can create their own images. For instance, Go is a small community at PayPal; they can create their own Docker images for their Jenkins slaves and build with them after a certification process on our end, which is largely automated. So the whole process got faster.

Once we dockerized all of our systems, we had to change our processes to accommodate it. For instance, earlier we kept SSH keys for GitHub access, tokens for DTR access, and other credentials on the disk itself. With dockerization we moved them off the disk and started providing these inputs at runtime, at build time itself.
We just inject them, and every few days we rotate the keys as well, so that even if they are compromised the damage is limited. And again, as I mentioned earlier, the JDK, Maven, and the rest of the toolset don't need to be installed on the host; they are part of the Docker container, and it's up to the technology stack to take care of them. They are responsible, and they have the freedom to make their own choices.

So with Docker we resolved most of the issues I mentioned earlier, but there was still one problem. At PayPal, development is really fast-paced, and users expect build times to be very fast. Look at a typical developer: he clones the repository for the first time and does a build, say a Maven build. The first Maven build downloads all the artifacts, and they get cached; any subsequent build only downloads the delta. For a typical Maven build, the Git clone might take about 27 seconds, and the Maven build, including downloading the artifacts from Nexus, takes about 3 minutes, so say the typical build takes 4 minutes overall. But if the developer does a second build with a minimal change to the code, he doesn't have to download all the artifacts or fetch the code again, and the build time drops from 4 minutes to less than 20 seconds. The developer expects the same performance from CI. The challenge is that because we are running on Mesos and using Docker, we are never sure where a build will end up: the first build can land on one machine and the second build on an entirely different one, so the clone and the artifact downloads have to happen all over again. That was not a great experience, and there were a lot of complaints from users, which is understandable.

To address this, we implemented our own solution. We have one of the biggest OpenStack installations; we call it C3, and C3 provides block storage, which OpenStack calls Cinder. We used Cinder to move the storage from machine to machine: we attach and detach the volume as the build moves from one VM to another. At the end of a build we detach the volume, and for the next build we attach it again, so the storage is still available and everything is cached (there is a small sketch of this attach/detach step a little further below). We solved this problem with Cinder, but we are still exploring open-source solutions like Flocker and REX-Ray; Portworx is another solution we are looking at, because we want a generic solution that works on our on-premise cloud as well as public clouds like Google Cloud or AWS. We should be free to move from one cloud provider to another if we want to. So we are exploring those options, but this is still an open issue we are working on.

With our system running on Mesos and using Docker containers, this is how the architecture looks. This is our control zone, where the Mesos master, ZooKeeper, Aurora, and all of our tooling sit. The data zone is where all of our Mesos slaves reside; everything here is Mesos, and everything that runs is dockerized. And we have other zones for our repositories and for databases. The way the flow works here is as Mani already explained.
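For the build-cache part described above, here is a hedged sketch of the attach/detach step, driving OpenStack through the standard openstack CLI from Python; the server and volume identifiers are placeholders, and this only illustrates the idea, not the actual C3 integration.

```python
# Hedged sketch: move the Cinder volume holding a CI's git clone and Maven
# cache (~/.m2) to whichever VM the next build lands on. IDs are placeholders.
import subprocess

def move_cache_volume(volume_id: str, old_server: str, new_server: str) -> None:
    # Detach the cache volume from the VM that ran the previous build...
    subprocess.run(
        ["openstack", "server", "remove", "volume", old_server, volume_id],
        check=True)
    # ...and attach it to the VM picked for the next build, so the clone and
    # downloaded artifacts are already warm when the slave container starts.
    subprocess.run(
        ["openstack", "server", "add", "volume", new_server, volume_id],
        check=True)

# Example (placeholder IDs):
# move_cache_volume("ci-cache-vol-42", "mesos-slave-17", "mesos-slave-03")
```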
Altus is our front end for our developer community, for app creation, for CI creation, and for deploying their applications to the staging environment; they go through Altus. For CI provisioning, they come to Altus and request a CI. Altus in turn calls our endpoint, ci.paypalcorp.com; this is the load balancer fronting our CI API cluster, which is dockerized, and it just forwards the request. The request goes in, and CI API takes care of it. CI API orchestrates the request: it works with Aurora, creates the Aurora job based on a template, and starts the Aurora job. Aurora in turn works with the Mesos master, gets the resources, and starts the Jenkins master. Once the Jenkins master is up and running, CI API defines a vanity URL, for instance ci.paypalcorp.com/<ci-name>, and provides this vanity URL to the users, who use it from then on. Even if a VM goes down or the CI has to move from one VM to another, CI API takes care of the orchestration and reconfigures the vanity URL, so the user doesn't have to worry about where the CI is running. Each CI works with the Mesos master, via the Jenkins Mesos plugin, to get the resources for its builds. That is where we are right now.

So I'm going to go ahead with the demo of how this works. I can go into the details of how we configured the Docker images, what Docker commands we run, and show how they are running. As I mentioned, Altus makes a REST API request to ci.paypalcorp.com. I'm going to simulate that request, since Altus is our internal tool and not accessible from here. This is a simple request to CI API: v1, ci, and then the CI name; I can just say 2. It is a POST request, and as part of it I define my application, the owner of the application, the technology the application belongs to, and the GitHub URL of the application. A response of 201 indicates that the request is accepted and is being processed by CI API. Using the same endpoint I can get the status of the CI with a GET request. When you look at the response from the GET request, you can see the status code is 131; 131 indicates that CI API is working with Aurora, starting up the Aurora job, and the compute is not available yet. If I go to the Aurora UI, I can see the Aurora job here; it's available. And let's see if it got spawned on Mesos. Yes, it got spawned on Mesos. I can look at the logs by going to the sandbox, and from there I can see what is going on in detail: it downloads the backup and tries to start up the Jenkins process; it's already starting the Java process. Let me go back and check the status of the CI from the CI API side with another GET request. I see it's 150, still in the provisioning state. While the provisioning is happening, let me show you the docker command we use for starting up CI API, our orchestration engine. We start the Docker container and provide all the inputs that are needed: the certificates, tokens, SSH keys, everything. We do this one time, as a manual step; this is how we inject the secrets. That starts up the CI API container. Let's look at the status now: it has changed to 180, which indicates that the CI is created and ready to be used. Now let me access the CI. We use SSO for authentication; it authenticates and lands me on the home page.
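For reference, here is a minimal sketch of the two REST calls from the demo: the POST that creates the CI and the GET that polls its status. The endpoint shape and field names are assumptions reconstructed from the description; the status codes 201, 131, 150, and 180 are the ones shown in the demo.

```python
# Hedged sketch of the CI provisioning calls from the demo.
# Endpoint and payload field names are illustrative assumptions.
import requests

BASE = "https://ci.paypalcorp.com/v1/ci"          # assumed endpoint shape
CI_NAME = "myapp-ci-2"

payload = {
    "application": "myapp",
    "owner": "jdoe",
    "technology": "java",
    "githubUrl": "https://github.example.com/jdoe/myapp",
}

resp = requests.post(f"{BASE}/{CI_NAME}", json=payload)
assert resp.status_code == 201                    # 201: request accepted

status = requests.get(f"{BASE}/{CI_NAME}").json()
# 131: waiting on Aurora / no compute yet, 150: provisioning, 180: ready to use
print("CI status:", status.get("statusCode"))
```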
As I mentioned earlier, these are the jobs we use for our continuous integration process. These jobs are triggered whenever there's a pull request, so we can look at the configuration: we provide all the inputs and configure it to run whenever a GitHub pull request is raised. This CI is also configured to run whenever there's a new commit on the base repo; it checks for a new commit every five minutes. When you look at the CI, the job has already been triggered for the initial base commit. Let me go to Mesos and see what's happening. When you look at Mesos, you can see that the CI has registered itself as a framework with Mesos, which means this CI will get offers, grab an offer, and start the build. The offers are not there yet, so let's monitor the task; it will show up when it's started. As you can see here in Mesos, the task gets started. This is the slave task, and the build uses this dynamically created slave, running inside a container, to run the build. So this is how the build happens, the whole end-to-end from CI provisioning till the build. That's pretty much it from our end. We can take questions if you have them.

[Audience question] That's a good question. We had a similar challenge building Docker images within Docker. Right now, as you can see here, we are sharing the socket used by the Docker daemon (see the sketch at the end of this Q&A); that's how we are doing it right now. We would like a better approach, and we are still exploring, but right now this is how it is.

[Audience question] Yes. Yes, that's still an open question we want to explore.

[Audience question] That was a problem for us. We started working on it about one year back, on 0.6.0, and we are actually working with Benjamin to resolve this issue. It's scalable to a large extent right now.

[Audience question] Because we started with Aurora and it's working pretty fine, we haven't gone back to exploring Marathon; that's a different track we are working on. We do want to explore it, but since Aurora is already in production, there is some inertia.

[Audience question] Right now we are not on Jenkins 2, but we will move to it soon. Mani can take this one.

Actually, right now we are using local disk to store all the file system state with respect to Docker. So even with Docker, we are backing up and then downloading and restoring it at the start of the Docker container itself. But we want to explore better options, for instance stateful containers. If that story is solved, and we are still exploring Flocker, REX-Ray, and Portworx, then we can just mount and unmount an EBS volume, if it's AWS, or any block storage. Exactly. That's still an open question right now, and we may move it up to object storage.

[Audience question] The benefit is not from a build perspective; the Jenkins master itself is tied to the application. Every application has a Jenkins master, and each Jenkins master has multiple jobs. That gives users the flexibility to have their own Jenkins plugins, or their own Jenkins Docker image for their builds. For example, if they want to upgrade to Jenkins 2.0, they can do it.

[Audience question] Right now we haven't seen this, and with Docker I don't see a reason to do it, because the only thing for which they would want different flavors would be the CPU and the RAM; otherwise it comes down to the Docker image. Yes, of course, you are right in the sense that it would be different Docker images for different jobs.
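To illustrate the socket-sharing approach mentioned in the first answer above, here is a hedged sketch using the Docker SDK for Python: the build container gets the host's /var/run/docker.sock mounted, so a docker build inside the slave talks to the host daemon rather than a nested one. Image names and paths are placeholders.

```python
# Hedged sketch of building Docker images from inside a build container by
# sharing the host's Docker socket. Names and paths are placeholders.
import docker

client = docker.from_env()
client.containers.run(
    image="paypal-ci/docker-builder:latest",       # hypothetical builder image
    command="docker build -t myapp:candidate .",
    working_dir="/workspace",
    volumes={
        "/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "rw"},
        "/home/jenkins/workspace/myapp": {"bind": "/workspace", "mode": "rw"},
    },
    remove=True,
)
```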
We haven't seen that use case till now, but it's a possibility. Right now, all of these Docker images are created by the respective technology framework teams, so it is rare for individual teams or users to want their own image, because we have streamlined stacks for, say, Java, Scala, or Node. The framework team responsible for those stacks takes care of the build environment, because that includes tools for, say, running Fortify.

[Audience question] So the deploy is actually a different thing. We build inside the container and create a Docker image that is published and subsequently deployed to the environment, the staging environment, on the same Mesos cluster.

[Audience question] Right now we are not providing that access to users, but we are looking at options, because using Docker for our CI is one use case; the other use case is users wanting to run their own jobs, say cron jobs or processes independent of the build system, and they want access for that. That's something we are exploring. I think we are out of time; we can take questions offline. Thank you, guys.