OK, let's get rolling. Today we're going to be talking about continuous integration, specifically spinning up and bringing on board a Zuul CI cloud. A little bit about me: in the US, I volunteer as an OpenStack Ambassador. For those of you who aren't familiar, that means I help run user groups, put together demonstrations, and basically bring donations into the OpenStack community; this is one of those projects. I also have a consulting company, started around '99, that provides various cloud consulting services.

So today we're going to talk about continuous integration using Zuul. We'll cover a little bit about Zuul, for those of you who aren't familiar with it; why Zuul exists and why it needs CI clouds more than other types of continuous integration; specifically, why we decided to build and donate a CI cloud and why the OpenStack community needs CI clouds; the roadblocks we encountered while building this cloud and the things that were specific to CI; how we use the various software components within the Zuul project to improve operations; and how we've since extended Zuul with new functionality.

First off, what is Zuul? For those of you who are familiar, it's from Ghostbusters: there's the Keymaster, and Zuul is the Gatekeeper. In OpenStack terminology, Zuul is what keeps track of what's coming into OpenStack, what source code gets in. Zuul itself is a software project that originated within OpenStack about three or four years ago. Since then, it's grown beyond OpenStack; it's now a standalone project that's used by a number of other software projects as their CI system. Zuul and whatever type of repository you're using to track your source code work together for the whole CI experience.
I'm not going to go into too much detail about Zuul specifically; I'll keep it at a high level. There are plenty of sessions over the next couple of days about what Zuul is. But as you look around, you'll find Zuul is actually used out there quite a bit. Wikipedia is using it: this is their front page and their Zuul status page. Their changes go through their instance of Zuul, and it's public facing, so you can go to their website and get the status of all the gating jobs in their Zuul environment. OpenLab, a Huawei-backed project, runs Zuul as well; they have their own instance and run various checks related to their projects. And there's Software Factory, a Red Hat initiative; they're using Zuul too. So those are just three, plus the OpenStack installation itself, and over the next couple of days you'll hear about some other private installations of Zuul.

For those of you who are familiar, this is the code review, check, and gate process within OpenStack, and there are some pointers here showing where Zuul has run. Any time a change gets submitted to OpenStack, Zuul sees that change and runs it through a number of gating checks. At the second arrow on the right, you can see it says "Zuul gate," and those are the CI checks that Zuul ran. Down at the bottom, you can see the whole timeline of when Zuul went through. In this case, this was a configuration file change: it went through Zuul, ran two CI gate checks, and completed both of them successfully. As Zuul runs, it keeps log files, and here I've circled some information about a node. A node is a virtual machine instance; it doesn't have to be a virtual machine, but with OpenStack, it is.
Later in the presentation, we'll talk about bare-metal instances with Zuul as well, but a node is a virtual machine that's used to run those CI tests. People are always asking: why are we using virtual machines? Why aren't we reusing virtual machines, or just using static hosts and running the jobs there? The idea behind spinning up a fresh virtual machine is that you get a pristine environment. You know the configuration, and you know nothing has been changed on it. So if you want to rerun a CI test, you just need a pristine VM: rerun the test, and you get, or should get, the same results you got previously. In the circled area, you can see which cloud provider the virtual machine came from, the IP addresses, and the operating system disk image. Also in the log file, you can see stats on the various checks that were done. In this case, it was a configuration file change, so a linter was run. For real software changes, a DevStack installation is run inside these nodes, and DevStack drives whatever software tests need to be done.

For those of you who haven't seen it, this is the typical Zuul status page for OpenStack. It lists all the checks that are happening and the chains of checks, and you can drill down: if you submit a job and it fails, you can drill down, see specifically why it failed, and end up at the log file to find out why your job failed.

For those of you in Jenkins environments, or who have used Jenkins, you're probably familiar with sequential continuous integration. In that typical setup, a change gets committed and the CI job runs. A second change gets committed, and it gets queued up until the first job finishes. So let's say it takes 10 virtual machines to run all the gate checks you want to run, and it takes two hours.
In that case, you have to wait two hours until the next check can run. But you're using a fixed amount of resources: 10 virtual machines at a time. If you have an eight- or ten-hour day, great, that means you can do five or six check-ins if they each take two hours. This is fine if you have a small community, but OpenStack is a continually changing project with a lot of people submitting changes. You can't do sequential continuous integration because you'd never get through all the changes. You want to get feedback to the developers as soon as possible: whether the job succeeded, whether it failed, and where it failed. You don't want developers waiting around. So this model doesn't work in the OpenStack world.

OK, how about the next model? We run things in parallel. As soon as someone checks in a code change, we immediately spin up a bunch of virtual machines and validate the change. This works great: a developer checks in some code, we immediately spin up 10 virtual machines, the checks run, and we send the result back. In the middle of that, a second developer checks in some code, so 10 additional virtual machines get spun up and his code change, which I'm calling change B, gets run. Same thing again, and now we're concurrently running change C, which is another 10 virtual machines. So in this case, we have 30 virtual machines running, and we're immediately getting results back to the developers on what passed and what failed. But we're using three times the resources: we've gone from 10 virtual machines running concurrently to 30. And at the end of this, do we really know that the combination of change A, change B, and change C all put together works? Have we run the CI checks across A, B, and C combined? Unfortunately, no.
That's because we've run code change A independently from code change B and code change C; we don't have a run of all three together. So while this model gives feedback to the developers quickly, it doesn't give us that final reassurance that all three code changes work together. But we're getting there.

So how about speculative continuous integration? A change gets committed, change A, and we immediately spin up 10 virtual machines and start verifying it. Then change B gets committed. But instead of validating change B by itself, we merge changes A and B together and run the CI tests against A plus B. At the same time, change C comes along, so we merge A, B, and C and run checks on that. Change D comes along, and we run the merge of A, B, C, and D. Now we've used quite a bit more resources: we're up to 40 virtual machines running concurrently, and we're checking all four changes. We're working on the assumption that everything is going to pass, because before this testing we've already done the one-off parallel testing of each change by itself: A individually, B individually, C and D individually. Now we're combining them, merged all together. The end result is that when the CI for the merge of A, B, C, and D passes, we know all that code works together. The downside, however, is that if any one of these fails, we have to throw away all the work further down the line. For example, if the test for the merge of A and B fails for some reason, then we have to kill the VMs and stop the CI work we're doing on the merges of A, B, C and of A, B, C, D. So this is great: it gets feedback back to the developers, it gets code merged in, and we're running in parallel.
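The speculative pipeline described above can be sketched as a toy simulation. This is an illustrative model only, not Zuul's actual implementation, and the function names are made up: each change is tested on top of everything ahead of it in the queue, and a failing change is evicted, discarding the speculative results behind it.

```python
# Toy model of speculative (dependent-pipeline) gating, in the spirit of
# Zuul's gate pipeline. Hypothetical code, not Zuul's real implementation.

def gate(queue, passes):
    """queue: change names in commit order.
    passes(stack): whether a merged stack of changes passes CI.
    Returns the changes that merge, restarting behind any failure."""
    merged = []
    pending = list(queue)
    while pending:
        # Test each change merged on top of everything ahead of it.
        for i, change in enumerate(pending):
            stack = merged + pending[:i + 1]
            if not passes(stack):
                # Evict the failing change; the changes behind it must be
                # retested without it (their speculative runs are wasted).
                pending.pop(i)
                break
        else:
            # Every speculative stack passed: merge the whole window.
            merged.extend(pending)
            pending = []
    return merged

# Change "B" breaks the build whenever it's in the stack:
result = gate(["A", "B", "C"], lambda stack: "B" not in stack)
# -> ["A", "C"]: B is evicted, C is retested on top of A alone and merges.
```

The wasted retest of C after B's failure is exactly the "throw away the work further down the line" cost described above.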
But we're using quite a few more virtual machines and more computing resources, and that's where the CI clouds that Zuul uses become important. We need a big pool of computing resources available to process all of these jobs coming through. In the OpenStack world, we depend on donations of cloud resources from public cloud providers. If you're a public cloud provider and you want to donate some computing resources, great: you provide credentials to the infrastructure team that runs Zuul, and they'll start submitting jobs to your cloud through Zuul, specifically through Nodepool, a software process within the Zuul project that handles communicating with the clouds and spinning up VMs.

As I mentioned, I run a couple of user groups within the United States, and Packet, a data center provider, had been providing me computing resources to stand up OpenStack clouds at those meetup sessions, so people could actually use OpenStack and try it all out. So they had an interest in donating more resources to the OpenStack community, and we thought, hey, running a CI cloud would be a great way to do it. Now, to do this, you contact the OpenStack infrastructure team, and they have a web page detailing all the requirements for donating computing resources. At a minimum, each virtual machine needs, as I've listed here, eight gigs of RAM, eight virtual CPUs, a public IP address, and 80 gigabytes of storage, and you need to be able to commit to at least 100 virtual machines running concurrently. If you're a private or public cloud provider and you want to donate, those are the minimum requirements. Packet was kind enough to donate the compute resources to make that a reality; we'll get to what they donated in a second. But here's a status page that shows all of the OpenStack public cloud providers.
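Those minimum requirements end up expressed in the Nodepool configuration for the donated cloud. A rough sketch of the shape (the names here are illustrative, not the real openstack-infra configuration):

```yaml
# Hypothetical nodepool.yaml fragment for a donated cloud.
labels:
  - name: ubuntu-bionic
    min-ready: 10               # keep a few VMs warm for incoming jobs
providers:
  - name: donated-cloud
    cloud: donated-cloud        # credentials looked up in clouds.yaml
    diskimages:
      - name: ubuntu-bionic
    pools:
      - name: main
        max-servers: 100        # the 100-concurrent-VM commitment
        labels:
          - name: ubuntu-bionic
            diskimage: ubuntu-bionic
            flavor-name: ci-standard   # 8 vCPU / 8 GB RAM / 80 GB disk
```

The `flavor-name` is where the per-VM minimums (RAM, vCPUs, disk) get enforced on the cloud side.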
You see a lot of big names there: Citycloud, Vexxhost, Rackspace. They're all providing instances across the multiple regions they run. Right now there are about 16 independent cloud providers supplying all of the computing resources. Nodepool doesn't have its own logo, so in the diagram I'm using the OpenStack infrastructure logo, the little ant, to represent it. Nodepool is the software component that's responsible for going out and talking to all the public cloud providers. It has the clouds.yaml file, so it knows all the credentials, and it talks to all of the clouds. It keeps track of how many virtual machines are running on each and how many it's allowed to run, and it always makes sure there's a pool of virtual machines spun up and ready to go, because there's a bit of spin-up time required, so that when Zuul needs an instance, one is ready and available. Nodepool keeps them up and running and registers them in an Apache ZooKeeper instance; Zuul then gets the IP address, connects in, and injects the CI job. In this diagram I'm just showing three public cloud providers; in real life there are over 16 now.

This is the Grafana page. It shows status across all of the cloud providers. On the left, you see "building": 65 virtual machines across all 16 cloud providers are currently in a building state. That means Nodepool has reached out to a cloud and requested that virtual machines be built there; depending on the cloud, that's a two- or three-minute operation. Once a virtual machine is built and running, it's put into a ready state, which means it's waiting for a CI job to be allocated to it. It's spun up, it's ready to go, and it's just waiting for Zuul to SSH in and submit the job.
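The credentials Nodepool uses come from a standard clouds.yaml file, one entry per donated cloud. An entry looks roughly like this (every value below is a placeholder, not a real endpoint or account):

```yaml
# Sketch of a clouds.yaml entry Nodepool would read (illustrative values).
clouds:
  packet-us-west:
    auth:
      auth_url: https://keystone.example.com:5000/v3
      username: nodepool
      password: REDACTED
      project_name: ci
      user_domain_name: Default
      project_domain_name: Default
    region_name: us-west
```

The `name` under a Nodepool provider's `cloud:` key is what ties the provider back to one of these entries.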
When a job is submitted, the node gets handed over and put into the in-use pool. Oftentimes you'll see the ready count up high, and then a number of CI jobs get submitted: the ready count drops and the in-use count goes up. At this point in time, there are 812 virtual machines in use across all of the clouds. Once a machine reaches the conclusion of its CI job, which typically takes less than two hours, the machine is deleted and the capacity is returned to the public or private cloud provider for general use. Then the process repeats itself: Nodepool sees that usage has gone down and that it has resources, and it starts building brand-new virtual machines again.

So, like I said, Packet wanted to provide more than what they were already providing to the user groups, and they decided, hey, let's donate some bare metal. They are a bare-metal provider, though, not an OpenStack provider. So they said, hey, John, can you get involved and provide the support for running the cloud on top of that? And that's where my role as one of the OpenStack Ambassadors came in. We spun up about 12 compute hosts, as well as some support machines, including a couple running Intel Optane systems. Each has 24 physical cores, 256 gigs of memory, and 2.8 terabytes of SSD storage. We brought them up, and the interesting thing is, I figure we can run about 360 virtual machines. Our limit, however, is IP addresses, IPv4 addresses, because Zuul and Nodepool do need to connect in to these virtual machines, and each one needs a floating IP address reachable across the public internet. We're planning on moving to IPv6 so we can raise that limit and get away from the limitations of IPv4. In the cloud we built, we ran all the typical OpenStack services, so you'll see Nova and Neutron.
However, we didn't run any persistent storage. The idea is that this is a cloud dedicated to continuous integration: we don't need to snapshot any of the VMs or store anything long-term. The VMs get spun up, run for a couple of hours to execute the tests, and at the end they get deleted and thrown away. We don't need block storage; it's all ephemeral storage. We also don't need to install things like Heat, and since it's all VM-based, we don't need any Kubernetes support or anything like that. This let us use the computing resources to run as many CI instances as possible, rather than spending them on controller services.

Interestingly enough, Zuul itself uses Zuul to manage its configuration and CI, so I'll just walk through updating the Zuul configuration files. Here we see a YAML file that lists out all of the required data; we'll get into its configuration in a little bit. Basically, the first check-in just added the new cloud so the Glance images could be brought online. Then they increased max-servers from zero to ten. You make the change, Zuul runs a linter on it, a CI test that validates that the configuration file is valid and has no syntax errors, it's approved by the core team, and then the change goes in. Nodepool sees the change to the configuration file, sees that max-servers is now ten, and starts spinning up VMs on the cloud. Then they increased it to 95, and eventually up to 100 machines. So now you can see that the packet-us-west cloud is online and listed on the dashboard. But that's day one, right? Day one takes us into day two and continued operations after that, and what sort of things did we run into?
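That ramp-up comes down to editing a single value in the Nodepool provider configuration, one review at a time. Roughly (a sketch only; the real configuration lives in the infrastructure team's repositories and has much more to it):

```yaml
# Only max-servers changes from review to review as trust in the
# new cloud builds up (illustrative fragment):
providers:
  - name: packet-us-west
    pools:
      - name: main
        max-servers: 10   # ramped 0 -> 10 -> 95 -> 100 across reviews
```

Gating the ramp through code review means each increase only lands after the linter passes and the core team approves it.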
It's been an interesting couple of months running this. One of the first failures we ran into was virtual machines mysteriously failing to spin up. It turns out the infrastructure team had switched formats from compressed disk images to raw images. Every night they run diskimage-builder to rebuild all the latest images, and those images are pushed out to the mirror servers; there's one mirror server located on each CI cloud. So on Packet we have a mirror server, the disk images are copied over there, and then they're pulled into Glance. When they switched to raw images, the way I had the cloud configured didn't leave enough disk space in that partition to hold them. The lesson, I think, is to keep in touch and communicate with your users so you know what sort of changes they're up to and can anticipate these types of failures.

The next issue we found is something CERN ran into as well: we're spinning up and tearing down virtual machines faster than OpenStack expects. I have a stress test I run, a Nodepool instance where each virtual machine lives for only two or three minutes, basically just to stress the control plane, and the quota system isn't able to stay up to date, at least with the default settings. What ends up happening is the quota gets exceeded and things grind to a halt, so you end up having to increase the quota even though you know that number of virtual machines isn't actually running. It turns out there's a setting in OpenStack that makes quota update its usage records more often, and like I said, CERN ran into this issue as well. When you're running a CI cloud, the number of instances being created and deleted is just much higher than you'll see in a traditional cloud.
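The quota tuning lives in Nova's configuration. Something along these lines, with the caveat that these option names come from the older, reservation-based quota system; treat this as a sketch and check the release notes for your version before applying it:

```ini
# Hypothetical nova.conf fragment for quota-churn workloads.
[quota]
# Force a full recount of usage from the database after this many
# reservations, instead of trusting incrementally-tracked usage.
until_refresh = 5
# Consider cached usage records stale after this many seconds.
max_age = 30
```

Lower values mean more database work per request, which is a poor trade on a normal cloud but worthwhile when VMs live for minutes.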
Like I said, each one of these instances typically lives for two hours, while in a typical cloud environment you'll see virtual machines live for weeks, months, maybe forever.

A couple of other issues we saw mainly had to do with configuration. Things weren't set correctly on individual machines, which took me down the path of making sure that every single physical machine added to this cloud is identical, and that I can't make changes by hand; you have to have an automated procedure. So what I started doing is adding and removing machines using Terraform. For those of you who are familiar with Terraform, you write a configuration file describing what you want your environment to look like. Terraform has a bunch of providers that let it speak different technologies, like AWS and OpenStack, and there's a provider that talks the Packet APIs. So I have a Terraform configuration file that lets me grow the underlying cloud by adding additional hardware, and also shrink the cloud by removing hardware. Say I have a new configuration; for example, we had an MTU issue where the maximum transmission units on the IP networking weren't set correctly and needed to be changed on each of the network devices. Rather than changing the network configuration by hand, which is kind of dangerous (change it incorrectly and you disconnect yourself from a box halfway across the country, and you're not getting back into it), I write up the change and put it into the configuration file. Then I add a new machine into the cloud: I run Terraform, Terraform sees that I want to add a machine or two, it adds the machine, and it runs my script to set up the underlying bare-metal machine.
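As an illustration, the Terraform side looks roughly like this. The `packet_device` resource is real, but the plan, facility, and variable names below are examples, not our actual configuration:

```hcl
# Illustrative sketch using the Terraform packet provider
# (now the Equinix Metal provider).
provider "packet" {
  auth_token = var.packet_auth_token
}

resource "packet_device" "compute" {
  count            = var.compute_count    # grow/shrink by changing this
  hostname         = "compute-${count.index}"
  plan             = "m2.xlarge.x86"      # example 24-core / 256 GB class
  facility         = "sjc1"
  operating_system = "ubuntu_18_04"
  billing_cycle    = "hourly"
  project_id       = var.project_id
}
```

Raising `compute_count` and running `terraform apply` provisions new bare metal; lowering it releases machines back to Packet.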
Then I add the machine into the OpenStack cloud, so I have the additional physical machine added on. And then I do the reverse: I take a physical machine and put it in maintenance mode, basically telling OpenStack not to schedule additional VMs onto it. Since this is a CI cloud and each virtual machine lives for about two hours, I slowly see the number of virtual machines on that physical node drain down to zero. Once it hits zero, I run Terraform again and take that machine out: the physical machine gets deallocated from my Packet account and returned to Packet, who then reimage the machine and make it available to another customer. It's like treating physical machines the way you treat virtual machines: rather than fixing a machine, you spin up a replacement and get rid of the old one. I've done this for new RAID configurations, changes to network configurations, changes to disk partitions, or even just to grow and shrink the cloud environment.

One of the things I'm looking at doing is automating this, so that if there's a heavy load on the cloud and physical machines are available, depending on the machine types, the cloud will be intelligent enough to update the Terraform configs, add the physical machines on, and grow itself. In the workloads, you see a very distinct pattern: Sunday evening US time, workloads peak, and then Friday night they drop back down to zero and the cloud sits idle on Saturday and Sunday. So maybe it makes sense to deallocate hardware Friday night and then add it back Sunday afternoon to bring the cloud back up to full capacity.

So, KPIs: key performance indicators. Nodepool provides a great graph here.
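The drain step uses the standard openstack CLI. Roughly like this (the host name is illustrative):

```shell
# 1. Stop scheduling new VMs onto the host ("maintenance mode").
openstack compute service set --disable \
    --disable-reason "draining for removal" compute-3 nova-compute

# 2. Wait for the short-lived CI VMs to age out (~2 hours), checking:
openstack server list --all-projects --host compute-3

# 3. Once the list is empty, remove the host from the cloud and
#    release the hardware back to Packet via Terraform.
```

Because no VM outlives its CI job, disabling the service is all it takes; there's no need to live-migrate anything off the host.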
From an operations perspective, it's hard to know when things are running successfully. The last thing I want is for someone from the OpenStack infrastructure team that runs Zuul to contact me and say, hey, John, the cloud isn't working, there's a problem. From here I can pull out some key values, and I've found these to be very valuable. Time to ready: that's how long Nodepool records it takes for the cloud to spin up a virtual machine. You want to keep this number as low as possible. When you start seeing it hit seven, eleven, twelve minutes, you know something's wrong; you want it below three minutes for the time it takes a virtual machine to start up. Above that is a red flag. The next one over is error node launch attempts: Nodepool is connecting in and trying to launch a virtual machine, the launch is failing for some reason, and Nodepool keeps track of the number of failures. You can see something happened here on October 2nd and 3rd, where the time to ready increased and the error node launch count increased as well. Some type of error, so that kicks off the next step: OK, let's dig in and find out specifically what's happening. The other metric I like to track is how long each API operation takes. This one is create server, and there's one for each individual API call; that tells you whether your controllers are responding quickly to API requests.

OK, so how do you catch operational issues in running this type of cloud? You don't want to rely on the Zuul team; they're busy running their own CI, and I'm responsible for making sure my cloud runs. Well, how about running my own Nodepool instance, basically just the component within Zuul that's responsible for spinning up and tearing down virtual machines? So I went through and did it.
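As a toy illustration of the time-to-ready KPI (made-up data and a made-up function, not Nodepool's code): given request/ready timestamp pairs, compute the launch durations and flag anything over a three-minute red line.

```python
# Toy illustration of the "time to ready" KPI: how long the cloud takes
# to deliver a usable VM. Data and function are hypothetical.
from datetime import datetime

RED_LINE_SECONDS = 180  # keep time-to-ready under ~3 minutes

def time_to_ready(launches):
    """launches: list of (requested_at, ready_at) datetime pairs.
    Returns (average_seconds, launches_over_red_line)."""
    durations = [(ready - req).total_seconds() for req, ready in launches]
    avg = sum(durations) / len(durations)
    over = sum(1 for d in durations if d > RED_LINE_SECONDS)
    return avg, over

launches = [
    (datetime(2018, 10, 2, 9, 0, 0), datetime(2018, 10, 2, 9, 2, 0)),   # 120 s
    (datetime(2018, 10, 2, 9, 5, 0), datetime(2018, 10, 2, 9, 12, 0)),  # 420 s
]
avg, over = time_to_ready(launches)
# avg is 270.0 seconds, with 1 launch over the red line: dig in.
```

In practice Nodepool emits these numbers as metrics and Grafana draws the graph; the point is just to watch the trend and alert on the red line.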
Nodepool has a pretty simple configuration file you can set up. What I did was write a configuration file that said: max hold age, how long a virtual machine can stick around for, five minutes, because this is just a stress test; max servers, I want to make sure I can hit 100 virtual machines, so 100; min ready, I always want 100 virtual machines ready, so 100; and then the rate, how quickly it should submit API requests. So: install Nodepool, install ZooKeeper, write a configuration file so it connects into the cloud, and just let it run as a stress test. Can your cloud actually run 100 virtual machines at the same time? How fast a rate until it keels over, and where does it keel over? Are you going to hit a quota issue where quota isn't updating fast enough? Or are you going to identify one particular compute node that's acting up, where you look at your OpenStack dashboard and see that node has one-tenth the number of virtual machines the others have? This is a great way to stress test the cloud without having to rely on the Zuul team. I'd run Rally previously, and for those of you who aren't familiar, Rally is an OpenStack tool for performance testing. But this way, by running Nodepool, I'm using exactly the same tools the Zuul team is using, so I get the same experience they're getting. The one word of warning is not to run this at the same time they're running their production workloads. For example, this requires floating IPs, and I have a limited number of them: while I have plenty of compute resources to run all the virtual machines, I don't have enough floating IPs. So keep that in mind.
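My stress-test nodepool.yaml looks roughly like this. The key spellings are from memory (what I call "max hold age" above is, as I recall, `max-ready-age` in the config), so check the Nodepool configuration reference for your version:

```yaml
# Sketch of a stress-test Nodepool configuration (illustrative names).
labels:
  - name: stress-node
    min-ready: 100       # always try to keep 100 VMs ready and waiting
    max-ready-age: 300   # recycle each VM after ~5 minutes
providers:
  - name: stress-cloud
    cloud: stress-cloud  # matching entry in clouds.yaml
    rate: 0.5            # seconds to wait between API calls
    pools:
      - name: main
        max-servers: 100 # the ceiling: can the cloud sustain 100 VMs?
        labels:
          - name: stress-node
            diskimage: ubuntu-bionic
```

With `min-ready` pinned at the same value as `max-servers` and a short ready age, Nodepool churns VMs continuously, which is exactly the load that exposed the quota problem.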
Or, if possible, you can configure your Nodepool instance to use private IP addresses, if you can do everything within a private IP address space.

OK, so the next thing: I had some users come up and say, well, we want to do continuous integration, but we have very specific hardware, or we need to do performance testing, and we don't want to run it in a virtual machine. Software for OpenStack? Great, run it in a virtual machine; you can run DevStack in the VM and test all you want. But we did run into some specific CI tests where we needed some sort of bare-metal continuous integration while still using Zuul. Nodepool has an abstraction layer within it called the Nodepool driver, and what it allows you to do is write your own Python code to talk to whatever technology you want. You could have it talk to AWS; there's an OpenStack driver, obviously; and there's a static driver, which basically keeps connecting into the same machine and reusing it. So what I did is write a Nodepool driver that allows it to talk the Packet APIs. What we can do now is have Zuul request resources and get credentials back, but the resource happens to be an actual physical machine, and Zuul can run its CI tests right there on the physical machine. People get the performance testing they want on bare metal, with access to whatever specialized hardware is on there.

OK, so in conclusion: donating to open source is rewarding, donating to OpenStack is rewarding, and there are a bunch of different ways to do it. Most people write source code or documentation, but here's another way: get involved by running a cloud, or donate the computing resources to do it.
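The driver idea can be sketched schematically like this. This is an illustrative model only; Nodepool's real driver interface is different and much richer. The point is that the core asks a driver for a node and gets back connection details, without caring what's behind it.

```python
# Schematic sketch of the driver abstraction: the pool manager talks to
# an interface, and backends (OpenStack, static hosts, Packet bare metal)
# plug in behind it. Not Nodepool's actual API.
from abc import ABC, abstractmethod

class NodeDriver(ABC):
    @abstractmethod
    def launch(self, label):
        """Provision a node for `label`; return its SSH endpoint."""

    @abstractmethod
    def cleanup(self, node):
        """Tear the node down and release its resources."""

class StaticDriver(NodeDriver):
    """Like Nodepool's static driver: reuses a fixed set of hosts."""
    def __init__(self, hosts):
        self.hosts = list(hosts)

    def launch(self, label):
        return self.hosts.pop()       # hand out an existing host

    def cleanup(self, node):
        self.hosts.append(node)       # return it to the pool for reuse

driver = StaticDriver(["198.51.100.7"])
node = driver.launch("bare-metal-perf")   # -> "198.51.100.7"
driver.cleanup(node)                      # host goes back in the pool
```

A Packet-backed driver would implement the same two operations by calling the Packet API to provision and deprovision a physical machine instead.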
If you're a cloud operator, I think Nodepool KPIs are a great way to get data back out of your cloud: time to ready, failure rates, API response times. If you're not already monitoring those, I found they were a great way for me to stay ahead of things. Automate: like I said, initially we ran into quite a few issues where we just didn't have all the physical hardware configured the same way, and automating some sort of bare-metal deployment of the physical systems helps reduce those errors. And Zuul is pretty flexible: you can use it for stress testing a cloud, you can use it for operational monitoring, and you can extend it through some of its abstractions to things like bare-metal CI.

OK, well, that sums it up. Thanks, everyone, for coming out. I hope you're having a great time here at the summit. If there are any questions, we do have microphones, and I'll be sticking around here afterward as well. I'm happy to post the slides.