I'm Eric Peterson, I work at Time Warner Cable, and this is Matt Fisher. We're also missing Clayton O'Neill today; Clayton wasn't able to make it, but we owe him a big thank you for putting a lot of this presentation together, proposing it, and doing a lot of this work. Matt and I are both principal engineers; I do a lot of the Horizon stuff, and Matt does a lot of the automation stuff as well.

So, a little bit of background about Time Warner Cable and our journey with OpenStack. We started in November of 2013 with a team of about four of us, so it's a fairly small team, and as you can imagine, when you're first starting with OpenStack you've got a lot of decisions to make: what is your network going to look like, what is your storage going to look like, picking out vendors, all kinds of different things you need to go through as you decide what your OpenStack deployment is going to look like. Another piece of information here: we went to production in July of 2014. Last year at Atlanta we were moving from, what was it, Havana to Icehouse, so there was quite a big move we were going through last year to upgrade everything and get it ready for a real production deployment. Something that's kind of interesting as well is that our CI/CD infrastructure didn't really start to mature and develop until we were actually in production. Doing it again, maybe we would have had that in place before we were fully in production. But it's important to know that once you get OpenStack into production, you need to figure out how to make changes to it, how to configure it, how to keep pushing new features forward, and how to do that in a way where you're not going to impact users, where you can keep things running smoothly and keep everybody happy.

Some numbers for our CI/CD stats. We've got over 4,300 reviews on our Gerrit instance; this is an internal Gerrit, the same kind of thing you use when you commit code upstream to the OpenStack review system. We've had over 300,000 Nodepool slaves spun up. These are individual machines that run jobs for us, anywhere from checking code, to compiling things, to doing some QA stuff, and these nodes can even be used during a deployment as well. We've done hundreds of deploys, and we have thousands of lines of Puppet, Ansible, and Python; I don't even know how we could count all the different stuff like that that we use. This line here, over 16,500 lines of upstream changes: that number is probably a little bit old, I think a week or two ago we were over 17,000. So we try to contribute back to the community quite a bit, for changes that we think everybody would benefit from. We try to share our work.

This is a little more of an overview of what we've got for our production, staging, and developer environments. You can see production and staging over there on the far right; they look fairly similar to one another, as they should. Over on the left we've got developer environments, which are more like a virtualized version of the staging and production stuff. And in the middle we've got Gerrit and Jenkins, the stuff we call our infra, which helps orchestrate all these changes. So I'm going to drill down a little bit more into what our production environment looks like.
So we've got two regions, which you can see in the large periwinkle squircles. That's a UI term, Matt. So we've got two regions and up at the top we... That's right, periwinkle and squircles, both, yep, I'm the Horizon guy. Yeah, so at the top we've got Keystone and Horizon. They share a global Galera cluster that's shared between both regions. You can tell we've got three of those nodes in each region; that seems fairly straightforward. We've got HAProxy, which fronts all of our API endpoints. And then we've got our control nodes. Our control nodes down there, that's really where a lot of the excitement happens; that's where Nova, Neutron, Cinder, Glance, all that good stuff is going on. You can also see we've got the clusters there for MySQL and Rabbit, and just note that those are only within the region; they do not span regions. We've got compute nodes and Ceph nodes down below, and we've got more nodes and more types as well; these are just the basic, common node types we refer to when we talk about different components within our cloud. And then at the far corners there, we've got our build servers. Those are our Puppet master servers. They also run a lot of our Ansible jobs and help coordinate things, and they work as Cobbler servers, so when we're bringing in new hardware, getting a new machine bootstrapped and ready to go, it works with our build server to get a lot of its information.

So stepping back to the previous slide that we talked about. First we're going to talk a little bit about this virtualized developer environment. You can see that the staging and production blue boxes there look about the same as what we've got over on the left-hand side. So we've got a virtualized version of our production cloud that each developer can deploy, use, and do whatever they want with. A lot of the developers will have a project; you can view the team member box up there, the squircle, as kind of an OpenStack tenant or project. A developer is going to stand up a lot of their infrastructure there, and they can even have two regions within their development environment.

A little bit more about our virtualized development environments: they're based on Vagrant, and there's an OpenStack provider for Vagrant to help you stand up VMs. One of the things I hope you're seeing here is that a lot of the environments are built with the same tools. So when you stand up your development environment, you're using a lot of Puppet and a lot of Ansible, the same kind of stuff we're going to use when we get to production. Developers should be using the same mechanics and the same tools all the time as they go from their development environment into our staging and into our production; all these things should be consistent. Team members can have multiple environments, as many as you want, and you can pick and choose which node types you want. That really depends on the task you're trying to do. If you're trying to change something within Keystone and you've got your own virtualized development environment, maybe you only need to stand up some Keystone nodes and a couple of other things, and you don't really need to be too concerned with compute nodes. So you can really pick and choose what you need in your virtualized development environment. The other thing is that they're shareable, right, for troubleshooting.
So if I have a problem in my environment and there's some crazy Puppet thing I don't understand, I can give it to Matt and say, go take a look at it, I don't understand this error.

The other thing we'd like to talk about a little bit is our Gerrit system. Gerrit is the same tool you use when you submit code upstream to OpenStack: you put code up for review, everybody else can take a look at it, and there are extra jobs that run on it. So we have our own instance of Gerrit, a lot like what exists upstream in OpenStack Infra. There are different benefits to doing code review, and I'm sure if you contribute code upstream you can vouch for a lot of these. It helps code quality: I know if I've got to put code up to get it checked in and somebody else has to look at it, I'm going to take a little more time to make sure I don't make too many obvious, terrible mistakes. The other things are mentoring, shared ownership, and also pre-merge testing, automated testing that looks at your change and runs some tests to say: does it look reasonable, is it going to do the things you expect it to do?

If you're going to run a Gerrit server like that, with this process, one of the things that's going to be very important is having your own Jenkins instance to run a lot of automated jobs. So we've got our own Jenkins instance, and the build slaves that support Jenkins actually run in our staging and production environments. So we're eating a lot of our own dog food here; we're our own consumers as well.

Another tool we use with Jenkins is Nodepool. I'm not sure if everybody knows about this, but Nodepool is an OpenStack Infra component that enables Jenkins to have as many build slaves as it needs; it can spin them up, it tears them down, all that kind of thing. So on that numbers slide from earlier, I think it was 300,000 build slaves: those are all served up through Nodepool. Nodepool continually gets a VM ready for Jenkins to go do some work with. One thing we noticed, though, was that Nodepool would try to reuse the same virtual machine over and over again, so we had to write our own Jenkins plugin to help with the scheduling of nodes and make sure they don't get reused. A node gets used once, it performs a job, whatever you need it to do, compile, whatever, and then it gets thrown away. That Jenkins plugin is on our GitHub repo, and it's offered for other people to take a look at, reuse, and do whatever you want with.

The other thing about Jenkins is what happens when you reach a certain scale. When you're first starting with Jenkins you might use the UI to build a couple of jobs; it's got a nice little GUI, you type some things in, it's fine. When you get to the point where you have hundreds of build jobs, that UI is no longer a tenable way to get work done. So we use Jenkins Job Builder. It's a YAML-based file format where you declare your jobs, and you can reuse a lot of the same macros and things like that. It's worked really well for us for creating a lot of different job types with a lot of reuse and a lot of commonality.
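To give a rough idea of that declaration style, here is a minimal Jenkins Job Builder sketch with a builder macro reused by a job template. The job names, node label, and commands here are hypothetical, not Time Warner Cable's actual configuration.

```yaml
# Illustrative only: a JJB builder macro plus a job-template that reuses it.
- builder:
    name: puppet-lint-check
    builders:
      - shell: |
          puppet parser validate manifests/
          puppet-lint --fail-on-warnings manifests/

- job-template:
    name: 'gate-{name}-puppet-lint'
    node: nodepool-slave          # run on a throwaway Nodepool slave
    builders:
      - puppet-lint-check

- project:
    name: puppet-modules
    jobs:
      - 'gate-{name}-puppet-lint'
```

The macro and template mechanism is what makes the "lots of reuse and commonality" possible: one builder definition can back dozens of per-repo jobs.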
Jenkins Job Builder has let us scale up to some pretty big, complex Jenkins setups. Our Jenkins Job Builder configurations work the same way: even when you want to change a job config, that generates a review. And what we check in to Git for our Jenkins Job Builder configuration files is authoritative. In other words, Jenkins perpetually goes back out to Git and asks, do I have the right job configuration set up, am I deployed and configured correctly? We've got all of that enforced and set up. So with that, it's on to Matt.

Okay, I'm going to talk a little bit about the automated testing we do. Eric alluded to a little bit of this earlier. The first is pre-merge tests. Any time a change set is uploaded to Gerrit, we kick off pre-merge tests, and we do this using the Gerrit Trigger plugin for Jenkins. This is, I think, how upstream works, and if you've done upstream code, you know that when you check code in, tests get run. We run some basic tests: puppet-lint, syntax checks, basic unit tests, tox tests; the type of tests depends, of course, on the repo you've uploaded to, whether it's Puppet code or Python code. One thing that's a little unusual that we do is Puppet catalog compiles and diffs. Puppet is a really big part of our process, and so we have a specialized tool that Clayton worked on a lot, and I want to go into more detail on that.

I know not everyone here is a Puppet person, so some basic background. The Puppet build server, the Puppet master, essentially compiles a catalog for every node. The catalog consists of the things that Puppet is going to do: packages that need to be installed, services that are supposed to be running, configuration options, things like that. To get that catalog, the input is Puppet code, which we have in Git; Puppet config, which we use Hiera for, which is also in Git; and something called facts. Facts are bits of information that are specific to a node, like an IP address, a hostname, how many CPUs it has, the manufacturer, things like that. We actually collect those facts every day and store them in a database. And so, since we have the code in Git, we have the config in Git, and we collect these facts, we can let Jenkins do a before and after catalog compile for every single node type in our environment.

So what does that get us? Any time you post a piece of code or a Hiera config change, Jenkins goes out and builds a bunch of catalogs from the current state of the system. Then it applies your change and builds all the catalogs again. Then we use R.I. Pienaar's, and apologies for the pronunciation, puppet-catalog-diff module to essentially generate a diff and tell us: Eric promised us that this Puppet change was only going to install one new package, and I see 30 other things in there, so we might have to have a conversation about squircles or whatever Eric likes to talk about.

This is an example, and it's a pretty basic one. This was from a massive refactor I did recently; I think it was over a thousand lines of code, most of them deleted, some of them moved around. What's interesting, if you look at the arrow, is that you can see "the following diff files were generated" and the list is empty. For a refactor, this is exactly what you want to see: no diffs, no changes. We see this and we're pretty sure we can just deploy it and we're not going to have any problems. And this was deployed a month ago, so I'm pretty sure it's working.
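As a sketch of the idea only, the before/after compile and diff could be wired up as a Jenkins job along these lines. The commands assume a Puppet 3-era master and the catalog_diff Puppet face, and the node list, paths, environments, and flags are illustrative guesses, not the actual job.

```yaml
# Illustrative only: compile a catalog per node type against the current code
# ("before") and against the proposed change ("after"), then diff the two sets.
- job:
    name: gate-puppet-catalog-diff
    node: nodepool-slave
    builders:
      - shell: |
          # node-types.txt would hold one representative certname per node type,
          # with that node's cached facts already available to the master
          while read node; do
            puppet master --compile "$node" --environment before > "before/${node}.json"
            puppet master --compile "$node" --environment after  > "after/${node}.json"
          done < node-types.txt
          # summarize what actually changed between the two sets of catalogs
          puppet catalog diff before/ after/
```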
For locally generated changes, diffs are pretty cool, but the real benefit is for external modules. We have over 70 Puppet modules, a lot of the OpenStack ones as well. We try to keep them updated at all times, but there are just hundreds of commits, and if you want to go update these modules, even if it's just ten of them, and you want to keep them up to date every week, you just don't have time to read every single diff. So we really rely on this catalog diff to tell us what the heck is going on. What are these 50 changes in puppet-nova? Are they going to break us? Let's go look at the catalog. You can see here that we actually have a diff list; we see this puppet-nova change is going to affect control and Keystone nodes. So let's actually dive in and look at a real diff from a control node and see what's in this change set.

This is the change set from a control node. There are two sets of changes here, highlighted in yellow. The bottom set is the default database collation type changing. This was discussed upstream on the Puppet mailing list and in meetings, so I knew about this change, and it was fine; we were okay with it. The other change is the novncproxy base URL. I'm actually not really sure what that is, and it might be something that would break us. I need to know: is it changing the default, or is it removing it, or what? So this is one where I'm going to go back to the module and look at the commit for the novncproxy base URL. But the other one I'm cool with.

So what are the pros and cons of this tool we've created? The first real pro for us is that we can validate all the code in all of our environments. We do a rolling deploy to dev, staging, prod. We don't want to find out that a change we made breaks prod when we're in the middle of a deploy; we want to be sure the catalog is going to compile on a production server and not have to debug it later. And if you compare this to simple syntax checks and unit tests, we get way more detail, more information, and it's a much more valuable tool for reviewers. In terms of speed, right now we generate before and after catalogs and diffs for over 50 node types, and we can do that in about four minutes. That's way faster than an integration test; it's not an integration test, but we don't want people waiting hours to be able to check code in, so this is a good pre-merge check for us. There are a couple of things it doesn't do. Puppet modules also tend to have a lot of Ruby code for providers, and this can't check any of that Ruby code. Also, and this one's very important to us, it doesn't tell you what will happen when a Puppet change is applied: it's not going to tell you that it's going to restart a service, it's not going to tell you that it might upgrade a package or install a new package. And finding out things like "Rabbit is going to restart" is something we need to know before we go to production.

That leads into the next kind of testing we do, which is our integration testing. About two months ago we started doing some basic integration testing. Using the virtualized environments, we basically build up our base node types: the Puppet master, which we call a build server, a load balancer, a Keystone node, a control node, and a compute node. We're basically just testing that these nodes build, and what I mean by build is: you spin up a raw VM and Puppet runs and lays down everything. So on a Keystone node, that's a full Keystone environment, full MySQL, and everything.
When Puppet is running on these nodes, Puppet is actually talking to the OpenStack APIs, so it does give you a reasonable sense that OpenStack is probably going to work on these nodes. But we're not really in depth with the level of testing we do here. We do this using Nodepool's multi-node slaves, and we do some parallelization here, but there are a bunch of dependencies: you can't bring up the load balancer node until the Puppet master's ready, and you can't run the control node until the Keystone node's ready. So we've written some Puppet code that brings these nodes up, and if Keystone's not available, it'll essentially poll until the Keystone node is ready to go. Right now this takes about 45 minutes. And again, it's post-commit, not pre-commit, which is not ideal, but we've actually found bugs this way and caught them before going to production. What we really want to do in the future is add a bunch more testing and be able to go look at the Puppet runs and figure out: this service is going to restart, this package is getting upgraded, those kinds of things. This runs every hour right now; what we'd also like to do is, any time someone makes a commit to master, let's say we queue up four or five of those, we run this test, and when the test passes it automatically gets deployed. That would be our eventual goal once we have more testing here. I do not understand the animations on this slide, sorry. You need some help with squircles.

We've covered our virtual environments for development. We've covered code reviews and Gerrit. We've covered our pre-merge and post-merge testing, where we use Jenkins and Nodepool. So let's talk about how we actually manage this code, for Puppet and for OpenStack.

Some Puppet config info: we have a Puppet master per region, per environment, so that means we have a prod East Puppet master and a prod West Puppet master, and the same for staging and dev. We're not using dynamic environments; we just run everything off a single master branch for now. We use PuppetDB minimally: SSH known-host keys and some Icinga exported resources go there. We use Hiera for environment-specific changes. This might be a prod versus dev change, or an East versus West change; East and West have different NTP servers, for example. For any secrets, like passwords and SSL certs, we use hiera-eyaml; we don't want unencrypted secrets in our Git tree. And r10k is what's responsible for downloading all the modules, installing them, and maintaining the right versions.

So how do we do that? We have a repo called PuppetConfig. This is a single master repo. One of the big pieces here, on the left, is our Hiera data. The other two important files in here are the Puppetfile and Puppetfile.yaml; once you combine those, that's really what controls how things get deployed and what gets deployed to a given environment. If you're not familiar with a Puppetfile, it's essentially a Ruby file that contains a URL to a Git repo and a tag or commit number, and tools like r10k just walk through it, essentially git clone the URL, and get you on the right version. I'm going to talk a little bit more about Puppetfile.yaml later; that's something we added on. First, this is what a Puppetfile looks like. I'm not going to explain all of it, but it looks pretty basic; you can read it and tell what it's probably trying to do. But what about Puppetfile.yaml? Why do you guys have that? What's it for?
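Purely as an illustration of the kind of file being described, an entry in such a Puppetfile.yaml might look roughly like the sketch below. The field names are guesses at the idea, not the actual format; as the talk notes, only the repo URL and the ref matter to r10k, the rest is metadata for humans.

```yaml
# Illustrative only: a single module entry with a pinned ref plus human-readable metadata.
modules:
  nova:
    git: https://gerrit.example.com/puppet-nova   # hypothetical internal Gerrit mirror
    ref: a1b2c3d4e5f6                              # pinned commit r10k will check out
    last_changed_by: someone@example.com           # who last bumped this module
    last_change: "Bump puppet-nova to pick up new config options"
```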
Like I said earlier, the Puppetfile is actually Ruby code, and so we added a bit at the top of the Puppetfile to dynamically load Puppetfile.yaml, so that we can have both of these. The reason we have Puppetfile.yaml is that in the old days, if you checked something in, if you updated a Puppet module, your change got accepted into master, but it wasn't going to get deployed until you made a second change in this repo, changing the repo pointer I talked about earlier. That's kind of annoying. It's actually really annoying. So we wrote an automated Jenkins job: any time a Puppet change is made and checked in, the Jenkins job goes back out and modifies Puppetfile.yaml. This file format is just a little easier for us to deal with. In addition, it allows us to place all this metadata in there, so we can see who's the last person that touched a module and what was the last change that landed, so we can tell what possibly broke if we're having a problem. In terms of the actual deployment, the only things r10k uses are the Gerrit URL and the ref; everything else is just for our purposes.

So that covers how we do Puppet. But what about OpenStack code, like Horizon, Keystone, et cetera? I'm going to let Eric cover that.

Yeah, deploying OpenStack code. So we run on Ubuntu, and we use the Ubuntu Cloud Archive repo for all the packages. When you're installing things with a Debian package, it's nice in some regards: it installs things, it manages versions, it gives you some default configuration, it'll start the service for you, it'll maybe set up the database in some cases, whatever. Some of those things are quite nice. Sometimes they're not as nice. In our case, we want to control whether a service is going up or going down, and when that happens. And if it's going to touch a database, we want to know when that's going to happen and whether it's going to be changing anything about the tables. So we've had quite a bit of success using the default packages from Ubuntu and Canonical, but they haven't always been completely ideal in every situation.

So the first thing we did is we tried to build our own version of Keystone. Matt's lucky: he's got a background in building Ubuntu packages, so he's very knowledgeable about this stuff, and this was a task he took on. But building your own Ubuntu packages, or Debian packages, whatever, is not without pain. You've got to chase down all the dependencies, you've got these archaic build tools, and it can become quite a time sink after a while. One thing we also used is pbuilder, driven by jenkins-debian-glue, which we found helped us make progress a little faster. So that all went pretty well: we were able to build our own Keystone packages and get those deployed.

And now it's time for me to maybe disclose that I've been working on Horizon for maybe three years, maybe a little more, and I'm familiar with doing Horizon a certain way. When I deploy Horizon, I usually put it in a virtual environment. You'll hear virtual environments and containers and all these similar concepts talked about quite a bit at the summit. When we deploy Horizon, we deploy it in its own virtual environment. So when we do that, Puppet sets up the Horizon base node, gets Apache installed, and does some basic configuration; that's all basic stuff done via Puppet. And then Ansible comes along.
Ansible will go out and set up a virtual environment for Horizon to run in, and then there's this last little tricky step with a symlink to actually deploy Horizon. So what does it look like? If I can show you an example, this is what one of our Keystone nodes looks like. You can look up there and see maybe 10 directories, something like that; each one of those is a Horizon deploy. As part of my deploy job for Horizon, I make sure the last 10 are retained. And when you look down there at the bottom, there's a "current", and that's a symlink pointing to the most recently deployed directory. By doing this, I'm able to have more than one version of Horizon deployed at a time. If I have a bug or there's a problem, I can switch that symlink out; I can have it done in a matter of seconds. I can go forward, I can go backwards, I can deploy things within a matter of minutes, if not faster.

The next thing we tried with virtual environments: we've started picking up Designate and deploying Designate in our environment. Designate was kind of this new thing we weren't really sure about, and when you're deploying a new component within OpenStack, the isolation that virtual environments provide is kind of appealing. So we've done that with Designate. The Designate work was largely based on the puppet-designate work that's upstream; that was a fantastic starting point to get us going. But we've forked it a little bit, carrying our own little patch to make it so Designate can run in its own virtual environment as well. This is a patch we're holding back right now, just because it adds quite a bit of complexity and it's not necessarily something the community might be ready to grab on to 100%. But we'd certainly welcome any discussions and would be interested in participating in moving it forward. The other thing about using virtual environments is that you're going to need to take on some more tooling too; you're going to need something like a PyPI mirror to help make sure you've got all the packages you need and can manage those things.

So I think we promised we'd actually talk about getting code onto the boxes at some point, so we're probably going to go into that. We've actually walked through our whole process now, and we're ready to talk about getting Eric's virtual environments or packages onto an environment. We deploy to a shared dev environment, in the middle of the screen there, five or six times a day, as needed. And then we do a boring weekly deploy that never has any problems. That goes to (my team's laughing) staging and production, generally weekly. We do those based off of tags: we make a tag and we deploy off the tag, so we know we're at a certain point in time, we know exactly what we're going to get, and we're not holding back master.

But how do we actually do this? Our starting point is Jenkins. Jenkins provides a great UI, and Jenkins is really used to drive Ansible. It gives us access control, audit control, and allows multiple people to share in looking at the output, what went wrong or what went right. We also use Jenkins pipelines to handle the two regions: press the button on one region, wait for it to finish, and then we go on to the other region. Ansible is actually there underneath; that's what Jenkins is wrapping.
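Going back to that Horizon virtual-environment deploy for a moment: a minimal sketch of the Ansible side might look roughly like this. The host group, paths, and requirements file are made up for illustration; this is not their actual playbook.

```yaml
# Illustrative only: install a release of Horizon into its own virtualenv,
# then flip the "current" symlink to it.
- hosts: horizon                     # hypothetical host group
  vars:
    release: "{{ lookup('pipe', 'date +%Y%m%d%H%M') }}"
    deploy_root: /opt/horizon
  tasks:
    - name: create the release directory
      file:
        path: "{{ deploy_root }}/{{ release }}"
        state: directory

    - name: install Horizon and its pinned requirements into a fresh virtualenv
      pip:
        requirements: /opt/horizon-requirements.txt   # hypothetical pinned requirements file
        virtualenv: "{{ deploy_root }}/{{ release }}"

    - name: point the "current" symlink at the new release
      file:
        src: "{{ deploy_root }}/{{ release }}"
        dest: "{{ deploy_root }}/current"
        state: link

    - name: reload Apache so it picks up the new code
      service:
        name: apache2
        state: reloaded
```

Flipping the current symlink is what makes the near-instant rollback described above possible: the old release directories are still on disk, so pointing current back at one of them is a one-step operation.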
Underneath Jenkins, Ansible updates the Puppet repos, gets all the new modules for us, handles node ordering, which I'll talk about later, and runs some pre and post health checks. You may remember Eric's discussion of this slide at the beginning of the talk. I'm going to walk through this slide piece by piece and tell you, when we have a piece of code, where it starts, who's first, and what the ordering is.

The first step is we turn Puppet off on every box in both regions. We let the current run finish and then we completely turn it off. We ended up having to do this because we learned a lesson, basically due to the shared Keystone/Horizon cluster. If you update one region, put new code on it, and run Puppet on your Keystone node, it might be doing something like changing an endpoint. In the other region, Puppet's still running with the old code, and it's going to change the endpoint back to what it thinks it should be. And back and forth; it's essentially Puppet arguing with itself. So the first thing we do is shut off Puppet everywhere, wait for all the runs to complete, and then we're ready to go.

The deployment starts on the build server; that's where the Puppet master runs. Ansible updates that PuppetConfig repo we talked about to point at a specific tag, and then r10k walks the Puppetfile we talked about and checks out all the right versions of all the modules we want; they go to the pins we want them at. After that, Ansible runs Puppet on this node to bring it up to date. Hopefully, if anything really horrible has gone on, it blows up now. This is a box that frankly we could lose and rebuild in 20 minutes, so it's not a problem if we completely mess it up. But build servers usually don't have any problems, so we move on to the next node.

We go to the load balancers next, one load balancer at a time. We don't want to do two at a time; we don't want HAProxy or keepalived restarting at the same time, and even better, if we break a node, we want to break one node and not both. Once that works, because load balancers usually always succeed as well, we go to the second load balancer node.

OK, now we get to the interesting part: we hit the first Keystone node. Keystone has this Galera cluster, which we talked about before, so we actually have health checks built into Ansible to check the health of the Galera cluster before and after the Puppet run. We want to make sure that if we've messed up Galera on this node, it doesn't spread everywhere else. Puppet runs, and the node's updated at this point. And we basically do the same exact process on the remaining two Keystone nodes, one at a time.

OK, more interesting things. The control node is very similar to the Keystone node; it's also part of a MySQL cluster, and we also want to run these nodes one at a time. We really want to make sure, especially here, because we might be changing something like a Neutron config, that everything is good. So we check status before and after, and we roll through these nodes. The compute nodes are actually pretty safe: there aren't a lot of packages or services on them, and we have a lot of them, so this is a case where we do run in parallel. I think we run 20 at a time, just because we don't have all day to make everything serial, and we haven't really had any problems with compute nodes. Then we do the Ceph nodes. We do these one at a time currently; we just haven't put a lot of time into the Ansible checks.
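A minimal sketch of what one of those serialized plays with pre and post checks might look like, assuming a Keystone host group backed by a Galera MySQL cluster; the group name, commands, and check are illustrative, not the actual playbooks.

```yaml
# Illustrative only: roll Puppet across the Keystone nodes one at a time,
# with a Galera health check before and after each node.
- hosts: keystone
  serial: 1            # never touch more than one member of the cluster at once
  tasks:
    - name: check Galera cluster size before the Puppet run
      command: mysql -NBe "SHOW STATUS LIKE 'wsrep_cluster_size'"   # assumes credentials from .my.cnf
      register: galera_before
      changed_when: false

    - name: run Puppet against the updated build server
      command: puppet agent --test --detailed-exitcodes
      register: puppet_run
      failed_when: puppet_run.rc not in [0, 2]   # 2 means changes were applied successfully

    - name: check Galera cluster size after the Puppet run
      command: mysql -NBe "SHOW STATUS LIKE 'wsrep_cluster_size'"
      register: galera_after
      changed_when: false

    - name: abort the deploy if the cluster shrank
      fail:
        msg: "Galera cluster size changed on {{ inventory_hostname }}"
      when: galera_before.stdout != galera_after.stdout
```

The serial: 1 setting is what keeps Ansible from touching more than one cluster member at a time, which is the behavior described above for the Keystone and control nodes.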
I think if we put more time into those Ansible checks, we could parallelize the Ceph nodes a little bit, especially as we grow our Ceph cluster. This is taking longer and longer, and that's something I'd really like to make better. I'm not showing you some nodes in the diagram: I'm not showing you Swift nodes, I'm not showing you our single node, and there are probably a couple of others, maybe Monasca nodes. We just ran out of space. Once that is complete, and this region is complete, we run some automated verification tests. At this point, if we've blown up this region, we still have a little bit of hope: we still have another region, and we don't want to blow that one up too. So we want to make sure everything's valid, VMs spin up, everything looks good. Assuming there are no problems found, you just go to Jenkins, press the button, and roll through the next region. I'm not going to go one by one, but it basically follows the exact same pattern. Again, when we finish, we do a full validation of both regions to make sure we didn't break anything.

We do have some options when we do deployments. We have the ability to deploy to specific nodes; that process you saw, walking one by one, can take a while, so we want the ability to say, this change only affects the build server, deploy both build servers and leave everybody else alone. Or if a deployment fails for some reason, we can say, we got halfway through, let's restart at this point and not waste time. We can also force Galera or Rabbit to restart; we've done this during upgrades or MySQL changes. We also have the ability to trigger a managed reboot, especially for updating the kernel: we'll update the kernel, reboot the box, and wait until the box comes back online before we move to the next box.

So that's our process. We actually have a lot of things we'd like to make better. The pre and post health checks that Ansible does are really cool, and we really just haven't spent enough time on them. The Rabbit cluster is something we have a lot of problems with; I'd love to know that we've blown up Rabbit before we've blown up the entire cluster and people start getting paged, which is the next inevitable thing once you blow up Rabbit. Ceph's the same thing, Ceph health checks; Brian's going to get those for me, and we'll find out if we've changed something in Ceph, and the same with Swift. The other thing I think our team would really appreciate is more parallelization, just to make this whole thing faster. Ceph nodes again, and probably Swift nodes as well, could be run in parallel; especially as you grow your cluster, serialization just becomes painful.

Finally, we really want to be able to do more targeted deploys. I talked a little earlier about my integration testing and about looking at the catalog and saying, this is changing and this is not. I would love to be able to tell from a deploy: this is only going to affect Swift nodes, and I'm 100% certain of that, and therefore we only have to deploy to Swift nodes. If we had that ability, the guy who owns our Swift cluster could do all the deployments to his cluster. We could say, yep, it looks good, he can do the deployments, he can be responsible for checking the health, he can manage all of that. Right now we just don't have that level of visibility, but we're trying to work towards it.

Finally, the release note process. What is in this deploy? What are you changing this week? Our manager asks us that all the time. This is a painful process that our team has to do.
It involves looking at JIRA, reading Git logs, and harassing engineers who didn't properly make tickets, and it could really use some automation tooling.

OK, what you've seen today is our entire process from beginning to end: you spin up a virtual environment, you make a code change, you submit it for review, pre-merge testing, integration testing, and then deploy to the shared dev, staging, and prod environments, and finally verification testing at the end. So hopefully at least some of this process is interesting to you, or maybe you have the same problems we do. So that's it. If there are any questions, we'd be happy to answer them. I know you have a question.

"Can you put the slides up anywhere yet?" I have not. I'd be happy to put the slides somewhere for you, Med. I think you're thinking of Mike Dorman. We definitely use a Puppet master. Yeah? Yes, that's fine. But we do use a Puppet master. Yep.

"You talked about Ansible and Puppet at the same time. Can you clarify which tasks you use Puppet for and which you use Ansible for?" That's a really popular question. We feel we use each tool to its strengths. We use Puppet to configure servers, to provide things that need to run automated on a regular basis. We use Ansible for things that are more like, I want to do this thing: I want to do a deployment, and Ansible goes out and manages the deployment. We also use Ansible for things like upgrading OpenStack, and for managing and upgrading databases. We also have a bunch of one-off Ansible jobs; for example, if we have a compute node die, we have a way to evacuate all the VMs off the node safely, that kind of thing. If you need to do things in a certain order, Ansible's pretty good at that. If you need to make sure things are in a certain state, Puppet is good at that. Yes?

"You use Job Builder, which is awesome. What do you do to make that master a little bit more stateless in terms of shipping logs out, if that's your dashboard, to let you know what's rolled out? How do you sort of escrow those logs? Have you tackled that problem yet?" We have not, I don't think. Sorry, what was the question? "So you let Jenkins, configured by Job Builder, orchestrate all the things?" Yeah, we use Jenkins Job Builder to configure Jenkins, yes. "Do you do anything to get those logs off of Jenkins, to let you know what's on the runway and in what state?" We should, because they fill up, and you have to basically trim how many you keep. I'd like to do what Infra does, which I think is to push all the logs off to another server, Swift or something. "OK. And then as far as shipping relocatable virtual environments, what do you do for the PyPI dependencies? Do you just run an internal PyPI mirror?" Yes, but we don't mirror all of PyPI; that's a lot. We mirror, for example, what you need for the Designate virtual environment, and for Horizon we mirror what you need for Horizon. We've got a certain repo, and if you check a text file in there, we'll treat that as a requirements.txt and build a repo based on what those requirements are. You'll want to provide a pinned-down version, so you do something like a pip freeze, check that in as a text file, and then we'll build a repo for you based on that, and we'll keep all those wheels around. That's one more.
So the question was: in the shared hardware dev environment, do we deploy the tagged version that's about to go out weekly? We actually don't, and I think we probably will; especially with that integration test, we could go into Jenkins and say, run an integration test with this tag, and then we could go back and say, we expected these things to happen, and that's definitely what we see in the logs. That's something we want to move to. But generally, the shared dev is kind of ahead, because we tag off and then maybe we don't deploy until the next day; we don't want to make everyone wait. We deploy master to shared dev five or six times a day. Right now it's ad hoc, but eventually I think it'll just be on a per-check-in basis.

For the benefit of the recording for YouTube, if anyone has another question, please pose it from the audience mic, or have the speaker repeat it. Thank you. Anything else? Oh man, it's a lot of jumping.

"Yeah, I've actually got one more. Do you guys ever have to tackle the problem of shipping a patch that hasn't landed upstream yet? And if so, how do you do that?" Yes, all the time. We do that all the time. That's right. We use git-upstream, which is a tool that's by, I think, the Infra team. It allows you to more or less mirror a repo that's upstream, and then you can carry a local patch. And it's got some magic in it: when you check in your local change, if the change ID matches what's upstream for review, it'll automatically merge them later and collapse those changes down. So it's kind of like you can fork for a little while and then go back onto master. It gives us the ability to carry patches like that, and we have those jobs set up for all our public modules and for the interesting OpenStack modules we do a lot of work on.

"Do you know if git-upstream accommodates a patch that's forever going to run at my site but never going to land upstream? I have some site specificity that accounts for that." Yeah, and that's what it does. It'll try to carry your changes, always kind of at the tip of whatever the master branch is, trying to rebase you to the top. As long as that's successful from a Git point of view, you could hold it forever. Cool.