Hello, everybody. My name is James Blair, and before I talk to you about Zuul, first I'm going to show you a little bit of ANSI art. I work for Red Hat, on the OpenStack project infrastructure team. I'm particularly focused on the CI system, which runs a program that we wrote called Zuul, the next version of which is based heavily on Ansible.

So with that out of the way, here's what I'm going to cover today. I'll tell you a bit about the current version of Zuul, version 2, which is what we're running in production right now for the OpenStack project. I'll also talk about the next version, which is in development, we call that version 3, and give you a preview of some of the features we're adding and some of the differences between version 2 and version 3.

So who here, by a show of hands, has submitted a patch to an OpenStack project? And who would like to in the future? OK. I know that's an ambiguous question; some of the people who raised their hands the first time did not raise their hands the second time. OK, that's helpful.

If you've submitted a patch before, you might have seen this in action, and if you saw the keynote yesterday, you probably know some of this as well. At OpenStack scale, our CI system is huge. We have 1,500 Git repositories. At peak, we launch more than 2,000 jobs per hour. Over the course of a month, we merge 10,000 changes across those repositories. And our CI system runs in 20 regions across nine different clouds. At least I think that's still the case, because I don't think we've quite reverted the patch that Jonathan Bryce pushed up yesterday.

So if you're a developer, you've seen this before. Actually, you may not have seen this before. When you submit a patch to Gerrit, there's an interface that looks like this for me, but most other developers use the web interface, so it actually looks like this. This is the primary interface to Zuul that we want developers to see. We want Zuul to stay out of the way, to sit in the background and serve developers, and not be all, hey, I'm Zuul, look at me. When a developer submits a patch, and when reviewers go to review that patch, they see a page that looks like this, with the commit message, and, down off the bottom of the screen, the messages from reviewers. But if you look in the bottom right-hand corner, there's a little box that lists the jobs that ran on that change and their status, and those are hyperlinks to the logs for those jobs. That's all we really want users to see most of the time. I realize that doesn't look super cool, but like I said, we're here to serve.

Having said that, there actually is a lot going on in the background, and Jonathan brought this up on stage at the keynote yesterday; you might have seen it. This is the Zuul status page. It shows all of the jobs that are running in all of the different configurations for all of the changes that are in flight, how much time is left on them, that sort of thing. So you can see there's quite a bit going on in the background. If you zoom out a little bit, it looks like this, because, as I mentioned, it's a really busy system, and I couldn't even fit it all on that page, so it still goes off the bottom. We also have a complementary program called OpenStack Health, which lets you dive deep into the results of jobs that Zuul has run.
So we have a lot of different ways of exposing this data to users and developers if they want to dive into it, but like I said, generally we try to stay out of the way as much as possible.

This is a diagram of Zuul's architecture, so that you have an idea of the parts I'm going to talk about in a little bit. As mentioned earlier, a developer, typing on a computer with an eerie green screen, sends a patch up to Gerrit. Zuul listens to Gerrit's event stream and responds to various events as it's configured to do, based on users pushing up changes, reviewing changes, and things like that. Zuul is a distributed system with a number of different components. Once it gets a change from Gerrit, it asks its merger to pull that change down and collect some information about it. In OpenStack, we run eight different Zuul mergers, and those are basically servers that sit there all day, every day, doing nothing but performing Git operations. They keep pretty busy. Once Zuul learns what it needs to know about the change from the merger, it knows what jobs it needs to run. Those jobs have certain requirements as to what nodes they need to run on, so Zuul talks to Nodepool, which is the component that talks to those 20 regions across nine clouds and spins up OpenStack instances to run those jobs. Nodepool then hands those instances over to a component called the Zuul launcher, which runs Ansible, and Ansible connects to the nodes that Nodepool provided and actually runs the jobs.

Nodepool is worth talking about for just a minute, because I'm going to spend the rest of the time talking about Zuul. Nodepool is a separate program that works very closely with Zuul; it has a different name, but we consider it part of the same suite of programs, and they're very tightly integrated. In addition to spinning up nodes, like I just described, once a day it builds new base images and uploads them to all the different clouds. That's something a smaller CI system may not need to do; you might just be able to use the upstream image provided by your cloud provider or your distribution. Since we're running across so many different clouds, though, the little differences between those base images make a big difference to us, so we normalize that by building our own images. We also cache a bunch of things that our jobs are going to need on those images, and we upload them every day to all of the clouds so that they behave consistently. (There's a rough sketch of Nodepool's configuration below.) This system exercises OpenStack very heavily: if you remember that 2,000 jobs per hour number, since each of our jobs runs on a VM that is created just for that job and then destroyed, we are, across those different clouds, creating and destroying 2,000 OpenStack virtual machines every hour. So it's quite a useful stress test for OpenStack.
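To make Nodepool a little more concrete, here's a rough, hypothetical sketch of the kind of configuration it consumes. The label, provider, and image names are invented for illustration, and the exact schema differs between Nodepool releases, so treat this as a sketch rather than a working config.

```yaml
# Hypothetical Nodepool configuration sketch; names are invented and the
# exact schema varies between Nodepool releases.
labels:
  - name: ubuntu-xenial        # the name jobs use to request this kind of node
    min-ready: 2               # keep a couple of nodes booted and waiting

providers:
  - name: example-cloud-region-1   # one region of one cloud; OpenStack uses ~20
    cloud: example-cloud           # credentials come from clouds.yaml
    diskimages:
      - name: ubuntu-xenial        # the base image rebuilt and uploaded daily
    pools:
      - name: main
        max-servers: 100           # cap on concurrent VMs in this region
        labels:
          - name: ubuntu-xenial
            diskimage: ubuntu-xenial
            min-ram: 8192
```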
It's also useful to know the definitions of a couple of words that I'm going to use as we go on. The first is gating. Gating is very central to what Zuul does; it's actually the reason we created it in the first place. And it's a very simple idea: every change that's proposed to a repository is tested, and it must pass those tests before it merges.

Co-gating is a variation on that where, if you imagine that you have more than one related repository, you make sure that changes to each of those repositories are tested against the state of the others before each one merges, essentially so that they can't break each other. You can imagine, in a world where you're building up complicated services out of microservices, or, in OpenStack's case, macroservices, that being able to make sure that one of those services doesn't break another, or the whole, could be a very useful thing.

And then finally, parallel co-gating is what Zuul actually does at this point in time. Take what I just said about co-gating, where you're saying, I'm going to land a change to this project and a change to another project, and I don't want them to break each other. You might implement that by just testing them one at a time, in series. That's very easy to do and it's very correct, but it's also very slow, and there's no way we could land the change volume that we do if we did it that way. So what Zuul does is create speculative future states with a bunch of changes together and test them all with the assumption that they're all going to pass. If anything changes, it rearranges things as needed to make sure that they can merge. So we can get a very high throughput of changes merging by running all these tests in parallel.

I'm actually sort of a visual person, so if that explanation left you confused, or the way I rambled on about it made it confusing, I have a little visual illustration of the process that can hopefully help you walk through it. Imagine that we have two projects; let's call them Nova and Keystone, and they have two Git repositories. Up there on the screen are two yellow dots with some funny hexadecimal words next to them: those are the branch tips of those two repositories. A developer might come along and approve a change to Nova. Zuul notices that, queues it into its pipeline, and starts running jobs for that change. Then another developer comes along and approves a change to the Keystone project, and Zuul starts running jobs for that change too. And then, say, a developer approves two more changes to the Nova project. At this point, we've got four changes in Zuul's queue, and jobs are starting to run on all of them. I'm sorry, the blue is a little hard to see there, but it's meant to illustrate the job run time as this goes on.

The thing to know about this series of changes is that the first change, Nova number one, is being tested against the tip of the Nova repository. The second change is being tested against the tip of the Keystone repository, but any integration tests that involve both Nova and Keystone also include the change ahead of it, that number one change. And by the time you get down to the bottom of the screen, Nova number four is running with all four of those changes in place.

So then something terrible happens, and one of the jobs on that Keystone change fails. At that point, the future state that Zuul created in order to test changes three and four is no longer valid. We know, since we don't allow any change to merge if it fails its tests, that Keystone number two is not going to merge, and so Nova three and Nova four are being tested with a change that isn't valid anymore. So Zuul cancels the jobs that are running for those changes.
It moves the Keystone change off to the side, because we want to keep running its jobs so that we can report back as much information as possible to the developer. But then Zuul reparents changes three and four on top of change number one and restarts their jobs, so they are now running only with the changes that could possibly land ahead of them. Those jobs continue. The jobs for that first change finish and succeed, so we merge number one into the Nova Git repository. The rest of the jobs for Keystone finish, and at this point we're ready to report back to the developer: this change failed its tests, so it didn't merge, but here are all the results, you can go look at them. And then changes number three and number four finish up and merge.

So that was a series of changes that were enqueued in Zuul in an arbitrary order, just by the happenstance of when people approved them. Zuul also gives us the ability to control this deliberately: we can create a series of changes and tell Zuul that they depend on each other, and Zuul will enqueue them in that order so that they're all tested appropriately. Imagine making a change to an OpenStack project; say you want to add a change to Nova that returns the SSH host keys in the instance metadata. I think that would be a great change. What do you think, Monty? If you implement that change in Nova, that's really just the start of the process. Perhaps in order for that change to take effect, you need to configure DevStack to turn the feature on with a feature flag or something. I don't know why, because it should be the default, but let's say you have to do that. So you write a change to DevStack, and you say that this change depends on the change you just wrote to Nova. That means that when the tests run for your DevStack change, they will pull in your change to Nova and run with it. After that, you might need to write a change to novaclient to actually expose this in the Python API, and you can say that depends on the DevStack change, so that when novaclient interacts with a running DevStack Nova instance, it's able to actually fetch that data. And let's say that your ultimate goal is to add this to Nodepool. Well, Nodepool uses the Shade library, so you might need to add support for this to Shade, and since Shade uses novaclient to get this information, the Shade change depends on the novaclient change. Then finally, in Nodepool, you can say, my Nodepool change depends on the Shade change. When the jobs for this Nodepool change run, they will contain all of the changes ahead of it. So you can test this entire process end to end, from exposing the feature in Nova to using it as an end user in Nodepool, all in our CI system, without landing a change to any of these repositories. Then you can go to the developers of all those different projects and say, look, this thing works end to end, let's land it. That's a very powerful feature that Zuul provides. (There's a sketch of how these dependencies are written just below.)
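Concretely, these cross-repo dependencies are expressed with a Depends-On footer in the Git commit message, referencing the Gerrit Change-Id of the change being depended on. A minimal sketch, with made-up Change-Id values:

```
Enable SSH host key publication in DevStack

Turn on the (hypothetical) feature flag so that the host keys
added by the corresponding Nova change show up in the instance
metadata during DevStack runs.

Change-Id: I7fa8c4d2e9b1a3c5d7e9f1a3b5c7d9e1f3a5b7c9
Depends-On: I2b4d6f8a0c2e4a6b8d0f2a4c6e8b0d2f4a6c8e0a
```

Here Depends-On points at the Nova change, so Zuul pulls that change into the prepared Git repositories whenever it tests this DevStack change.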
So, one of the things that we've wanted to do in Zuul for a while, and I've been spending quite some time talking about all these tests we run, it turns out that when we change Zuul's configuration itself, it's surprisingly untestable at the moment. I mean, we run a lot of tests that validate the configuration syntax, make sure you don't do anything silly, and that you have the appropriate amount of white space. But what we can't do, in some cases, is say, well, I want to actually run this configuration in a job and make sure it works before we land it. Sometimes we just have to say, yep, that looks right, land it, and exercise it. So, in the height of irony, our CI system itself is not as CI'd as we would like it to be.

In Zuul v3, we're working on a feature that allows Zuul to process configuration changes to itself live, on the fly. And the way we do that is kind of interesting. When Zuul starts up, it has almost no configuration whatsoever. This is essentially what a Zuul bootstrapping configuration file will look like in v3 (there's a sketch at the end of this section). We say that we have a tenant called openstack, because Zuul v3 is multi-tenant aware. It talks to a Gerrit. Its configuration can be found in a Git repository called project-config, which is the repository where we have all of our jobs defined right now, centrally. But there are also a bunch of projects listed, which are basically the Git repositories that Zuul itself is working on: Nova, Keystone, devstack-gate, et cetera, and we have about 1,500 more of those. Zuul will actually pull its configuration from all of these repositories. The reason there's a distinction between config repositories and project repositories is that there are some things we still only want to be able to manage centrally, for instance jobs that require authentication credentials; we can't have somebody uploading a job that says, hey, print out the authentication credentials. So there is a little difference between the config and project repositories, but in general, Zuul will be able to pull its configuration from any repository listed in this config.

So when Zuul starts up, it loads that config file and then talks to those mergers I mentioned earlier; as I said, we have eight of them. In our system, it's going to say, hey, mergers, go to all these repositories and get me a list of all of the branches in all of them. They'll go churn on Git operations for a while and return that list, and then Zuul says, okay, mergers, go to every branch of every repository, look for a zuul.yaml file, and return that to me. So as Zuul starts up, it goes from having no configuration whatsoever to having this fleet of distributed workers collect its configuration from thousands of different places and return it to Zuul, where it's assembled into a holistic configuration that applies to the whole system. And because of the way that runs, and because Zuul is seeing every change that goes through the system, whenever there's a change to a zuul.yaml file, Zuul can go back out to the mergers and say, hey, look, there's a change that might be changing my configuration, go get that content for me. It then takes what it gets back from the merger, splices it into the running configuration, and uses that, just for that change, to decide what jobs to run and how to run them. And because this is implemented with the regular Zuul primitives, it works with cross-repo dependencies too: you can have a change in one project that depends on a change to Zuul's configuration in another project.
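Here's a rough sketch of that bootstrapping tenant configuration, paraphrased from the slide. The exact key names shifted during v3 development, so consider this illustrative rather than authoritative:

```yaml
# Illustrative Zuul v3 tenant bootstrap config; key names are paraphrased
# and changed during v3 development.
- tenant:
    name: openstack
    source:
      gerrit:
        config-repos:
          # Trusted, centrally-managed configuration, e.g. jobs that
          # handle authentication credentials.
          - openstack-infra/project-config
        project-repos:
          # Ordinary repositories whose zuul.yaml files Zuul also reads.
          - openstack/nova
          - openstack/keystone
          - openstack-infra/devstack-gate
          # ...and roughly 1,500 more
```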
So, with... Mm, yes? "Can I break Zuul this way?" I'm sorry? "Can I break Zuul this way?" Right now, probably, but I hope we'll have enough testing that you won't be able to break Zuul that way. Part of the config repo and project repo split is to make sure that things that are very dangerous can't break Zuul like this. And of course, if your change to the zuul.yaml fails, if Zuul can't even parse it, then that change is going to fail its tests, so it's essentially being CI'd in the normal way.

So, with that background information, I think it might be useful to show you some of Zuul's configuration. Zuul is a very free-form system. A lot of the concepts in how we've defined OpenStack's workflow are not understood by Zuul natively; we've built them out of very simple configuration primitives. And with knowledge of just a few of these, you can build very sophisticated systems.

The first primitive is what we call a pipeline, and that's essentially a process definition that connects up Git repositories, jobs, and triggering and reporting mechanisms. Pipelines are basically the same in v2 and v3, so if you've done anything with Zuul in v2, this doesn't change a lot between two and three. In OpenStack, we have something we call a check pipeline, and that's the pipeline that runs jobs on a change when somebody uploads it for the first time. This is a simplified view of what that configuration looks like for us (both pipelines are sketched after this section). We basically say that this pipeline looks for triggering events from Gerrit: patchset-created is what Gerrit emits when somebody uploads a new patch set, and change-restored is when somebody restores an abandoned change. There are a few other things we listen to, but I've omitted them for brevity. A pipeline is a free-form thing where you say what triggers enqueuing into the pipeline, and when jobs run, we report back via any number of mechanisms. So down at the bottom, we say that when jobs succeed in this pipeline, report back to Gerrit with a Verified +1 vote. We could also send email, report to a database, throw something into IRC, or do nothing at all, in fact.

We also have something we call, in OpenStack, the gate pipeline. This is the thing that actually prevents changes from landing unless they've passed tests. There's a subtle difference between this one and the check pipeline: it says manager: dependent, where the check pipeline says manager: independent, and that triggers the behavior change that causes things to be enqueued one after the other. Again, we're looking for events from Gerrit, except this time we're looking for a comment event saying that somebody has approved this change with a Workflow vote. Then, if changes pass their tests, we report back to Gerrit with a +2 vote in the Verified column and submit the change; submit is Gerrit-speak for merging the change. And that's how that magic happens.
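Here's roughly what those two pipeline definitions look like, simplified and paraphrased from the slides; exact event and option names can differ slightly between Zuul versions:

```yaml
# Simplified sketch of OpenStack's check and gate pipelines.
- pipeline:
    name: check
    manager: independent           # each change is tested on its own
    trigger:
      gerrit:
        - event: patchset-created  # somebody uploaded a new patch set
        - event: change-restored   # somebody restored an abandoned change
    success:
      gerrit:
        verified: 1                # report Verified +1 back to Gerrit
    failure:
      gerrit:
        verified: -1

- pipeline:
    name: gate
    manager: dependent             # changes are enqueued one after another
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - workflow: 1          # an approver left a Workflow +1 vote
    success:
      gerrit:
        verified: 2
        submit: true               # "submit" is Gerrit-speak for merge
    failure:
      gerrit:
        verified: -2
```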
Once you've got some pipelines defined, you might want to configure some jobs. Jobs, as I mentioned earlier, run on nodes from Nodepool. Right now, all of our nodes are dynamic, but in v3 you're going to be able to check out static nodes from Nodepool as well and run jobs on them. The metadata for jobs is all defined in Zuul's configuration: things like how long the job runs, when it should run, what kind of nodes it needs, things like that. But the actual execution content of the jobs is defined in Ansible, because we did not need to invent another way of describing how to run things on remote hosts; there's a great tool out there that does that already, called Ansible, so we decided to use it. Jobs can be defined centrally, like in the project-config repo, or in the repository being tested, and the same goes for their execution content. In v3, the configuration syntax is also a lot more flexible: we have contextual variants of jobs that run with slightly different parameters based on different conditions, and I'll give you an example of that in a minute.

Jobs also have an inheritance structure, so you might start by writing a job that sets up some sensible defaults for your whole system. This is a job called base, which has a 30-minute timeout and runs on a single Ubuntu Xenial node. When Zuul prepares all of the Git repositories needed for a change, it puts them in a place called /opt/workspace. Before the playbook for the job itself runs, a setup pre-run playbook runs, which might do things like set up local configuration needed for CI; post-run is what to run after the job has completed. Once you've defined your base job, a typical job might be very simple: you might just say there's a job called python27 and it inherits from that base job, and then Zuul knows to look for a playbook called python27, and that's what it'll run for the job. A variant on that job might say, well, if you're going to run the python27 job on a change to the stable/mitaka branch, use a Trusty node instead of the Xenial node.

If you need to run a job that requires more than one node, you might have a definition that looks like this, where instead of saying your node is ubuntu-xenial, you list out your nodes: I need a node called controller, and it should be an Ubuntu Xenial node, and then I need a second Xenial node called compute. Zuul will take both of these nodes and drop them into the Ansible inventory file with those names, so you can refer to them by name in your Ansible playbooks.

Once you have jobs and pipelines defined, then you're going to want some projects defined, which hook those jobs up into those pipelines. A simple version of the Nova project might look like this: this is a project whose name is Nova, and for any change that matches the requirements for entering the check queue, run the python27, python35, and docs jobs, which would, of course, be jobs you've defined previously. In the same way that you can have job variants in the configuration file generally, you can also have project-local job variants here. So here we're saying, in addition to those three jobs, also run the pypy job, except on Nova the pypy job isn't voting, so run it as you normally would, but don't vote. Another slightly more complicated variant might say, only run the docs job if somebody changes files in the docs directory. That's actually a terrible idea, never do that, but the reason is complicated. Still, this sort of thing is very flexible, and there are sensible rules you might want to set up to run jobs under certain conditions. (The sketch below pulls these job and project definitions together.)
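Pulling those pieces together, here is a paraphrased sketch of the job and project definitions just described; the shapes follow the talk, but the exact v3 syntax evolved during development, so take the key names as approximate:

```yaml
# Paraphrased sketch of Zuul v3 job and project definitions.
- job:
    name: base
    timeout: 1800              # 30-minute timeout, in seconds
    nodes:
      - name: primary
        image: ubuntu-xenial   # a single Ubuntu Xenial node
    workspace: /opt/workspace  # where Zuul places the prepared Git repos
    pre-run: setup             # playbook run before the job's own playbook
    post-run: collect-logs     # playbook run after the job completes

- job:
    name: python27
    parent: base               # inherit all the defaults above

- job:
    name: python27             # contextual variant of the same job
    branches: stable/mitaka
    nodes:
      - name: primary
        image: ubuntu-trusty   # the stable branch tests on Trusty instead

- job:
    name: multinode-example    # both nodes land in the Ansible inventory
    nodes:
      - name: controller
        image: ubuntu-xenial
      - name: compute
        image: ubuntu-xenial

- project:
    name: openstack/nova
    check:
      jobs:
        - python27
        - python35
        - docs
        - pypy:
            voting: false      # project-local variant: run, but don't vote
```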
And here's another example of a variant, where we're saying: no matter what, run a python27 job; normally run it on a Xenial node, but if you're running on stable/newton, run on an Ubuntu Trusty node, just for Nova. I don't even know if that makes sense, but that's the sort of thing you can do with local variants. And then, of course, jobs can have dependencies on each other. Here we're saying, in the release pipeline, which is something we run when somebody tags a repository: build a tarball for that repository, upload it to our tarballs site and to PyPI, and if that succeeds, update our local mirrors. Each of these jobs only runs if the job before it succeeded, so you can build up a pipeline of jobs this way.

And then finally, as I mentioned, the actual execution content of jobs is specified in playbooks. Playbooks can be defined centrally or in the repository being tested. They can use roles from other projects inside of Zuul, or roles from Ansible Galaxy, and we expect our playbooks to be very heavily role-based. For instance, a simplified view of the playbook we use for running our DevStack and Tempest tests might look sort of like this (there's a sketch below): first set up the multi-node networking for the nodes we're running on, partition the swap, configure the mirrors, then run DevStack, then run Tempest. All of these roles can be defined in other repositories, and they can be used by many different projects, not just the projects we're using them in today. In fact, a lot of projects in OpenStack run tests on our DevStack nodes, using DevStack to set up the environment essentially just to get some of the side effects of how we set up that test. So decomposing the monolithic job we have today into multiple roles, and making them available to any other project, will make the configuration of all of these projects a lot simpler and allow for reuse across them in a way that we're not able to do today.

Of course, maybe you don't need all of the advanced features of Ansible. There's a lot you can do with Ansible in Zuul, but maybe you just need to write a job that runs a shell script; that's a pretty common pattern in a CI system. You can do that too. Ansible is perfectly capable of running some shell, and so a playbook that just runs your run_tests script would look like this. If you need Ansible to get out of the way, it can. But the thing we really like about this system is that it lets you use the full power of Ansible not only in your production system, but in your development system and your CI system as well. And if you write your playbooks well, you'll be able to use the same playbooks in both places: you can use your production playbooks to run your CI testing, and Zuul will just provide them with different variables files and different inventory files, and magic will happen.
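Here's a sketch of both playbook styles just described. The role names mirror the steps from the talk but are hypothetical, and in practice the two plays would live in separate playbook files:

```yaml
# Sketch 1: a heavily role-based playbook for the DevStack/Tempest job
# (role names are hypothetical, mirroring the steps described above).
- hosts: all
  roles:
    - multi-node-networking
    - partition-swap
    - configure-mirrors
    - run-devstack
    - run-tempest

# Sketch 2: the minimal "just run my shell script" job.
- hosts: all
  tasks:
    - name: Run the project's test script
      shell: ./run_tests.sh
      args:
        chdir: /opt/workspace   # assumed location of the prepared repos
```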
So with that, are there any questions? Yes. The question is, could you swap Ansible out for some other system, like Puppet or Chef? But you also said private cloud, so I may not fully understand. We're basing this version pretty heavily on Ansible, and we're not looking at having a facility to swap in some other mechanism for running jobs. It's theoretically possible; at some point you're handing something off to another component, and it could run. But in OpenStack, while of course there are OpenStack projects that use Ansible for deployment, there are also OpenStack projects that use Puppet and Chef and whatnot, and we're going to run the jobs for those with this system as well. Whether that ends up looking like an Ansible playbook that just says run Chef or run Puppet, that's of course an option. There is in fact an Ansible module for running Puppet that we wrote and use in our infrastructure, because we actually use Puppet to manage the OpenStack infrastructure itself. That module is really good at running Puppet, so probably what we will do, in fact, is end up writing playbooks that say, use the Ansible Puppet module to run Puppet, and we'll have some really good integration there.

Yes? The question: are you planning to have an eventing system that isn't Gerrit-based, so you can trigger things that aren't coming from Gerrit but from some other location? Yes, absolutely. We're going to make that an extension point, and there will be a nice API there, so it'll be easy to add new triggering and reporting mechanisms. The first one we're going to add is probably going to be for GitHub, because we'd like to run this CI system for the Ansible project itself, and they use GitHub. We're actually looking forward to using it in OpenStack as well, because sometimes we have dependencies in OpenStack on projects out on GitHub, and we would love to be able to say this patch depends on a change that's a pull request in GitHub; the interaction between those two systems, I think, is going to be really cool. And yes, the triggering system is separate from the source system. There need to be Git repositories involved at some point, at this time, but the triggering doesn't have to come from where the Git repositories are. Where it says source: gerrit, and I glossed over this because they're the same here, source: gerrit says get the Git repositories from Gerrit, while trigger: gerrit says get the events from Gerrit. So you could say source: gerrit and then trigger on, say, an RPM build happening somewhere.

With the Ansible runs, how does the console output link back to the publishers? In the current version of Zuul, we've been trying out some ways of getting that information back. Monty Taylor has written a really cool Ansible plugin that basically substitutes for the Ansible command module and captures the standard output, and it goes into a log file that can be streamed in real time by users. So you'll be able to have the sort of real-time streaming log file that you're used to in CI systems without having to change your playbooks at all. Does that answer the question? Yeah, cool. Anything else? All right, thank you all very much. Thank you.