Hello, okay, great. I'm going to get started since it's 11:40. So hey, welcome everyone to our presentation. Thanks for coming. If you haven't guessed by the title, we're going to be spending some time today talking about how you can use Zuul for your own systems. We're mainly going to be going over why we moved to Zuul and some of the benefits that came with it, especially compared to Jenkins; some fundamentals of Zuul and how to onboard your own projects to Zuul; and a retrospective of our Zuul journey so far.

So without further ado, more introductions. I'm Rich. This is Howard. We're both engineers on the Workday Private Cloud team, which we affectionately call WPC. But more on that later.

So before we get into the technical weeds of how we use Zuul, we just want to set some context for what Workday is, for those of you who might not know. Workday provides cloud-based software for companies to manage their finance and HR more efficiently. It sounds simple on the surface, but as you can see, we operate at quite a large scale. Workday currently provides services to over half of the Fortune 500, and this amounts to over 60 million workers. On top of this, Workday strives to maintain a customer satisfaction rating of over 95 percent, which we've successfully achieved for over a decade. So that's something that Workday really prides itself on. So how do we achieve this level of scale and reliability?
It's such a high rate. And of course we use OpenStack, if you haven't guessed. We began our OpenStack journey many years ago, starting with our first production clusters running Icehouse, and since then you could say we've developed a bit of a bigger footprint. We're a relatively small team inside of Workday, consisting of two development scrum teams and one SRE team. Together we run 87 clusters and growing. Each cluster contains up to 300 compute nodes, and all in all this amounts to 3.25 million cores, 12 and a half petabytes of RAM, and 85,000 concurrent VMs. We also hold ourselves accountable to a 99% API success rate during Workday's several-hour weekly maintenance window, where we basically destroy and recreate the entire Workday stack. This amounts to 241,000 VMs recreated weekly and, as you can imagine, puts quite a lot of load on our control planes.

So given all this information, you can start to see why Zuul and CI are so important to us. It's imperative that our platform stays extremely reliable so that we don't produce any kinds of regressions, either in performance or in deployment. We need to be accountable to our customers in our patch window.

So before we start talking about our current system, I want to talk a little bit about what we dealt with before we had Zuul, in our older version of WPC. This should help paint a picture of why we decided to transition to Zuul. Note that some of this obviously isn't ideal, and we're showing it for the sake of comparison, but it's something that worked for us at the time and scaled well enough for us.

But obviously, we used Jenkins to run our CI. We shared this with other teams in Workday.
So there was another team that managed Jenkins for us. It was great for maintenance overhead, but not great for flexibility and speed. For instance, if you wanted to set up a new pipeline, you'd have to ask another team and wait several days for them to get back to you, and you just generally didn't have a lot of power.

We had limited jobs pre-merge: basically a couple of tests just to check a single patch and make sure that it seemed okay. And we relied entirely on periodic jobs post-merge to catch any regressions. The main consequence of this was that all code was merged by hand. It didn't matter how or when you got a +1 from your CI or a +2 from another person on your review. As long as you got it at some point, you could merge. So if someone reviewed your code on Friday, you could merge on Monday and just pray that it worked.

We also used Chef for our deployment. This was a weekly point-in-time snapshot of Chef changes across multiple Workday teams, including our own. So when we had to roll out changes, we'd roll out everyone's changes at once, including Workday infrastructure changes. Feature and bug fix rollout was gated by config changes. We didn't have any stable release branches, and we couldn't really do this at the time.

So given all these problems with our old CI, what kind of benefits did Zuul bring us?

The first one, as I mentioned before, is stable releases. Previously, because we used point-in-time snapshots for releasing, there wasn't really any need or even ability to branch. But with Zuul we can enable job variants and actually use branching to create stable releases. This allows us to upgrade and roll back our code much more easily. We can also release our own OpenStack-specific changes independently from company-wide changes, so we're not at the mercy of this big point-in-time blob of Chef changes.

Portable jobs are great too. This essentially refers to the idea that given any Zuul job, you can run the same job
across any repo to ensure that all changes are compatible. So for example, if you have repo A and repo B, and they're both used to build WPC, and we know that the same job works to test both of them, we can be much more confident that our changes will not break our system. That's in addition to testing cross-project dependencies, which we'll talk about a little more later.

Logging's another big thing. It's hard to overstate how much easier it is to look at logs in Zuul compared to Jenkins. You can expand and collapse your pre and post Ansible jobs and get nice color-coded Ansible hints for which tasks may have been skipped, passed, or failed. Whereas Jenkins is just this huge blob of text: it executes a thing and you have to manually search for errors in this really not user-friendly way. Even better, we found it useful to create our own ARA server at some point for further Ansible logging, but we can talk about that a bit more later.

Gating is probably one of the most important parts of Zuul. As we mentioned before, we had a semi-manual version of project gating in Jenkins: we would allow our test suite to run first, but we still ended up merging the code by hand. But now we've gone past the days where two incompatible changes could break our CI because of race conditions when changes were merged. So for example, and this is what I was getting at before, if we have two developers who get +1s on a Friday and their code looks good to them, they merge on Monday, and then suddenly those two changes break each other. That doesn't happen to us anymore. Zuul allows us to take advantage of parallel speculative execution of our tests, so we move much quicker and we don't break things nearly as much.

Autohold's also really useful. In Jenkins we had pretty limited control over our system due to not being the owners. So if a job failed, there usually wasn't really much we could do besides just rerun the
test, or make some changes and see what happened again in Jenkins. But with Zuul you can actually autohold your nodes in the event of a failure. This helps solve the age-old riddle of "Well, it works on my machine, why doesn't it work in the CI?", because you can actually pause the CI, look at it, see exactly what's happening, and run the same commands.

And of course all this culminates in fewer blockages and happier developers, which is great. Anecdotally, our on-call shifts have seen far fewer breakages in our CI, and it's pretty noticeable. I don't have data to back that up, but you can hopefully take my word for it. So I'll hand over.

Yeah, so we look at this as a two-part conference talk. Last time, in Berlin, we talked quite a bit about our migration and how we converted the OpenStack part of our CI to using Zuul. Today, for the rest of the talk, we wanted to talk about how you can use it, and how you can run non-OpenStack projects in Zuul as well. So the crux of this is: you ought to use Zuul.

Now, while we need to assume you have a Zuul system, I figured it'd be good if I showed a simplified layout. You'll notice that in this layout Zuul uses Gerrit, so you've got to have a Gerrit system running, because that's where it's going to pull from, and it's going to check out different changes and branches of each of your projects there. Then it farms this off, using Ansible, out to these ephemeral nodes. We use VMs for ours. They're just OpenStack VMs that we spin up for these kinds of processes, and Ansible runs against them.

So let's show a few things that you need to know to convert your existing job over to Zuul. Okay, here we go. Sorry, there was a blank screen. In your project, you create a zuul.d directory, and it has two files in it.
There's a jobs file. The jobs file defines the jobs: linting, unit tests, publishing, anything along those lines. Then there's a project file, and this references those jobs for the various pipelines, like check, gate, and post, which runs after merging. Then you create the jobs. Now, most of these jobs are Ansible playbooks, but we'll talk about a nice tox layer that you can use.

Now, this is an example project file. I tried to make it interesting, and I may have opened up more questions here. I'm doing a release inside of a gate. This relies on delegating secrets, which I'm going to talk about later. You should probably trigger your releases from the post-merge pipelines, not during the gate as shown here. But keep in mind the pipelines come in two flavors, pre-merge and post-merge, and some things like secrets are only available in post-merge pipelines. You generally want to keep your pipelines as similar to each other as possible. That eliminates some of the problems you might have, like when there's an error and you don't know whether it's your code change, an environment change, or a change in something that's being integrated. So keep the pipelines, or the jobs, similar.

Okay, there's the jobs file. So here's my example jobs file. The names are referenced from the project file, and they each specify one or more playbooks. Note that the playbooks run somewhat in isolation, so it's difficult to pass state from one playbook to the next, even if it's part of the same job.

Now, a playbook called by Zuul can do all the normal stuff that Ansible does. But note that you can't become root on localhost, because, as you saw in that initial architecture slide, that's the executor. So most of these things you'll want to run on the builder. Rich, you want to talk about tox?
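
For readers following along without the slides, the two files described above might look roughly like the sketch below. This is not the configuration shown in the talk: the job names, playbook paths, and the `base` parent are assumptions for illustration, and the pipeline names (`check`, `gate`, `post`) depend on how your own tenant is configured.

```yaml
# zuul.d/jobs.yaml
# Hypothetical job names and playbook paths, for illustration only.
- job:
    name: myproj-lint
    parent: base
    description: Run linters against the change.
    run: playbooks/lint.yaml

- job:
    name: myproj-unit
    parent: base
    description: Run the unit tests.
    run: playbooks/unit.yaml

- job:
    name: myproj-publish
    parent: base
    description: Publish artifacts once the change has merged.
    run: playbooks/publish.yaml

# zuul.d/project.yaml
# check and gate run the same jobs, so a gate failure is easier to
# attribute; publishing is attached only to the post-merge pipeline.
- project:
    check:
      jobs:
        - myproj-lint
        - myproj-unit
    gate:
      jobs:
        - myproj-lint
        - myproj-unit
    post:
      jobs:
        - myproj-publish
```

Keeping the release job in `post` rather than `gate` follows the advice given above about triggering releases from post-merge pipelines.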
So if you're not using Ansible and you want to go with a pure Python project, tox is a great way to run tests for your project. For those of you who don't know what tox is, a quick summary: tox is a framework that helps you standardize your Python testing environments. So for example, let's say you need to maintain your Python project for multiple versions of Python. You can set up the test environment appropriately for each version, install dependencies particular to those Python versions, run your tests, and ensure that you haven't broken your project for some other version of Python.

As you can see, it's pretty simple to set up tox. Honestly, you pretty much just define a tox.ini file in your project, and then you can call it using the tox CLI. In this example, notice that we set up a linter environment and a unit test environment for Python 3.8. And note that even if your project is not pure Python, you can still at least take advantage of it as a linter, especially in conjunction with Zuul, because it sets it all up for you. So for example, if you want to lint some Ansible playbooks, you can use tox for that too. It's not limited to just Python, even though that's probably its best use case.

Yeah, so this is pretty much all you do to set it up. Once your tox.ini is done, you just set it up in your project YAML: you call the tox linters and the tox Python environments, whichever ones you've set up, and Zuul figures out the rest for you. This is what happens when you're able to leverage existing community-built jobs in Zuul, which is really great for us.
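
As a sketch of the project YAML just described, the snippet below attaches the community tox jobs to a project. It assumes the zuul-jobs repository is available in your tenant and that your tox.ini defines `linters` and `py38` environments; neither detail is confirmed by the talk.

```yaml
# zuul.d/project.yaml
# tox-linters and tox-py38 are jobs from the community zuul-jobs
# repository; they run the "linters" and "py38" tox environments.
- project:
    check:
      jobs:
        - tox-linters
        - tox-py38
    gate:
      jobs:
        - tox-linters
        - tox-py38
```

No playbooks are needed in the project itself in this case, because the inherited jobs supply them.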
It's made us a lot more efficient, and we can pick and choose what works for us from the community.

So if you're wondering how this works, because I know I've abstracted a lot of it and just said, "Oh yeah, just put this here and put this here and everything just works together," we can talk a little bit about how it works under the hood, just so you have an idea. How Zuul generally works is that it has a set of pre and post Ansible jobs that set up and tear down the system. This helps keep the logic of your tests separated from the Zuul internals. You can have any number of pre and post jobs that get called in a hierarchy, which we show on the next slide, but all jobs have some amount of privileged setup, like setting up SSH keys or getting the change ID from Gerrit. You can see that's in step one. Then in steps two and three, tox runs its own unprivileged pre, post, and base jobs to do its own setup and teardown and to run the tests. It calls bindep to install the binary dependencies, then it actually does the job of installing pip and tox, and then it finally calls the tests and the linters. Finally, after all that's done, the privileged phase from the trusted base job uploads the logs to Zuul.

So again, it's all done for you. You can look at the upstream jobs: people send commits to the gate in the OpenDev repositories, and you can honestly just click on the Zuul jobs there, see what they're doing, and mirror what's happening there on your own system. So it's a great example to follow.

So here's what I was alluding to before: our Zuul job hierarchy. You can see every job has a parent job, which may have its own parent job, etc.
And so on, until it gets to the base. You can see that the tox run.yaml is basically where your actual tests are run, and everything else is all the setup and teardown. And again, each parent can have one or more of these pre and post jobs. Notice how some of these jobs are trusted. I alluded to that before when I was mentioning privileged and unprivileged jobs, but we'll get to that later.

So yeah, now that you've got your tests set up in either Ansible or tox, let's say you want to test your inter-project dependencies. This is one of the coolest parts of Zuul. All you do is, in your jobs.yaml, you define your required projects. That's in bold towards the center of the slide. In this case, we're requiring the myprojrpm build dependency. Then in the commit message for your separate change, you literally put Depends-On and then the actual Gerrit URL of your change. What this does is Zuul creates a directory for the running job and checks out all of the required projects as instructed. So if you have a Depends-On defined, it will check out that change specifically. Otherwise, it'll just default to the main branch, or pretty much whatever branch you're sending your commit to. So say you're sending a commit to a stable release branch for a bug fix: it'll check out those branches too. So yeah, the inter-project dependencies work great.

And then as a bonus, for those of you who use Kolla: this works seamlessly with Zuul, and it's designed to work this way, which is really great. And it works great for our setup.
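
Going back to the required-projects setup described above, a minimal sketch might look like this. The job name and playbook path are hypothetical, and the debug task is only there to show where Zuul placed each checkout; it is not part of the speakers' configuration.

```yaml
# zuul.d/jobs.yaml
- job:
    name: myproj-integration        # hypothetical job name
    parent: base
    required-projects:
      - myprojrpm                   # checked out alongside the project under test
    run: playbooks/integration.yaml

# playbooks/integration.yaml
# zuul.projects is the job variable listing every repository Zuul
# prepared for this build, keyed by canonical project name.
- hosts: all
  tasks:
    - name: Show where each project was checked out for this build
      debug:
        msg: "{{ item.value.canonical_name }} is in {{ item.value.src_dir }}"
      loop: "{{ zuul.projects | dict2items }}"
```

With this in place, a change whose commit message carries a `Depends-On:` footer pointing at the Gerrit URL of a myprojrpm change is tested against that specific change rather than the branch tip.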
So for those of you who may or may not know, in kolla-build.conf you have a specific syntax that you use in order to install dependencies in your images. It's that crossed-out thing there, as an example. In that example, we're trying to install this myprojrpm into all Heat images. You may initially be tempted to do what is in the crossed-out version, but note that this is actually terrible practice. You should not do this. You're defeating the purpose of Zuul if you do. You don't want to be tempted to directly install packages from a static source URL.

Instead, what works really well for us is that we have a templated kolla-build.conf using Jinja, and we loop over all of the source names that we've checked out in Zuul, on any of our Depends-On commits, and we install from disk instead. It's really important to install from disk, because this will actually properly test your cross-project dependencies. Whereas if you were just installing from this URL, you're going to get the same thing every time, and you're not actually testing the dependencies. Zuul will check out all of the required repos into a specific directory per job run to run the tests.

In the past, we had to unlearn this huge, clunky flow we had with Jenkins, especially for repos that generated packages and RPMs. We used to have an entire workflow where the repo would generate the RPM, we'd upload it to a package repository, and then consume it later in our periodic jobs. This was just a whole mess, and it's something we actually had to unlearn how to do, because it was bad practice and it defeated the purpose of Zuul. So that's just a quick bonus for Kolla integration.

Now, unlike Jenkins, Zuul doesn't have a vault of shared secrets for all jobs. That's by design, as secrets in Jenkins would often accidentally be leaked to the logs, and Zuul's publicly available.
So you don't want to have something like that. While these tools are available, Zuul forces you to think through how you want to share secrets and use the secrets.

As we talked about before, some jobs are trusted and some are not. You see this in the screenshot of just the last few projects on our Zuul system. Untrusted jobs are meant to be jobs that you test before merging. The trusted config jobs are the opposite: they're tested after merging. Don't worry too much about this, but just note that the Zuul base jobs here are the ones that can access encrypted data and make it available to other untrusted jobs. But typically you put your secrets in the project that needs them.

So let's do this with the Zuul client CLI. It's a lot like the ansible-vault command, if you're familiar with that. It allows you to encrypt something into a YAML-formatted data structure. However, it uses a public key rather than symmetric encryption. Then we can take that output and just drop it into our secrets YAML file, and then in our playbooks we have that stuff available, but only in the post-merge pipelines. That is the periodic, post, and tag pipelines, those that happen after merging.

Now in our case, we actually wanted to have the same secrets available to all 200-plus projects that we had, because some were publishing RPMs to our repositories and things like that. So we had a slightly different twist on this.
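
The per-project approach described above, before the twist, could be sketched as follows. The secret name, field names, job name, and `upload.sh` script are all hypothetical, and the ciphertext is left as a placeholder for whatever the encrypt command prints for your project's public key.

```yaml
# zuul.d/secrets.yaml
- secret:
    name: myproj_repo_credentials       # hypothetical secret name
    data:
      username: publisher
      password: !encrypted/pkcs1-oaep
        - <ciphertext produced by the encrypt command>

# zuul.d/jobs.yaml
# Attach the secret to the one job that needs it; run that job only
# from post-merge pipelines such as post, tag, or periodic.
- job:
    name: myproj-publish
    parent: base
    run: playbooks/publish.yaml
    secrets:
      - myproj_repo_credentials

# playbooks/publish.yaml
# The decrypted secret is exposed as an Ansible variable named after
# the secret. no_log keeps its values out of the job logs.
- hosts: all
  tasks:
    - name: Upload the package using the decrypted credentials
      command: ./upload.sh              # hypothetical publishing script
      environment:
        REPO_USER: "{{ myproj_repo_credentials.username }}"
        REPO_PASSWORD: "{{ myproj_repo_credentials.password }}"
      no_log: true
```

Because the ciphertext is encrypted against a per-project public key, copying it into another project's repository would not decrypt there, which is part of why sharing one secret across many projects needs a different approach.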
So we had kind of a slightly different twist on this so Typically the Zool executor has access to those Collection of secrets we put it in the Zool based jobs and it can run these trusted playbooks from from this project Now if you run this playbook On a given worker node here the playbook can then issue a docker login or you know something equivalent to to establish a connection to the trust to a trust store and Then untrusted playbooks executing later than have access to this Now very similar to what we did before we can encrypt the secret, but this time we're going to put it in the the Zool based jobs and Then in the base jobs jobs yaml file We're going to reference that secret that we just added now No, when we run the untrusted playbook that secret won't be there anymore So what we need to do is use a delegate so And I'm afraid this is going to be a homework exercise for each of you because you're gonna have to use your own system and your Own trust stores to get to to put those secrets on those Those build nodes. It's kind of a risky procedure So best if you can kind of create a temporary short-lived password Or token with the longer one that you've stored in your in this the secret playbook so Summarize Yeah, so we'll leave you guys with a quick retrospective and summary of our journey with Zool so far So sorry So despite all the good we've experienced Migrating to Zool. It is worth highlighting some pitfalls and pain points You might have if you are considering this migration for yourself As you can tell and as I mentioned before we no longer have a dedicated team to manage our CI So the the automation for Deploying Zool itself is pretty minimal We kind of just threw something together really fast in like a month or two and it Surprisingly worked really well to the point where we actually it was so stable We actually didn't need to touch it again. 
Hence why some of our automation is actually kind of lacking. It's something that we're working on fixing, but nonetheless it's the current state of things for us. As a result, because our playbooks are not quite as good, we don't upgrade as frequently as we would like, so we are using a bit of an older version of Zuul. That is something that we intend to fix at some point as well.

Also, for us in particular, we don't have great integration with company tools like Jira, Slack, Confluence, etc., because we were so focused on getting our OpenStack and our own Zuul running that we didn't have time to figure this stuff out. It's something that another team would be great at figuring out, but we just have too many things to do.

Another thing we noticed, in part because of how fast we're moving, is that we're having some significant drift between Zuul and our in-house bare metal Ansible orchestration tool. Think basically Ansible Tower, which we use to deploy our own clusters. We noticed that most of our breakages were happening in our bare metal integration. So while Zuul itself was great and the playbooks were stable, they didn't match closely enough to what we did on bare metal, which meant that Zuul and our in-house Ansible tool were not entirely compatible. And again, this is partially because of how fast we were able to move. These things came out of sync with each other. Again, this is something that we're working hard on correcting, and things are in a much better place than they were before.

Workday also has many company-specific steps which we need to follow to set up our infrastructure. So even simple things like calling kolla-ansible within our own Ansible led us to creating our own ARA server, where we could more easily view logs and diagnose problems. So that's even more things that we had to set up.

But despite this, I would say Zuul is overall great. So a pain point was that we don't have a dedicated team to manage our CI, but a win is that we don't have a dedicated team to manage our CI. We get to do everything ourselves. We have so much power, and we can move really, really fast. If you want a new project, you don't have to wait a few days for another team to help set things up for you. You just do it yourself. It takes five minutes, and it's amazing.

As I mentioned earlier, we manage concurrent stable releases using branches, so we don't have this big Chef blob of code anymore.
We can actually release code in an intelligent and safe way, and we don't have to depend on other teams' changes anymore. And as I alluded to before, with some of the things we were working hard on fixing, we have done some pretty major overhauls of our release pipeline several times, and we didn't block anybody. Honestly, I feel like if we were doing this in Jenkins, we would have broken things to the point where it would have taken a while to fix. But Zuul made it really easy to refactor.

Some caveats of Zuul are worth noting if you do decide to adopt Zuul for your own use case. Again, they might not matter for you, but they're things that affected us and are notable.

Zuul, as you can tell, is really, really good at CI. It's great at that, but that means it's not great at executing ad hoc tasks. For example, we had some automation for onboarding new teams to our cloud, and we ran this by kicking off an ad hoc job in Jenkins. Well, we still have to migrate this away from Jenkins. Zuul isn't really a logical place to put it, because it's more of a CI system and not for running ad hoc tasks. So you still need Ansible Tower or some other kind of Ansible executor to handle these situations.

Also note that Zuul does have a fairly high barrier to entry in terms of resources. Nodepool is pretty expensive: it creates a lot of VMs for us and causes quite a bit of churn in our clusters. But note that we already had a pretty solid version of OpenStack running on WPC, so we were able to just stick our Nodepool there and use our existing resources. So that wasn't a problem for us.

And the Jenkins secrets store isn't great, but it at least provides you some infrastructure to manage your secrets, even though it's still not super secure.
You could just log your secrets. Zuul discourages you from following any of these sketchy patterns at all by not even giving you the option to do this. So it tends to guide you towards setting up actual secure external trust stores instead.

So finally, in the spirit of Canada, we'll leaf you with some random lessons we learned.

CI is no substitute for developer workflow. Just because you're using Zuul doesn't mean your system is magically going to be more resilient. You still have to write tests and configure Zuul to use the correct patterns, as we mentioned, like installing from disk.

Every test you hook into the pipeline is an investment that will pay off later. And if you have flaky and slow jobs, you definitely want to chase those down.

Periodic tests are still useful despite the fact that we have gating now. The check, gate, and periodic pipelines all work together to isolate where a failure may have come from: a developer's review, the environment, or multiple incompatible changes.

And finally, I want to emphasize that the benefits of Zuul far outweigh the initial learning curve. It's still honestly kind of shocking how fast we were able to move. We were able to create our own source control and CI without having to depend on another team in like a month or two. It almost seemed too ambitious, but we did it, and you guys can do it too. The amount of time we spent setting up Zuul has definitely been paid back in developer hours and in confidence that our changes are more resilient. And setting up new projects and jobs gets easier with practice, as we've shown you.

So yeah, that's it for us. I don't know if we have too much time for questions, but thank you for joining our presentation. I think it's time to run, but we can take questions out in the hall.