Hi, my name is Simon Guinness, and I'm here today with Jan Gutter. We're going to be talking about building Workday's next generation of private cloud using Zuul. This is us up on the slide, and first we need to set some context, starting with what Workday does.

Workday is a cloud HR and finance firm: we provide HR services to 60 million workers worldwide, and we have about 50% of the Fortune 500 on our books, so you can imagine there's a reasonable load on our systems to deliver the services we provide to our end users. To do that, it might come as no surprise that we use OpenStack, and we've been using it for a long time. Jan and I are part of a nine-person team inside Workday that delivers the automation that builds OpenStack clusters. We also have a complementary team of eight SREs who actually run the production OpenStack infrastructure inside Workday.

Workday has 87 clusters of OpenStack, which is why we need the automation to build out these clusters. Each cluster is about 300 nodes, and the majority of clusters today are running OpenStack Mitaka. It works out to roughly two million cores, 12 petabytes of RAM and 60,000 concurrently running VMs. For us, though, the interesting figure is the roughly 240,000 recreates of VMs each week. The majority of those are crammed into the Saturday patch window: that's where we tear down a lot of our infrastructure and build it all back up again. As you can imagine, in that window we put a large amount of pressure on our OpenStack control planes, which is partially how we end up with so many clusters to manage.

The number of clusters has a few other interesting knock-on effects. One of them is that, because we run things for so long and in so many data centers, we need to maintain multiple concurrent stable versions. If we release a version of our OpenStack distribution, it may take a while before it gets rolled out completely to production, so if a security fix comes along, we need to backport it into that stable version and keep things running. I think this is where you can start to see how Zuul is a natural fit for us, but we'll get there in a moment. We also have a 99% SLO target for our API calls in production, which makes for a challenging OpenStack environment to operate, but it has worked very well.

So how did we do things before? At the moment we're moving into the new world, and what we're working on is something internally called WPC4, the fourth generation of our internal private cloud. As I mentioned, that runs Mitaka. It has scaled very well and works for our use case wonderfully; it's really stable and has worked surprisingly well, but it's starting to hit the scaling limits you'd expect to come along eventually. It did that in a graceful way, and we got very lucky that we had enough time to build out our next version and our next release. So at the moment we're running Mitaka and we're looking to move to Victoria. For the Mitaka generation, we used a shared Gerrit.
That was a Gerrit system shared across all of Workday, with multiple teams, which is brilliant from a shared developer experience point of view, but wasn't fantastic for flexibility or for our ability to integrate with other systems, so that made life a little more interesting than we would have liked. We also used Jenkins. Jenkins did a lot of pre-merge jobs and some periodic runs, but we still ended up with a bunch of things that were merged in by hand. Again, not bad, not ideal, just how it was.

Then we took all of that work and bundled it together with Chef as a deployment. We have all of these parcels of automation that we've built and assembled from our Gerrit, and other parts of Workday, their security infrastructure and that sort of thing, have built their own Chef cookbooks. They all get rolled together into one big point-in-time ball, that's what goes out, and that's what you get. Fine, grand, it works; it's a point-in-time release. But as we all understand, Chef isn't great at orchestration, so orchestration could be reasonably challenging in that particular scenario.

So where do we go from there? Well, it's 2020 and nothing else at all is happening, so we've got a lot of time on our hands for some reason, and we're looking at what we need in the future. OpenStack Victoria seemed the logical place for us to go at the time, and our CI and CD systems needed an upgrade to match: we needed CentOS 8, and we needed containers, because obviously it was containers. But we also had a requirement that all of our systems work perfectly internally with no internet access. Almost all of our systems do not have access to the internet, because a large HR and finance company and the open internet are not an ideal mix, so we like to keep things very secure.

We couldn't find anything in 2020 that came off the shelf and fit perfectly, but we leveraged the upstream community playbooks to build our own. We were able to quickly spin up our own Zuul and our own Gerrit, bring across our existing work, import the community's projects into our Gerrit, and use Zuul to build our own version of OpenStack and get it deployed very quickly. From start to finish, it took about two months to build a version of OpenStack using Zuul and get it out.

Looking back at Mitaka, and the way we had to do things with Chef and Jenkins, we found we had to put a lot of our features and bug fixes behind config flags. A config change would be made, and you would turn a feature on or off. What that meant was that you'd end up with two clusters running the same version that weren't necessarily behaving the same, which is not great. Now, with Zuul, we do it with branches: branches and Zuul make the jobs nice and easy to manage, it all comes together, and we actually get a stable release. So we're looking at managing 87 clusters in a much happier way.
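To make that branch point concrete, here is a minimal sketch of how branch-based job variants can work in Zuul; the job, branch and variable names are hypothetical, not our actual configuration. Defining the same job name on two branches gives Zuul two variants, and it automatically applies the one matching the branch of the change under test, so each stable release carries its own job definitions and can be sunset simply by retiring the branch:

```yaml
# On the master branch of a config repo (hypothetical names):
- job:
    name: build-wpc-release
    parent: base
    description: Build and test the next-generation release.
    vars:
      openstack_release: victoria

# On a stable branch (e.g. stable/wpc4) the same job name becomes a
# branch-specific variant; Zuul applies it only to changes on that branch:
- job:
    name: build-wpc-release
    parent: base
    description: Build and test the current stable release.
    vars:
      openstack_release: mitaka
```

A security fix backported to the stable branch is then gated against the stable variant automatically, with no feature flags involved.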
We didn't go through this without some pain, though, and these are the earliest pain points we experienced. As Simon mentioned, our Zuul and Gerrit instances are still mostly managed and deployed by hand. Partly that is because they've been so remarkably stable that we didn't need to automate them to keep them up. But the lack of this management, of CI/CD for our CI/CD, means we've not upgraded as fast as we would have liked. With managed infrastructure we could integrate much more deeply with Workday infrastructure, adding Jira, Slack and Confluence integrations. But our primary goal was not to build the CI; our primary goal was to build the OpenStack clusters. And the scary thing is that Zuul worked remarkably well with very little intervention.

We also managed to get quite a few wins. We overhauled our entire release pipeline at least twice without blocking developers; I might go so far as to say that some of them didn't even notice. We transitioned from CentOS 8 to CentOS Stream 8 snapshots without any major rework or impact. We manage multiple stable releases using branches, so job variants and their dependencies are trivially maintained, and they can be sunset as we sunset a release. And Zuul's reliance on Ansible overlapped with our usage of Kolla Ansible and our in-house Ansible.

Operating a private Zuul and a public Zuul are very different from each other. You shouldn't really expect that the community jobs that run unit tests or builds will work without modification in your environment. Community jobs are really well written, but they're not necessarily designed to work without community infrastructure. For example, we don't have access to Ubuntu; providing access to Ubuntu inside Workday would have required a lot of work for us, so we wrote our own CentOS playbooks and that kind of thing to manage that. Zuul's design goals align with its use supporting public open source development, so it has a very restrictive security model. That is excellent for public Zuul development, but such a restrictive trust model doesn't fit us as nicely: we need developers to be able to access core stuff without worrying too much about it, and we're not at the scale where it matters yet. We also have internally developed tools for machine image creation, so we're not using Nodepool Builder.

Zuul has some caveats for our specific use case. One of the things we needed a little bit of workaround work for is to tie OS snapshots to releases. We want to tie a specific OS snapshot to a specific Workday private cloud release, and in order to do that, we need to test the full snapshot, with its kernel and security patches, together with the OpenStack release. It's hard to test full OS images speculatively in Zuul: the Nodepool architecture presumes that the node flavor is part of the job description. We work around this by updating to the snapshot at the start of the job and simply rebooting.
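As a rough illustration of that workaround, here is a minimal Ansible sketch of a pre-run playbook that moves a generically provisioned node onto the snapshot under test before the job proper starts. The repository variable is a hypothetical stand-in, not our actual tooling:

```yaml
# pre-run playbook: pin the node to the OS snapshot tied to this release.
- hosts: all
  become: true
  tasks:
    - name: Switch the package repos to the snapshot under test
      # 'wpc_snapshot_repo' is a hypothetical variable naming an internal,
      # snapshot-pinned mirror supplied by the job definition.
      command: dnf config-manager --add-repo {{ wpc_snapshot_repo }}

    - name: Update to the snapshot (kernel and security patches included)
      command: dnf update -y

    - name: Reboot into the updated snapshot
      reboot:
        reboot_timeout: 600
```

The job then runs against the exact image-plus-patches combination that will ship with the release, at the cost of an extra reboot per build.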
Zuul's two-level trust model makes secret handling a little bit difficult for us, but it's still perfectly possible to build a policy framework inside the jobs themselves. We want to have some access to secrets in check, but we access the secrets from a trusted playbook via the executor.

Initially, we thought we could generate some release builds in the tag pipeline, but it turns out that job variants behave subtly differently there. In the gate pipeline, all of the job variants come from the change's branch, but that information is not available when there's a tag event, because Zuul cannot unambiguously resolve branches for tags in general. In Ansible, it's also hard to assert that something has become unreachable without failing the playbook, which means some negative tests, where you deliberately induce failures, are tricky to write.

Now we come to the benefits of Zuul. Holding Zuul jobs is definitely a killer feature; it's one of the first things you should learn as an operator. Writing jobs portably makes integrated testing simpler: if you can run your job from any repo and incorporate changes from there, it helps a lot. Zuul's logging model is really awesome. Gating is awesome. Branches enable job variants, job variants enable stable releases, and stable releases enable faster feature development, so mastering branches is also a good thing. And finally, Zuul tries to steer you away from bad practices, like keeping state where you should be pre-caching.

There's a couple of extra random thoughts here. Your tests and your CI are the biggest investment you should be making, and Zuul helps you with that, because most of the work in Zuul is actually writing the tests. You should not confuse the need for periodic tests and gating tests: a healthy CI requires both of them, and requires a low failure rate in gate and in periodic tests. And finally, be sure to make use of the way that Zuul checks out repos for you. Prefer the state Zuul checked out rather than what's on the Git server; see the short sketch after the Q&A. That's a common mistake a lot of folks make in builds, and if you make that mistake, you break CI and you don't realize it. And thank you very much. Any questions?

[In response to an audience question:] Are we using Zuul to deploy OpenStack too? It is used for our cloud development environments, but not for our production clusters. That's because of the way we have our internal systems laid out across data centers; Zuul might not necessarily have access to everything it needs. Zuul is doing the CI side, not the CD side, unfortunately. The CD side is an entirely separate, in-house developed CD tooling. Our tooling for deployment is based on Kolla Ansible, and the thing that triggers Kolla Ansible is an internally developed system. Thank you very much.
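Returning to that last point about preferring Zuul's checkout: here is a minimal Ansible sketch of what that looks like inside a job playbook. The test command is a hypothetical placeholder; the key part is the zuul.project.src_dir path, which points at the repository state Zuul prepared, with the speculative change under test already merged:

```yaml
# Run tests against the repo state Zuul has already checked out, rather
# than cloning from the canonical Git server (which would silently drop
# the change under test and break the gating guarantee).
- hosts: all
  tasks:
    - name: Run the test suite from Zuul's prepared checkout
      command: tox -e py3   # hypothetical test entry point
      args:
        chdir: "{{ zuul.project.src_dir }}"
```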