Welcome to this presentation about Workday's next-generation private cloud. This is Silvano Bubak, and I'm Jan Gitter. We're engineers from the Dublin office at Workday, and we work on the Workday private cloud team. Workday is the leading enterprise cloud for finance and HR; this is from the 2021 fact sheet. At Workday we take a lot of pride in maintaining customer satisfaction, and this cuts across the entirety of the organisation. So while our infrastructure team contributes to this only in an indirect way, we are very conscious of our responsibility and of what we need to do. Silvano and I joined the WPC team just as we were transitioning from the third generation of the cloud to the fourth generation; I think I joined the week they spun down the last third-generation cluster. From an organisational perspective, we have gone through this process before, and we strive to make it smoother and less impactful with every transition.

Most of our workmates know us by the WPC acronym. We are the team building and operating the OpenStack clusters inside Workday infrastructure. You'll see we are a small team relative to the size of the Workday organisation: we're only eight SREs, supported by nine developers. We provide the infrastructure for all the service tenants that eventually make up the Workday application. Our SLO target is to successfully handle 99% of API calls within timeouts; that is measured across the week, and we take that responsibility very seriously. You'll notice that our cloud is built out of 87 clusters. We call these small clusters, but they have about 300 compute nodes each, totalling about 2 million cores and 12.5 petabytes of RAM. Our application is incredibly RAM-hungry, so the machines actually take a sizeable amount of time to do the power-on self-test and run through the RAM checks. We usually have about 60,000 concurrent instances running across the cloud at any given time. During the week we have about 241 recreates; we call it a recreate when an instance shuts down and something restarts the instance, usually after some maintenance has completed in between. The maintenance could be really simple. You'll also see that this particular design has been consistent across our generations: we've always had multiple smaller clusters rather than one big one.

Our team's purpose is to provide a very resilient platform for the Workday application. We have close neighbour teams that interface with us; one of them provides the platform as a service, and the majority of our production services run on that. We have around 300 compute nodes per cluster, and we try to keep our production clusters as uniform as possible. Something Workday shares with a lot of companies that provide banking services is a regular weekly maintenance cycle, and this provides us with some unique opportunities and challenges each weekend. On each cluster, each service gets its own OpenStack project; this helps with service isolation and simplifies security a little bit. Workday has this concept called the power of one: the Workday application itself has this attribute, and we roll out the same updates to all of our customers at the same time. We don't have a special cadence per customer; all of them get the same updates and security fixes at the same time, and we are very consistent about the cadence with which we do that.
So every week we tend to roll out fixes and updates and so on, so that it's very predictable for our clients. Weekends have usually been our busiest time: we try to do as much maintenance as possible in as short a window as possible, to minimise any kind of impact on the SLOs we achieve. As a platform, our burstiest time is over the weekend, and during the week the rate is lower. We don't have much burst capacity in the compute sense. It's a common pattern for our services to be deleted, do some offline maintenance, and then start up again. That's usually a compromise, because doing online maintenance for these things can, depending on your architecture, take up to a day or so, whereas the offline maintenance is relatively short and can sometimes be performed in minutes. So that is much more a historical artifact than a design we got to choose. Service updates are moving out of the maintenance window, and this trend has been continuing, so it's important to keep the control plane available throughout the week as well. Workday has been shrinking this window, and the trend is definitely towards removing the burst from the weekend and moving everything into a smooth process throughout the week.

We don't just run production clusters in Workday; we also host development services and provide a benchmarking platform for our application engineers. We treat everything as high security, with multiple levels of access control. Our image build, validation and replication services run on a weekly cycle as well, coinciding with our upgrade cycle. Our development clusters have a relatively small number of tweaks compared to our production clusters, because the development workload is a little bit different. One good analogy, if you think about it in bulk terms: we're operating a fleet of clusters, not a fleet of hosts, which is kind of scary.

This brings us to our latest generation of the private cloud. Our fourth generation scaled extremely well, but it was heading into its expected design limits, and in 2020 we started to plan ahead for this current generation. You'll notice from this slide that almost every choice has an adoption cost, but we felt each one reduced toil and maintenance work enough that we could take all of them on. You'll also notice that a lot has changed under the hood, but our interface has remained largely the same for our customers. We made the decision early on, and this is generations ago, to use forklift upgrades rather than in-place upgrades. Partly that's because we have the luxury of being our own client, and partly because this allows us to make fundamental changes to things like networking, which would otherwise be very difficult. It would be very difficult to change from L2 to L3 networking with an in-place upgrade; I don't know how I would do it.

WPC-5 is also our first fully CI-integrated development. We talked about that a little in our lightning talk yesterday about Zuul, and if you get me started on Zuul, I'll never stop. Our default test job in Zuul looks something like this. We start off a set of jobs in parallel that do builds: one of them builds the Kolla containers, one of them builds a deployment container with a set of frozen deployment playbooks, and the other one starts provisioning a virtual cluster. Inside the virtual cluster we set up a virtual BGP network so that we can test our L3 networking. Roughly, the job graph looks like the sketch below.
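To make the shape of that pipeline a bit more concrete, here is a rough sketch of what such a job graph can look like in Zuul's YAML configuration. This is not our actual configuration; the job names and playbook paths are invented for illustration, and only the shape (three parallel jobs feeding one deploy-and-test job) reflects what we described.

```yaml
# Hypothetical .zuul.yaml sketch of the job graph described above.
# Job names and playbook paths are invented for illustration only.
- job:
    name: wpc-build-kolla-containers
    description: Build the Kolla service containers.
    run: playbooks/build-kolla-containers.yaml

- job:
    name: wpc-build-deploy-container
    description: Build the deployment container with the frozen deployment playbooks.
    run: playbooks/build-deploy-container.yaml

- job:
    name: wpc-provision-virtual-cluster
    description: Provision a virtual cluster with a virtual BGP topology for L3 testing.
    run: playbooks/provision-virtual-cluster.yaml

- job:
    name: wpc-virtual-deploy-and-test
    description: Deploy onto the virtual cluster, boot a nested VM and test connectivity.
    run: playbooks/virtual-deploy-and-test.yaml

- project:
    check:
      jobs:
        - wpc-build-kolla-containers
        - wpc-build-deploy-container
        - wpc-provision-virtual-cluster
        # The deploy-and-test job only starts once all three parallel jobs succeed.
        - wpc-virtual-deploy-and-test:
            dependencies:
              - wpc-build-kolla-containers
              - wpc-build-deploy-container
              - wpc-provision-virtual-cluster
```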
So once the containers are ready, the deployment job continues: it pulls them, starts doing a virtual cluster deployment, and eventually goes right up to running a nested virtual machine, testing connectivity to that nested virtual machine and all the bits and pieces of our integration. This process, from source right up to test completion, takes about 90 minutes. It takes a little bit longer just before the sprint ends, when everybody tries to close up their Jiras and merge their code. We're still in the process of converging our multiple workflows: Zuul is a separate workflow from our deployment workflow. We have deployment tests as well, but the more of the common stuff we put before the gate, the fewer problems we have. When we started with Zuul, we somewhat incorrectly assumed that the community build and test jobs would just run on our system, but it turns out those are partly tied into the community infrastructure as well. We also have some unique constraints: we use CentOS, we don't really have access to Ubuntu in our data centers, and we don't have internet access to the community repositories, so we have to create local mirrors and so on.

Something we were quite pleasantly surprised by was the ease with which we could essentially rewrite our deployment pipelines in Zuul and just change the way we build and connect things. That didn't cause any disruption in our development, because the release pipelines and the development pipelines are on separate branches, and managing them involves very little toil for us. That's also why we shifted to a branch-per-release model in WPC-5. Adopting this model was made significantly easier by the excellent tooling that the OpenStack project already has. In our previous generations we had a bundle release: every week we created the entire bundle of all the Chef cookbooks from the organisation, the Chef cookbooks that deployed OpenStack and the code we created, took a checkpoint at specific points, and promoted that as a release through a pipeline. The problem with that was that we tended to gate features and bug fixes behind config changes. That meant the number of branches in our code was something crazy, because you would deploy the code and then change config settings in the cluster in order to activate that particular fix or feature. With WPC-5 we decided to integrate all of that into a particular WPC-5 version, and hopefully that gives us fewer branches in our code and fewer headaches for our SREs who are doing the rollout.

It should also come as no surprise that Workday has needed to build a lot of its own infrastructure. Some of this is historical, because some community projects didn't exist at the time, and some of it is because of our unique environment inside Workday. We have our own DNS infrastructure running outside of the cloud. It's rock solid; it's never DNS at Workday. Our instances automatically get DNS entries, so our tooling uses DNS names to resolve things, and we rarely need to fall back to IP resolution or registration. To check compute node health at scale, we wrote a small service on each cluster that goes and boots a VM on a random hypervisor, tries to connect to it, and shuts it down. This is surprisingly simple and effective, because hypervisor hardware failures are a fact of life. It's one of the nice ways we can detect blips in our infrastructure and keep the service running at a high level; the idea is roughly what's sketched below.
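The idea behind that checker is small enough to sketch. Below is a minimal, hypothetical version of it as an Ansible playbook using the openstack.cloud collection. This is not our actual service; the cloud name, image, flavor, network and hypervisor values are all assumptions, and the real service picks the hypervisor at random and runs continuously.

```yaml
# Minimal sketch (not the real service): boot a canary VM on a specific
# hypervisor, check that it answers on a port, then delete it again.
- hosts: localhost
  gather_facts: false
  vars:
    canary_net: canary-net                     # assumed tenant network
    target_hv: compute-042.example.internal    # the real service picks this at random
  tasks:
    - name: Boot a canary VM on the chosen hypervisor (admin-only "az:host" syntax)
      openstack.cloud.server:
        cloud: wpc                             # assumed clouds.yaml entry
        name: "canary-{{ target_hv }}"
        image: canary-image
        flavor: m1.tiny
        network: "{{ canary_net }}"
        availability_zone: "nova:{{ target_hv }}"
        auto_ip: false
        wait: true
        timeout: 600
      register: canary

    - name: Check that the canary answers (SSH port used here as an example)
      ansible.builtin.wait_for:
        host: "{{ canary.server.addresses[canary_net][0].addr }}"
        port: 22
        timeout: 120

    - name: Tear the canary down again
      openstack.cloud.server:
        cloud: wpc
        name: "canary-{{ target_hv }}"
        state: absent
        wait: true
```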
We have a service managing IP address pools, and we tie that into the way we create our Neutron networks. We have our own internal CA, and we take TLS-everywhere pretty seriously. We have our own Ansible orchestration service to run and deploy playbooks; our production servers live on an isolated network and use this service. Silvano also pioneered a project that aggregates and generates an overview of all of our clusters. This tool is immensely powerful and gives us insight into the usage of our infrastructure. We use it for capacity planning, debugging and tracing, and it's literally the one-stop shop SREs use to figure out which virtual machine from which service is running on which compute node in which cluster.

Most teams in Workday don't deal with us directly; they deal with the platform as a service. That is usually closely tied to the Workday application itself, and it consists of a wide array of common tooling services. We have an image building service that generates images for the various services, tests them, and ensures that they get replicated from the engineering clusters to the production clusters. And we have tooling and teams that run validation across the different pieces. As you can imagine, with the sheer number of clusters we also have a sheer number of services; that's not a rat's nest I'm diving into any time soon. We have our own bare metal provisioning service; we're not the only users of bare metal in Workday. The bare metal provisioning service is deeply tied to Chef and to our DCIM system for bare metal lifecycle tracking. And finally, we have some application-specific tooling that helps us decide in which cluster to boot which instances. We use this to map Workday customers onto the physical resources needed in the clusters, ensure that we've got adequate capacity for them, and plan where to onboard which customer in the future. And now I'll hand you over to Silvano.

So my goal is to give you an overview of the changes we made in Kolla-Ansible. Like I mentioned before, we use Kolla-Ansible to do the deployment, and we are based on Victoria. Some of the changes I mention here already exist in the master branch of Kolla-Ansible, sometimes fully implemented, sometimes not completely. But just to highlight: I'm talking about Kolla-Ansible Victoria, because that is the version we work with. One of the things we spent a lot of hours on was enhancing the TLS support in Kolla-Ansible Victoria. Kolla-Ansible already provided TLS support, but the support was not complete. Let's say most of the services have TLS, but there are a few key exceptions: MariaDB, for example, does not have TLS support in Kolla-Ansible Victoria. Another change was in the way the TLS implementation happens in Kolla-Ansible. Essentially, in Kolla-Ansible, including in their tests, they use a self-signed certificate. When you use a self-signed certificate you usually have just your CA and your cert, and that's it. But if your company has its own certificate authority, or if you buy a certificate from someone else, you usually have a root CA and a number of intermediate CAs. The problem with that is that each service handles intermediate CAs in a different way, as sketched below.
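As a purely illustrative example (these are not our actual playbooks, and the file names and destinations are made up), two of the bundle layouts a deployment can end up having to produce might look like this as Ansible tasks:

```yaml
# Illustrative tasks only; file names and destinations are assumptions.
# Some services want the leaf certificate and the intermediates concatenated
# into one PEM, while others want the leaf cert and a separate CA bundle.
- name: Full-chain bundle (leaf cert first, then intermediates) in a single file
  ansible.builtin.copy:
    dest: /etc/kolla/certificates/service-fullchain.pem
    content: |
      {{ lookup('file', 'certs/server.crt') }}
      {{ lookup('file', 'certs/intermediate.crt') }}
    mode: "0640"

- name: Separate CA bundle (intermediates plus root) for services that take the chain on its own
  ansible.builtin.copy:
    dest: /etc/kolla/certificates/service-ca-bundle.pem
    content: |
      {{ lookup('file', 'certs/intermediate.crt') }}
      {{ lookup('file', 'certs/root-ca.crt') }}
    mode: "0640"
```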
So, depending on the service, it can be as simple as concatenating your cert together with the intermediate CAs, and that's it. Just to clarify how that works: in that scenario your hosts usually only install the root CA, so it is the responsibility of your service to present the cert and the whole chain of intermediates up to the root CA. And like I mentioned, the problem is that each service deals with that in a different way, and that took us a lot of hours: some services require concatenating the CAs together, some services require concatenating the cert plus the CAs, and some services require special parameters because you need to tell them that there are intermediate CAs. So we made this change, and it's working pretty well.

Another enhancement we made was around observability and monitoring. Kolla-Ansible deploys Prometheus 1.8. We don't use a lot of Prometheus features, because essentially our Prometheus just forwards the data to the Workday platform monitoring, which is based on Wavefront. We added support so that, if you enable a flag, Prometheus sends the data via remote write to Wavefront. But we also wanted to take advantage of a new feature that exists in Prometheus 2: the OpenStack service discovery for hypervisors. How does that work? Originally, in the Kolla-Ansible code, when you deploy a new hypervisor you need to come back to your controllers to add the IP of that hypervisor to Prometheus, so that Prometheus can go to, for example, the node exporter on it and scrape it. If you use the OpenStack service discovery for hypervisors (there is one for hypervisors and one for VMs), then when I deploy a new compute node with Kolla-Ansible I don't need to come back to the controller; I don't need to touch the controller at all. That was an important requirement for us, because we want to avoid touching the controllers as much as possible, and upgrading to Prometheus 2 gives us this. We also made some changes so that you can now add custom tags as part of the inventory, and those tags get added automatically; for example, I can tag controllers with role equals control, or compute nodes with group equals compute, something like that. We made some other changes as well, and some of them, like the libvirt exporter I think, already exist in master now. We upgraded the OpenStack exporter, because we wanted the new version's support for fetching data from the placement API, and we added the BIRD exporter for the BGP routes.

Another interesting change we made was regarding Fluentd. Fluentd is already deployed by Kolla-Ansible; it's the centralized logging component that parses all the logs. In our change we added some rules to make Fluentd parse the access logs from HAProxy and Apache, and from that information Fluentd produces statistics that are scraped by Prometheus. That gives you statistics like the number of requests to each API, latency per API, number of requests per path, and the number of requests with errors, for example. It's a key feature for enhancing our monitoring, because with the size of the fleet we need to be very good at monitoring to catch any issues as soon as possible. One of the interesting features we also built is what we call singleton containers. In short, what is a singleton container?
Imagine, for example, you are deploying a cluster and you have three controllers. You only need one Prometheus running in your cluster. Usually people just pick controller zero, for example, for that. But the problem is, if you lose controller zero, you need to start Prometheus manually on another controller. So essentially what we did was take advantage of the keepalived that runs on those same three controllers and use keepalived notifications. Every time the keepalived master changes, keepalived calls the notification script, and that starts and stops any container you add to the list. That's a singleton container. So when you deploy a cluster using this change, you are going to see the containers that are tagged as singletons, like Prometheus, in the Created state on two of the controllers, and running on whichever controller is the keepalived master. And if you restart keepalived, or do something else that moves the master IP, the Prometheus container will stop there and start on another host. There were also changes to retries and some performance improvements in the playbooks as well.

And one of the, I think, nicest changes we made was support for Calico, complete support for Calico. Right now this support is based on our needs, which means we are building Calico using CentOS 8 Stream binaries, because that's what we use in Workday. We build the binaries and the containers using Kolla, and we deploy using Kolla-Ansible. We made some changes in the Neutron plugin for Calico as well: some small things like the MTU, which was previously hard-coded in Calico, and some clean-up in the code to support newer versions of OpenStack. We added some instrumentation to the DHCP service as well, because sometimes it was failing silently, and that helps. Another interesting change: in short, the way Calico works requires instances to reach the metadata API over plain HTTP, because it's essentially an iptables rule. So we deploy a container, I think it's an Nginx container, on each compute node, and that Nginx container takes the metadata requests and forwards them over HTTPS to the controllers. Most of the changes are in the Neutron plugin; Felix, which is one of the components of Calico, is essentially untouched.

So this is what we did at Workday. These changes are not upstream so far, but we are planning to upstream them, probably next month, because we are just finishing the first batch of cluster upgrades right now and the team is completely focused on that. To upstream, we need to update documentation and other things that were not required before. But if anyone has any questions or wants more details about these changes, about this implementation, you can come talk with us here, there is a microphone there, or you can grab us in private and we can discuss. I hope you enjoyed it, and thank you for your time.

Thanks a lot for the presentation. I've got a question for Jan concerning the bare metal provisioning component that you mentioned. Can you expand a little bit on what it actually does, how it works, and how it integrates with the other components that you have? So, our bare metal provisioning component is not something that's managed by the OpenStack team at all. It's a separate team that also manages infrastructure in Workday, and it's something that has been developed over many years.
It's a weird combination of services: some that run Cobbler, some that run Autodea, I believe, for some kind of infrastructure detection, some that run firmware updates. Ansible, Kickstart — it's not image-based. Oh yeah, Ansible and Kickstart, there are a lot of things. So in a weird way it's more or less like triggering a remote install on the machines, rather than the way you would imagine, you know, disk images or cloud-init, all those kinds of things. There are various reasons for that: they have multiple clients, not just us. Some of those clients are running bespoke SQL databases; some of them are running other specific pieces of infrastructure that can't live inside the cloud, or that are isolated for security reasons, and so on. So the best I can say is that it's extremely custom-tailored to Workday's environment. And there are hooks inside that call out to the DCIM data center management system and update the state of the machine, because that, again, is used by the people in the data centers to figure out what operations they need to do on the machines and so on. So again, this is separate teams in Workday coordinating through these kinds of things. There's a lot of complexity there. And yeah. I imagine. Thank you. Sorry for exposing you to that.

I think there was another question. Are we going to publish the slides? Probably — yeah, we can do that, that's not a problem. Though I don't think we've been cleared by Workday Legal yet, I believe. Yes. Okay. Any other questions?

They're not completely air-gapped; there is a very restrictive firewall between them. Specifically, we have various levels of security in Workday, and the most important and sacred thing for us is customer data. We don't even allow Workday engineers to access customer data, or machines that could contain customer data, directly without express permission given by customers. So in that kind of scenario there are multiple layers of firewalls and so on, managed again by a separate team, and multiple layers of intrusion detection and so on. In order to punch holes through this for services, you're usually subjected to security review and those kinds of things. That's one of the things on this slide: I think every instance gets an internally routable IP address. Our Workday network has routability between everything, so there's nothing in between them; the firewalls prevent access on some of the routes, but the instances have unique IP addresses and they're routable like that.

The weekly patches tend to be organisation-wide: every segment of the organisation has a slot inside the weekly patch where its maintenance happens; we have these things scheduled. The actual outage we subject the customer to tends to be small, but the entirety of the weekly patch is usually an organisationally coordinated thing. We're only a small section of that, but usually during the weekly patch, once our services say we're good to go, the rest of the Workday services tend to hammer our APIs. Let me just complement that answer. We are an infrastructure team, essentially. So when we boot the VM, the application starts, and the application has a health check that's implemented by the application team; we don't control that. And that health check is available.
Most of the time, 99.9% of the time, it's not our problem anymore, and this is where their problems start. As a team, our SLO is to ensure 99% create success, and once the VM is up and the health check is okay, we're done. Usually the issues start when they activate those services and find that the application has a bug or an issue, and that is more on the application team. We usually help them troubleshoot, but we are able to say that usually it's not a network issue or something outside the application. And the application teams have higher-level functionality to roll back to older versions and those kinds of things. That's also one of the reasons why capacity planning is important: so that there is capacity for scenarios where rollbacks or outages happen and so on. Any other questions? Thank you very much.