I'm a senior cloud engineer at PayPal. We're supposed to have Anand Palanisamy as well; he'll probably show up any moment, but I'll go ahead and start. In this talk we're going to share our experience building the largest cloud at PayPal, and probably the largest private cloud in the world. We'll cover the challenges we had to deal with in building this cloud, as well as some of the architecture and some of the tools we used to build and manage it. I'll start with an introduction to the scale of OpenStack at PayPal. Currently, 100% of our web and mid-tier runs on OpenStack, and most of the dev and QA environment is also on OpenStack. We have ten-plus availability zones running OpenStack, and that number is still growing. We have about 400,000 cores and about 82,000 VMs running on OpenStack, all of them KVM VMs. These numbers grow every day as we build more infrastructure. Here are some numbers for the largest PayPal cloud availability zone. In general, this cloud is about two and a half times larger than any other PayPal AZ we have, with 120,000 cores, 4,600 projects, and 4,300 users. We have about 800 images and 160 networks in that specific AZ. Some characteristics of this AZ: it is multi-tenant, with multiple VPCs and policy-based segregation. We have hybrid networking with Open vSwitch, with both bridged and overlay networking. We have multiple cells, and multiple VM flavor families with host-aggregate-based separation. As for the challenges we faced: we started building this environment about a couple of months before we separated from eBay.
So we had to build that environment quickly so that our users could start deploying applications, migrating VMs from eBay, et cetera. Here are some of the challenges we faced. The first one was the aggressive deadline — and Anand is just joining us now. Normally the deadline would have been reasonable, but because of the scale of the environment and the pressure we were under, we had to work to a much more aggressive timeline. For the first time, we had to use multiple cells, mostly for availability and scalability, again because of the scale of the environment. Another challenge was the VM migration from the eBay clouds: we had to migrate thousands of VMs from eBay, and it had to be done very quickly. Another challenge was all the firewall rules we have in place at PayPal, especially for this new AZ. We probably had to deal with at least twice as many firewall rules as usual, because we had to talk to legacy eBay sites as well as new PayPal sites. Another challenge was hardware. The hardware was fairly new and there was no time to test it well, so we had a lot of hardware-related issues. For example, provisioning the hypervisors took us a long time because of networking issues: when we PXE-booted the hypervisors, the second NIC, when it was initialized by the BIOS, was power-cycling the first NIC, so the PXE boot was interrupted. We had to address these issues in a timely fashion, talk to the vendors, find fixes, and so on. Later we also had storage controller lockups that we had to address quickly.
For the architecture of this cloud environment, I'll let Anand talk about it. Hello, everyone. Good evening. The architecture itself is very simple compared with other cloud providers: regions, cells, and availability zones. Within that we actually have many varieties of network architecture, all the way from ten-year-old designs to the current generation. But we didn't take any shortcuts in abstracting all of these differences away: whatever the users deal with through the cloud APIs, we abstracted it all out, so it's very simple for the end user. They don't even know whether we're running within a VRF, or running only firewalls to separate our security segments. Let me drill down into the network overview. Every rack we buy has top-of-rack switches running active/active or active/standby, and depending on the generation of the ToR switch, in some places we have extra capabilities enabled, and in some places it's just an access switch — nothing fancy. For the SDN itself, we took an approach where the same SDN controller supports both overlay and bridged networking. Wherever we have concerns about latency, we don't use overlay; we use only bridged. But the idea is that as we move along — once we're able to scale out the SDN controllers and work through the latency of the controller we're running today — we're going to move every production workload onto overlay.
But currently we run bridged for most of the mission-critical workloads, and overlay for non-mission-critical workloads, which are also production, just where latency is not as sensitive. As I mentioned, we have LACP and VRF in some places, and in other places we don't even have that kind of network switch. We simply used the network infrastructure we had built up over the years and installed OpenStack on top of it, and we leverage the existing firewalls and load balancers. Being in the payments industry, we always care about compliance and security; we can't get away from that. I have one more talk tomorrow at two o'clock with more details on how we separate out our VPCs, so if you're interested, come to that. Also, as Kalin mentioned, we adopted multiple cells. The reason is that we didn't want to build a giant availability zone with just one control plane and one RabbitMQ cluster, because of the scale issues we ran into. So we split it up: eight racks, around 800 hypervisors, go into each cell, so you have one small domain to manage and can handle the scale better. Today we have one of the largest AZs, running around 2,500 hypervisors. It includes dev, QA, production, non-production, and even some external workloads people deploy, like blog posts — but all of them run in the same cloud, the same availability zone. Each security segment is what we call a VPC.
A VPC here is very similar to what AWS has today, but the controller that manages it is the same across the board. Whether we run production or non-production doesn't matter; the security domain itself is controlled by firewalls and security rules, and the load balancers are also separated out. One interesting thing in all of this is storage. How are you going to run the same storage cluster for storing both compliance and non-compliance files, when that is critical for meeting PCI standards? One thing we haven't solved today is having one cluster where you just store everything. Today we have one cluster for compliance and one for non-compliance, and that's how we deploy. As we mature, we're looking at whether we can separate the network within the same cluster and dynamically move object files around so that the same disk never holds both sensitive and non-sensitive data — but we don't operate that way today. We also ran into some cell limitations. It was the first time we deployed cells, and we ran into a lot of synchronization issues between the cell controllers and the availability zone controllers, because an availability zone manages more than one cell. There is data synchronization between the cells and the availability zone, and when a message gets lost, they don't sync back up. We had to do a lot of manual tweaking to make sure the data at the availability zone level correctly comprises all three of the cells we have.
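The manual reconciliation described here — verifying that the availability-zone-level view comprises what all the cells actually hold — boils down to a set comparison between databases. The sketch below is illustrative only, not PayPal's actual tooling; the record shapes and names are assumptions.

```python
def reconcile(az_instances, cell_instances):
    """Compare the AZ-level (top) view against per-cell databases.

    az_instances: dict of instance uuid -> cell name, as the AZ level sees it.
    cell_instances: dict of cell name -> set of instance uuids in that cell's DB.
    Returns (missing_in_cells, missing_at_az):
      - uuids the AZ believes exist but no cell holds (a lost create/delete message)
      - uuids some cell holds that never synced up to the AZ level
    """
    known_at_az = set(az_instances)
    known_in_cells = set()
    for uuids in cell_instances.values():
        known_in_cells |= uuids
    missing_in_cells = known_at_az - known_in_cells
    missing_at_az = known_in_cells - known_at_az
    return missing_in_cells, missing_at_az
```

An operator can then decide, per orphaned record, whether to re-sync or delete — which is essentially the "manual tweaking" being described.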
And again, if you're expecting your control plane to run at 99.9% — that's the goal we wanted to hit — remember that all our other layers sit on top of the IaaS layer: PaaS, ops tools, and a bunch of other automation running on the IaaS cloud, like messaging clusters, and in some places CI itself. Around our whole QA environment we've built a lot of automation so that we can run our CI infrastructure at scale for the thousands of developers we have in the company. So our API availability is simply a given. You can't say, "we don't care if the API is only at 80%; as long as the data plane keeps running, it's fine." It's not about that. If you try to spin up 100 VMs, we expect at least 99 of them to come up. That was not the case when we started — we were around 70%. We're definitely in the high 90s today, but that's the result of a lot of things we put together over the last three, three and a half years of learning how to scale these services up and, at the same time, keep them always available. We got into multi-master replication, VIPs for failover persistence — whatever we could take from the infrastructure side. You could say, "I just don't need load balancers," but we're not there yet: we put the load balancers in, so that if one node goes down, the other takes over.
And if you're running active/standby, you make sure you're running behind some kind of GTM so it automatically fails over to another database server. Whatever best practices we've adopted for our PayPal applications, we adopted the same for running OpenStack — nothing fancy. We also run Galera clusters for our global Keystone. We have multiple availability zones — Kalin mentioned around ten — and we don't want multiple Keystone endpoints. As a cloud user, you come in and talk to just one Keystone endpoint, and it gives you back the catalog of endpoints, so you go to one particular availability zone, look up only that catalog, and refer to that AZ's Nova, or Neutron, or load balancer as a service, or DNS, or whatever. We put together many availability zones, but from the end user's perspective they just hit one URL. The problem is that without something like a GTM, you might authenticate somewhere in the East region while your token ends up somewhere in the West. So we put in a GTM, a global traffic manager: if you're talking to one particular availability zone, we make sure the Keystone endpoint sends you back to that same availability zone, so you don't go somewhere else to authenticate. But we synchronize tokens across the board, so even if that particular availability zone goes down, the token is already replicated, because the Galera cluster replicates across all regions.
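The single-Keystone-endpoint flow described here comes down to filtering the returned service catalog by region, so a client only talks to its own availability zone's endpoints. A minimal sketch against a Keystone-v3-style catalog structure (the field names follow the v3 token catalog; the helper itself is hypothetical):

```python
def endpoints_for_region(catalog, region, interface="public"):
    """Pick one URL per service type for the caller's own region/AZ.

    catalog: a Keystone-v3-style service catalog — a list of services,
    each carrying an 'endpoints' list with 'region', 'interface', 'url'.
    Returns a dict of service type -> endpoint URL for that region.
    """
    result = {}
    for service in catalog:
        for ep in service.get("endpoints", []):
            if ep.get("region") == region and ep.get("interface") == interface:
                result[service["type"]] = ep["url"]
    return result
```

With this, a user authenticates once against the global Keystone, then resolves "compute", "network", and so on only within the AZ the GTM routed them to.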
The Galera cluster is used only for the global Keystone today. For RabbitMQ we run mirrored queues, and Neutron DHCP uses Corosync and Pacemaker for HA, so if the standby goes down, or the active goes down, the other automatically takes over. For the endpoints we talked about, we run multiple controllers behind a load balancer VIP, so even if one controller goes down, the others continue to work as-is, and we have ECV health checks for them as well. And migration? Yeah, let's get to it. The migration was a recent pain we went through. We had three different availability zones, and this migration was done as part of the eBay and PayPal company split. Both eBay and PayPal developers shared the same VPCs; dev and QA were shared by both eBay and PayPal teams. Now we had to separate out around 8,000-plus VMs and close to two petabytes of data. That's volume data — there was also a bunch of local data on each and every VM we had to migrate. We literally had only six or seven months; the entire company split took around seven to eight months from the announcement. We did talk to multiple vendors: we want to migrate 8,000-plus VMs and petabytes of data. They said, yes, we've done a lot of projects like this in the past; we've migrated thousands of VMs. We asked how they did it. "We have a tool. You have a source cloud; go and find your flavors, and we create them in the target." And how did you migrate the data? "We worked with the developers, and they copied the data."
They'd put 10 or 15 people into a support team who would go and help whenever a file system didn't come up, for whatever reason — and even then, that's what it had taken them to migrate 2,000 VMs. And that's only creating the VMs and supporting them; they didn't touch our volume data or the object storage. We effectively got only two months. Because first of all, we didn't even have the target cloud; we had to build it within six months. That's the largest AZ, around 2,500 hypervisors, and we didn't even have a network pod for it — a completely new network circuit. That alone took us three months to build, because it has a lot of complexity in it. Because of the way we were going to run multiple VPCs there, we wanted to make sure we were buying network gear that supports VRF and all of that, and we were also trying out new generations of network gear. So it took three months to build. The split was happening sometime in June or July, and we completed the build-out by the end of April — correct me if I'm wrong; end of April. So we got eight weeks, and no tools, even though I'd been asking for one for a year and a half. So we wrote a tool called Flyway. What it does: as the provider, you have access to the hypervisors, so you go and get all your source cloud flavors, create the VMs in the target, then power the source VMs off and copy the disk files from the source cloud to the target cloud. The problem was that we had three different availability zones in three different regions, literally. You have to copy all the disk files from three geographically distributed locations, and copy them to one location.
We ran into a bunch of issues: a lot of network timeouts, rsync itself failing. And the disk files themselves vary — sometimes around 40 or 50 gigs, sometimes people have 500 gigs. How are you going to copy disk files like that? The team did a fantastic job of creating that tool, and the migration was completed within eight to ten weeks. That's what we got it done in. Interestingly, having migrated that many VMs, when we brought them up we had a lot of other issues as well. People might have used their own passwords, or the passwords themselves now differed between two different companies, and we had to make sure people could log back in. How do you securely manage those keys and so on? It gets very, very tricky. It took us more time after the split supporting users with logging in — if they can't log in, how do you support them across thousands and thousands of VMs? And people would say: OK, the VM comes up, but my application doesn't. These aren't the standard applications we run on our production site, where you have a standard pattern for bringing applications back; they might have been running some sample application, a test application, whatever, and you have no clue how to bring it back. A lot of hand-holding. So building the tool is one thing, and then there's migrating this data.
And when you bring a VM up, it might have some post-install scripts that don't run anymore, because they connect to some IP address pointing at eBay that you no longer have access to; or its LDAP might point to a different network you no longer have access to. So along the way we also built tooling for that: how to securely log into those VMs, identify which files need to be touched, and go modify them so that people can log in and use them again. So this is our very high-level architecture of how Flyway works. It runs in one particular availability zone, talks to multiple clouds, and posts messages. The file copy itself can take hours and hours; you don't want to block waiting for that message to come back before you can update the status for a particular tenant whose hundreds of VMs are being migrated. We don't want processes just sitting there for hours to update some status. So we followed exactly the same OpenStack architecture — completely loosely coupled. When a particular agent is done, it posts a message back saying, "I'm done with my file copy; go and do the next step of processing in the workflow." The tool has been open sourced and you can try it out; it should be on PayPal's GitHub. So, Kalin — this is one of the tools we wrote just for the migration.
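The loosely coupled, message-driven pattern just described — an agent does the hours-long copy, then posts a completion message rather than making anyone block on it — can be illustrated in-process with a queue. This is a sketch under assumed message shapes, not Flyway's real wire format (the real system uses a message broker between availability zones).

```python
import queue


def agent(tasks, results, copy=lambda task: True):
    """A Flyway-style agent loop, simplified to run in-process.

    Pull a copy task, perform the long-running disk copy, then post a
    completion message back so the coordinator can update tenant status
    asynchronously. A `None` task is a shutdown sentinel.
    """
    while True:
        task = tasks.get()
        if task is None:
            break
        ok = copy(task)
        results.put({"vm": task["vm"], "status": "done" if ok else "failed"})
```

The coordinator only ever consumes from `results`, so it never waits on a multi-hour transfer — the same decoupling OpenStack itself gets from RPC over a broker.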
To run this many availability zones, we have a bunch of other tools, which Kalin is going to talk about. Yes — basically, to build this environment we had to use a lot of tools. Some of them were open source; some we had to build in-house, because they had to be customized for this environment. As mentioned, Flyway is one of the tools we built in-house and have open sourced. I'll start with some of the open source tools. For provisioning we use Cobbler — in general it's Cobbler, but we have our own tools on top of it, which I'll talk about later. For configuration management we use Puppet and Salt: Salt mostly for orchestration, Puppet for config management of the systems as well as for deploying the OpenStack code. For monitoring we use Zabbix; we've been using it for years and we're still happy with it. For graphing we use Graphite. As for the in-house tools we built: for cloud health and metrics we use StackWatch and StackMetrics; for a cloud view, Cloud Info; for capacity reclamation, Cloud Minion; for server remediation and provisioning, Reparo — that's the tool from which we call Cobbler to provision the bare metals. We also use CMS, which is also in-house; it's like a configuration management database in which we store a lot of information — assets, network information, applications, et cetera. For cloud health and metrics: StackWatch is basically continuous live testing of the cloud. Every 30 minutes we create 10 VMs on the cloud, and if even one VM fails, we receive alerts that something is going on. We also track all this data and use StackMetrics for graphing, basically to see all the issues on a timeline.
All of this is deposited as time series into Graphite, and we use Grafana to visualize it. Cloud Info creates aggregated views of all cloud resources for a particular AZ. It aggregates data from multiple OpenStack databases and multiple database tables, caches that information for quicker viewing, and also provides metrics views, mostly capacity-related. For server remediation and provisioning we use Reparo, which serves several purposes. One of them is automated onboarding of hardware. It does continuous health monitoring of the bare metals. We can also use it for patching: we can patch the bare metals as well as the VMs with Reparo. It helps us repair the bare metals, and it flags servers for human interaction if there's a problem with a specific one. Recently we also started using it for provisioning: Reparo calls Cobbler, which in turn, through its API, does a bunch of power resets, gets information about the server, et cetera. For the configuration database we use CMS. CMS is also open source; it's a project that was started at eBay, and we're still using it. It's basically a database of all assets, physical and virtual; we store, as I mentioned, network information, application information — everything that can be stored about the infrastructure. For capacity reclamation we use Cloud Minion, an in-house tool that has also been open sourced. It identifies unused VMs on the cloud by examining their network traffic. If it finds a VM that's unused, it flags it and sets an expiration date, then notifies the user, who can then take action. If the user doesn't take action, the VM is shut down and later deleted. The user is also provided with a web GUI to manage the expiration dates, and ultimately the user decides whether the VM is in use or not. It also generates capacity reclamation reports.
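A StackWatch-style probe cycle — boot N canary VMs, count failures, emit a metric — might be sketched as follows. The client interface here is an assumption (the real tool drives the OpenStack APIs and feeds Zabbix alerts), but the emitted line follows Graphite's plaintext protocol, "path value timestamp".

```python
import time


def canary_run(client, count=10, now=None):
    """One probe cycle: create `count` canary VMs, count failures, and
    build a Graphite plaintext metric line.

    `client` is any object with create_vm(name) -> bool and
    delete_vm(name); in the real tool these would be Nova API calls.
    """
    failures = 0
    for i in range(count):
        name = "canary-%d" % i
        try:
            if not client.create_vm(name):
                failures += 1
        except Exception:
            failures += 1
        finally:
            client.delete_vm(name)  # always clean up the canary
    ts = int(time.time()) if now is None else now
    metric = "cloud.stackwatch.vm_create_failures %d %d" % (failures, ts)
    return failures, metric
```

Run every 30 minutes, this yields exactly the time series the talk describes: any non-zero failure count triggers an alert, and the metric line goes to Graphite for the timeline view in Grafana.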
We also have ongoing projects. Now that the environment is built, we have more time for reliability and for doing more on top of it. We've started working on several different projects, and I'll let Anand speak. Yeah — over the last three and a half years, if you look at the team itself, it's a very, very thin team, and we've built more than ten availability zones, most of them mission-critical availability zones taking payments today. And it's not just the provisioning we talked about: there's the network reachability, and not taking away the security zoning we have in place. Within this whole boundary, we didn't build everything green-field and then move everything from the existing infrastructure into the new infrastructure. What we did was take the existing infrastructure as it is and use it as it is, changing only the provisioning system over to OpenStack. That's why we were able to move the needle — building the cloud and migrating at the same time. We also didn't wait for every application to be migrated or cloud-ready. Even today, if a couple of VMs go down, some applications still scream because their capacity went down — but in the cloud world, no one should care about two VMs. Most of the applications have been migrated by now, but three, three and a half years back, if you say "build a cloud and people will migrate to it," it's going to take another five years, and we didn't want that. And as part of this, even though we say we're running one of the world's largest private clouds, taking mission-critical products and workloads, we are not on Kilo everywhere.
We're running Havana everywhere, but we've already migrated one availability zone to Kilo, and we're going to migrate all the other availability zones to Kilo as well. One important thing here: you can't just take down your API service, because a lot of automation is built on top of it, so you have to be really, really careful when taking downtime for this. If the API isn't available when we need to add capacity for some reason during a peak, that's a problem — you can't just take it out. You have to find your window within the mission-critical operations cycle and make sure you complete the upgrade inside it. So every upgrade is a huge project, a huge effort. One good thing is that after Kilo it gets much better, because live upgrade will be supported: we won't take down the whole control plane, and you upgrade whenever you're ready. We also won't do every service in one window — "I'm ready on Neutron, I'm going to upgrade only my part," and when the Neutron team is ready they do theirs; each team does its own service separately. It's a huge project when you're talking about ten different deployments, ten different availability zones, with cells as well; it's very, very tricky to do without impacting other layers. That work is ongoing and continues to grow. We also built a lot of automation for the upgrade itself — a complete CI/CD pipeline, running all the Tempest test cases and a bunch of other tests, making sure nothing is breaking. And our users do the same testing before we wire on the new APIs, to make sure they're not going to create any other problems.
We also have masterless Puppet. We ran into Puppet master scale issues, so we're moving away from the master; Kalin is running that project, and it's going really, really well. And controlled Puppet runs and deployments: today, if you change something in a Puppet module, we can't just say "I checked it in, roll it out everywhere." If it deploys everywhere at random and there's an issue, you can't just roll back. So we don't automatically deploy configuration changes. We make sure we deploy to 5% or 10% first, verify the new configuration is good, and then roll it out everywhere in a phased manner — a completely controlled deployment. That's where we use Salt, or maybe Ansible, so we can orchestrate the deployment over a period of time, rather than "I checked it in, it's automatically going to be picked up." Then you could very well ask: why are you running Puppet at all? The whole point of Puppet is that you let it apply continuously so all your configurations stay up to date — but that brought in a lot of other issues for us. We wanted controlled deployment: once we're ready, with the CI pipeline confirming that our code and our Puppet modules go together and there's no difference between the two, then we automatically roll it out — but not at the expense of taking down production APIs. We're also building "Infrastructure AZ." What does that mean? To build a new availability zone today takes us anywhere between four and six weeks, and we want to bring that down to a week.
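The 5-10% canary-then-fleet rollout just described can be sketched as a small orchestration routine; `apply_fn` and `healthy_fn` stand in for whatever Salt or Ansible actually invokes on each host. Illustrative only, not the production orchestration code.

```python
def staged_rollout(hosts, apply_fn, healthy_fn, canary_pct=5):
    """Apply a new Puppet config to a canary slice first; only continue
    to the rest of the fleet if every canary host stays healthy.

    Returns the list of hosts that received the change — just the
    canaries if verification failed, or all hosts on success.
    """
    n = max(1, len(hosts) * canary_pct // 100)
    canary, rest = hosts[:n], hosts[n:]
    for h in canary:
        apply_fn(h)
    if not all(healthy_fn(h) for h in canary):
        return canary  # stop: only the canary slice got the change
    for h in rest:
        apply_fn(h)
    return hosts
```

The key property is the one the talk emphasizes: a bad change is contained to the canary slice and can be rolled back there, instead of having landed on every hypervisor at once.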
If you're building a new availability zone completely from scratch, once the network pod is ready, you want to bring in your new set of racks and have the control plane running across those racks, so you can build your AZs really, really fast — within a week. We're putting a lot of effort into that: there's a remote AZ that sits there and builds another AZ using the information you have there. It's also going to be used by the containers project, so that we test our own control plane before we go and use Docker for our main production systems. It's a small cluster today, and we want to scale it out as well. Today, if your Nova, for example, is running behind 10 VMs, you'd want it to scale up automatically as load increases. Today we have to do that manually, but we want to auto-flex it. We've made this AZ a project, and it's going to solve most of the problems we have today. And homegrown LBaaS: we built our own load balancer as a service, which runs in both eBay and PayPal today, with internally custom-built, completely RESTful APIs managing both production systems. But since we want to be in the community, and we want to enable our vendors to write new capabilities as drivers and plugins, we're moving away from our own implementation — even though we put three to four years of effort into it and it's completely mature — to the community LBaaS. We're running our dev and QA VPCs with the community LBaaS, while production still runs our own version; within the next year to year and a half we're going to migrate completely.
And config management, this is a very important piece. Since we are not running Puppet automatically, it is going to introduce some configuration drift for sure, because when you deployed your code, one of your hypervisors might have been down. When it comes back, it will have a completely different code base, and its configuration will be different too. So how are we going to manage that? Turning off the Puppet master's auto-deployment is completely fine, but it introduces this other problem. So we are building a config management tool. It gives you complete visibility into your infrastructure: how many hypervisors you have, their current configuration, including the BIOS version, the OS version, and the kernel. It gives you the whole view of your cloud deployment, which is very useful for the operations teams; they can go and take action on top of it. Today you have to go and collect a lot of statistics just to know what's going on in your infrastructure, so this project gives us more visibility into what is actually in there. Also, the masterless Puppet setup has a control file. This control file is used by every client, the hypervisors, and it determines when to deploy, which version of the OpenStack code, and which version of the configuration; it carries that information in the file. As for the infrastructure AZ we already talked about, what will it contain? It is itself another OpenStack environment, an OpenStack deployment running OpenStack on OpenStack.
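The drift-detection idea above, comparing each hypervisor's reported facts against the fleet's desired state, can be sketched in a few lines. The fact names and data shapes here are assumptions for illustration, not the schema of PayPal's internal tool:

```python
from dataclasses import dataclass

@dataclass
class HostFacts:
    """Facts reported by one hypervisor (a tiny assumed subset)."""
    hostname: str
    bios_version: str
    kernel: str
    openstack_version: str

def find_drift(all_facts, desired):
    """Return {hostname: {fact: actual_value}} for every host whose
    reported facts differ from the desired fleet-wide state."""
    drifted = {}
    for facts in all_facts:
        diffs = {name: getattr(facts, name)
                 for name, expected in desired.items()
                 if getattr(facts, name) != expected}
        if diffs:
            drifted[facts.hostname] = diffs
    return drifted
```

An operations team could run a report like this after every controlled rollout to catch the hypervisor that was down during the deployment and came back on the old code.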
And the cloud control plane one-touch capacity add. This one is very, very key for us today. The holidays are around the corner, and we want to add 30% to the capacity we are running in production. Of course, you are going to be buying more and more racks every year, so how much time does it take to get a rack into your data center, and how soon can you bring it live? There is a whole lot that has to happen in that workflow and pipeline. We have a bunch of tools already built, but we are still not at the level we want to be. It takes a week today; basically, we want to do it within one or two days of getting the rack into the data center. That way you maximize your investment, because your hardware lifecycle is three to four years, and every day you lose, you are losing money. So you definitely want to bring it down to a day or two, from the moment the rack reaches the dock to the floor: how many tests you can run, how many configurations you have for your switches and hypervisors, how soon you can deploy all the different software and tools you want running on the hypervisors, and how fast you can deploy, verify, and add it all to your existing capacity. There are a bunch of things that happen all the way from the dock to the floor, so there is another project, which has been going on for a long time, that is optimizing this whole workflow and introducing more and more efficiency into it. Okay, now time for questions.
The question was about team size: close to 30. We started with one engineer; I think I'll talk more about that whole journey later. But we are around 20 to 22 engineers, plus an operations team of eight, a very, very slim team. That's one reason why, if you look at it, we don't have a lot of contributions to the community yet; but once the live upgrade and the entire CI pipeline are set up, and you are in live-upgrade mode, you free yourself up to be in the community. The next question was why we don't use the community Puppet modules. That one is actually a very interesting piece. We started with them, but we deviated for a reason: our deployment topology is different from the vanilla one. At the same time, we wanted to keep our passwords and similar secrets protected, and we wanted more control over the Puppet modules. Another reason is that we wanted each component as a separate module, which we recently refactored completely. Earlier, whenever you deployed Neutron, you had to pull the entire tree and deploy that, and you didn't know what other people had changed; you would pull in unknown changes and deploy them, which introduced some issues for us in the past. We refactored this recently as part of the Kilo upgrade. If we have time, we are planning to open source it as well, but time is really, really short for us while managing 10-plus availability zones; we'll plan for it for sure. The next question was about orchestration. Yes, currently we use Salt for orchestration; Salt calls Puppet to perform the controlled Puppet runs.
But we're switching away from that model. We are creating a YAML control file, which is pulled by each client, for example each hypervisor, and that control file tells the client whether to run Puppet or not. It also carries a timestamp, so you can ensure everything is up to date. Salt by itself has issues, especially at the scale we are currently running. We like Salt and we use it, but for config management, to catch drift and ensure everything runs on time, we had to switch to masterless Puppet first of all. For example, when we upgrade to Kilo, we have to quickly upgrade all the hypervisors we have, and to control that, we plan not to use Salt to trigger the runs, but instead to have this control file pulled by each client. This is what we're currently working on; it will probably be in working condition very soon. If you want, we can take the details offline. One more question: how many dollars did we save? I talked about this during the last summit: when we first deployed, we found 40% of the VMs were unused. That's in dev, which is self-service. Production VPCs are different: we have a capacity team that manages our site capacity and so on. Anything else? One more: what about storage? We have both volumes and object storage. For volumes, SolidFire is the backend today, and we use Cinder; that's one of our largest use cases, across dev, QA, and production, which we rolled out recently.
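The control-file decision on each hypervisor could look something like the sketch below. The field names are assumptions, and the file is parsed from JSON here purely to keep the sketch dependency-free; the talk describes a YAML file:

```python
import json

def should_run_puppet(control_doc, local_state):
    """Decide, on one hypervisor, whether to trigger a masterless
    puppet run, based on a centrally published control file that each
    client pulls (JSON here for a dependency-free sketch)."""
    control = json.loads(control_doc)
    if not control.get("deploy_enabled", False):
        return False  # rollout is paused fleet-wide
    # Run only when the published code or config version differs from
    # what this host last applied.
    return (control["openstack_version"] != local_state.get("openstack_version")
            or control["config_version"] != local_state.get("config_version"))
```

Because each client pulls the file and compares versions, a hypervisor that was down during a rollout converges on its own when it comes back, without Salt having to re-target it.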
And for Swift: PayPal doesn't have a lot of use cases for object storage; we store tens of thousands of images. But we do have significant usage that is increasing, and we are going to increase our Swift cluster size as well, so that's already on the way. Anything else? Okay. Thank you.