Thank you. Thank you everyone for coming. My name is Arun. I am a senior engineer on Yahoo's OpenStack team, from the San Francisco Bay Area. My talk is on things we have learned from scaling Ironic at Yahoo. The talk has four parts. The first part is our cluster architecture and background. The second is about Ironic and the Ironic conductor. Third, we'll discuss Neutron and Neutron DHCP agent issues. And fourth, we'll discuss the importance of doing a density test. Finally, there will be time for questions. Also, please feel free to ask questions in the middle.

So Yahoo has hundreds of thousands of servers in its data centers. These servers serve everything from Yahoo's front page to Yahoo Finance and Yahoo Mail, and they need to be managed by a coherent system. So we decided to use OpenStack Ironic to manage all the compute resources at Yahoo. This talk is about the things we learned from scaling Ironic at Yahoo.

Let's dive into our cluster architecture. Hopefully it's visible there. We have an Ironic cluster in every major Yahoo data center. In the middle we have three API boxes. These API boxes run most of our OpenStack API services: Keystone, Glance, Nova, the Ironic API, all the APIs. Apart from the APIs, these servers also run the Nova scheduler and nova-compute. The Nova scheduler and nova-compute sit behind a leader election script, so we have some kind of HA going on there. Users access the APIs through a VIP and ATS (Apache Traffic Server). The APIs are connected to a database; there are two databases in master and slave mode. We also have two message queues.

On the right side we have the Ironic conductors. We have two machines running the Ironic conductor. These basically create the PXE configuration, everything needed to boot an OS on a node. The Ironic conductors are also connected to our out-of-band network; this is how we turn machines on and off over IPMI. The Ironic conductor generates the PXE configuration, the OS kernel and ramdisk, and all those things, and writes them into an NFS share. This NFS share is mounted read-write on each and every conductor. The same NFS share is also mounted read-only on a box we call ITS, the Ironic transport service. There is no Ironic transport service upstream; this is a Yahoo-specific thing. Basically, the Ironic transport service runs DHCP, TFTP and Apache to serve the operating system files. Since DHCP runs there, we also run the Neutron DHCP agent there to receive DHCP notifications from Neutron. We also use iPXE: whenever a node boots, the iPXE binary is sent over TFTP and everything from there on is HTTP.

Before Ironic, Yahoo had a mix of Grizzly-era bare metal and a legacy imaging system. We had to move all these machines, which were actually serving production traffic, powering yahoo.com and Yahoo Mail, to the new system. Yahoo has an inventory database called OpsDB. This DB has information about each and every compute resource at Yahoo. Every OpsDB entry has the host name and the MAC addresses of the machine; it even has information about which rack and which data center the machine is in, and which switch the server is connected to. So to import nodes, we leveraged our internal OpsDB and created the Ironic node using the information from OpsDB. Once the Ironic node is created, we initially set the driver to fake_pxe.
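To make that enrollment flow concrete, here is a minimal sketch of what creating such a node could look like with python-ironicclient and python-neutronclient. This is not our actual import tooling; the `opsdb_entry` record, the credentials and endpoints, and the `'PROD-NET-UUID'` network ID are all placeholders for whatever the real inventory lookup and cloud configuration provide.

```python
from ironicclient import client as ironic_client
from neutronclient.v2_0 import client as neutron_client

# Hypothetical record pulled from the OpsDB inventory for one production host.
opsdb_entry = {
    'hostname': 'web1234.example.yahoo.com',
    'mac': '00:de:ad:be:ef:01',
    'ip': '10.20.30.40',
    'bmc_ip': '10.99.30.40',
}

ironic = ironic_client.get_client(
    1, os_username='admin', os_password='secret',
    os_tenant_name='admin', os_auth_url='http://keystone:5000/v2.0')
neutron = neutron_client.Client(
    username='admin', password='secret',
    tenant_name='admin', auth_url='http://keystone:5000/v2.0')

# Enroll the node with the fake_pxe driver first, so Ironic never actually
# touches power or boot on a box that is already serving production traffic.
node = ironic.node.create(
    driver='fake_pxe',
    driver_info={'ipmi_address': opsdb_entry['bmc_ip']},
    extra={'opsdb_hostname': opsdb_entry['hostname']})

# Register the physical NIC with Ironic...
ironic.port.create(node_uuid=node.uuid, address=opsdb_entry['mac'])

# ...and create a Neutron port pinned to the node's existing MAC and IP, so
# the address the box already uses in production is reflected in Neutron.
neutron.create_port({'port': {
    'network_id': 'PROD-NET-UUID',
    'mac_address': opsdb_entry['mac'],
    'fixed_ips': [{'ip_address': opsdb_entry['ip']}],
}})
```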
We do not use the pxe_ipmitool driver right away; we start with fake_pxe first. We also know the existing IP address and the MAC addresses of the machine, so we can create a Neutron port as well. Then we do an operation we call fake boot. As the name implies, it creates a Nova instance and associates that instance with the Ironic node. A Nova instance has a UUID, and to associate it with Ironic, that instance UUID shows up on the Ironic node. This is similar to the new adoption state in Ironic: Ironic now supports adoption, so you can take a node from enroll to manageable and then adopt it, and without ever touching the node you can turn it into the active state. A similar thing happens here. This is because these machines are serving traffic; we do not want to reboot them, and we do not want to reimage them, because that would affect production. Once everything is successful, we clean up all the metadata about images. Once the instance becomes active, we switch to the pxe_ipmitool driver.

So say a customer belongs to Yahoo Mail and we have fake-booted a bunch of Yahoo Mail nodes into Ironic. When they use OpenStack, they access it through the Nova API; customers do not have direct access to the Ironic API, since Ironic is an admin-only API. So when the customer does a nova list with the Mail tenant, they will see the newly imported Ironic nodes running. They can now do rebuilds, deletes, whatever they want with the nodes. That is how we imported existing nodes that were running in production into Ironic.

Let's talk about the Ironic service itself. We have the Ironic API running behind Apache: the three API servers we saw earlier all run the Ironic API behind Apache. We also have two Ironic conductors. We initially started with just two conductors on two separate machines. These machines have around 24 gigs of RAM and 24 CPU cores. So what could possibly go wrong? Well, something did. When we imported around 10,000 nodes, as we were ramping up, adding new nodes into Ironic and booting them, Ironic boots started to fail. We also saw that the Ironic conductor had high CPU usage: the ironic-conductor process was always at 90 to 100% CPU in top. And our service engineers complained that it took very long for the Ironic API to respond. Obviously the Ironic conductor was busy, so any conductor operation, ironic node-set-power-state, node-set-maintenance, all of these, took a long time. So they were complaining.

The first thing to address was the high CPU usage: we wanted to know why the Ironic conductor was consuming so much CPU. Enter sync_power_state. We can also call it the DoS API for Ironic: if you want to DoS the Ironic API, trigger sync_power_state. What sync_power_state does is, if you have the pxe_ipmitool driver, it uses ipmitool to get the power status of each Ironic node. The Ironic API also has a power state field; this field holds the power state the node is supposed to be in. A customer can do ironic node-set-power-state on, and the power state will be power on. sync_power_state is a periodic task: it runs once in a while, checks the current status of the node, and syncs the two. The sync is from the DB to the actual node.
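To show why this periodic task gets expensive, here is a rough sketch of the shape of such a DB-to-node power sync. This is not Ironic's actual implementation, just an illustration: each node costs at least one fork of ipmitool plus a BMC round trip, so at 10,000 nodes a single pass means well over 10,000 forks.

```python
import subprocess

def ipmi_power_status(node):
    # Every call forks an external ipmitool process and waits on the BMC;
    # a healthy BMC answers quickly, a leaky one can take a very long time.
    out = subprocess.check_output([
        'ipmitool', '-I', 'lanplus', '-H', node['bmc_ip'],
        '-U', node['ipmi_user'], '-P', node['ipmi_password'],
        'power', 'status'])
    return 'on' if b'is on' in out else 'off'

def sync_power_states(nodes):
    """One pass of a DB-to-node power sync over every node in the cluster."""
    for node in nodes:  # 10,000+ iterations per pass at our scale
        actual = ipmi_power_status(node)
        desired = node['db_power_state']  # what the database says it should be
        if actual != desired:
            # The database wins: push the recorded state back onto the node,
            # e.g. power a machine back on if it was turned off out of band.
            subprocess.check_call([
                'ipmitool', '-I', 'lanplus', '-H', node['bmc_ip'],
                '-U', node['ipmi_user'], '-P', node['ipmi_password'],
                'power', desired])
```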
So for example, if the DB says the machine should be powered on and somebody powered it off outside of Ironic, sync_power_state will come along and power the machine on again. It is also a good periodic task to have, because if your IPMI BMC fails, sync_power_state will eventually discover that this particular machine is not reachable over IPMI. So we cannot eliminate sync_power_state, because then we would not know the status of our BMCs and how things are running in the data center.

So why did sync_power_state take so long? Initially we had the interval at around an hour, so every hour this periodic task would run and sync the power state from the DB to the node. At 10,000 nodes in a cluster, and this was the early days, almost a year ago, the conductor was busy doing nothing but ipmitool commands: forking ipmitool and checking power statuses. Obviously we could make the periodic task run less often, so we changed it to run roughly once every 24 hours.

We also noticed that in a data center, if a server runs for a long time, the BMC can fail. The firmware in those BMCs leaks memory, so when you run an ipmitool command after 2,000 days of uptime, the BMC has leaked memory and cannot respond, and it takes longer than expected. Whenever we ran the ipmitool command, instead of immediately returning the power status, the BMC would return the status very late. We cannot do much about that other than resetting the BMC or working with the hardware vendors to fix it. And it is a hard problem to reproduce, because BMC failures do not happen often; it takes a long time to reproduce them. And I do not really know why we pay $15,000 for a machine that fails like this.

The second approach was to increase the number of conductors. As I said, we started with two conductors. The conductors use a data structure called a hash ring: the nodes are divided between the conductors, so with 10,000 nodes, maybe 5,000 are handled by one conductor and the other 5,000 by the other, which is obviously far too few conductors. So we wanted to increase the number of Ironic conductors. One solution is to run multiple conductors on the same host. This is tricky because of how the Ironic conductor is designed: whenever it comes up, it automatically fetches the host name of the machine and uses it as its identity. As I said, we only had two servers, and each had 24 CPU cores, so 48 cores total, and a Python process is normally bound to one CPU; we weren't using the CPUs much. So we wanted to spawn multiple conductors on the same host, and we added a small patch, much like an API worker patch, which spawns a configurable number of conductor processes.

The problem with running multiple conductors on the same host is that there could be race conditions. There is another periodic task called sync_local_state in the Ironic conductor. What sync_local_state does is, if one conductor dies, it checks whether it needs to take over the nodes that were managed by the conductor that died. That is a good thing, right? If one of the servers is down, the other conductors need to take over and manage those nodes.
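Our patch is Yahoo-internal, but the idea is simple enough to sketch: give each conductor worker on the host its own identity instead of the shared hostname, so every process registers separately and claims its own slice of the hash ring. `run_conductor` below is a hypothetical stand-in for starting the ironic-conductor service with an overridden host value; treat this as an illustration of the idea, not the real patch.

```python
import multiprocessing
import socket
import time

NUM_WORKERS = 40  # we ended up running roughly 40 conductor processes per host

def run_conductor(host_identity):
    # Stand-in for launching ironic-conductor with its host identity overridden
    # to host_identity rather than defaulting to the machine's hostname.
    print('conductor %s starting' % host_identity)
    while True:           # a real conductor would run its RPC service here
        time.sleep(60)

def main():
    base = socket.gethostname()
    workers = []
    for i in range(NUM_WORKERS):
        # Unique identity per worker: each one registers as its own conductor
        # and therefore owns a distinct portion of the node hash ring.
        p = multiprocessing.Process(
            target=run_conductor, args=('%s-%02d' % (base, i),))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()

if __name__ == '__main__':
    main()
```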
But there could be race conditions if you are running multiple conductors on the same host, and we have not tested that, so we disabled sync_local_state. Instead we just make sure the Ironic conductors are up and running all the time, and we have better monitoring on them. We are now running around 40 conductors per host, so we have about 80 conductors in one cluster managing things. And obviously our SE team, the site operations team, was happy because they get results faster from the Ironic API.

Second, Neutron. Let me first give some background on our Neutron setup. Again, we had three API servers, all running the Neutron API with 24 API and RPC workers. We have four Neutron DHCP agents; these are the ones that actually serve DHCP. All Neutron subnets and networks are managed by all the agents, which means a notification for any new network on the Neutron side goes to all four agents, so we have some kind of HA. By default Neutron ships with a driver for dnsmasq; we replaced that and wrote our own driver for ISC dhcpd. So that is our Neutron setup.

To talk about the details of scaling Neutron, I need to talk about sync_state, sorry, sync_state rather than sync_local_state, and the issues with it. While a Neutron DHCP agent is restarting, or while it is down, a new network could be added to Neutron. Somebody could do a neutron subnet-create while your agent is down, and the agent will never receive that notification. So when it comes back up, it needs to find out that this new network was added to Neutron and take the corresponding action on the DHCP side. For example, if a subnet was added, the agent needs to create that subnet entry in dhcpd.conf when it comes back up. The agent basically talks to dhcpd to do things. So what sync_state does is: whenever the Neutron DHCP agent is restarted, it gets all the network info, goes through every network, every subnet, every port, and recreates everything it needs: it writes to dhcpd.conf and writes to the lease file. This happens every time you restart the Neutron DHCP agent, which means every time you do a new deployment and restart the agent, this has to happen.

We had two drivers to interact with dhcpd itself: the Neutron DHCP agent talked to dhcpd either through a driver based on omshell, or through a driver based on pypureomapi. Let me talk about the omshell driver first. omshell is a shell command. You run omshell and connect to your DHCP server, which is running on localhost on port 7911. It also has an HMAC-based authentication key, so that other users cannot log in to the DHCP server and do bad things. You can search your lease file by MAC address, delete MAC addresses, create new leases; all of that can be done through omshell and OMAPI. omshell talks to the OMAPI endpoint on the DHCP server side. As you can see, this approach is basically spawning a shell, subprocess.Popen plus communicate(), and writing directly into that process. When we used the omshell driver, this is what our CPU utilization on the Neutron DHCP agent looked like during sync_state, that is, whenever we restarted the Neutron DHCP agent. The green line here is the system CPU.
That is the time the CPU spent executing in kernel space; the red line at the very bottom is the user-space CPU usage on the Neutron DHCP agent. Since omshell is a separate process, we had to fork a lot of these processes and write to them; we had seven threads doing seven forks, syncing various things. As you can see, the CPU usage went up. The x-axis is the duration and the y-axis is the load of the system. The load went up, and the agent was spending far too much time in kernel space. Also look at the number at the bottom: 3,500, so it basically ran for an hour, and in fact it ran for several hours. This was with 2,500 subnets and 45,000 nodes, so there were 2,500 subnets and 45,000 ports to create. It was very slow, and for all those hours the usage was high. If somebody booted a machine during that time, it would fail, because the agent was busy doing this: it might not get the notification, or it would take a long time to process it. So this is bad.

Then we discovered a pure Python implementation of OMAPI; it is a library. And this is how the CPU load and duration went down. Again, sorry, wrong button: the red line on top is the user-space CPU usage, the green line is the kernel space, the x-axis is the duration of sync_state, and the y-axis is the load. As you can see, the number is now 600: for 45,000 ports it took about ten minutes instead of several hours. So the moral of the story is: never ever fork processes to do things; if there is a pure Python implementation, use that instead.

So where do we go from here? As I said, with OMAPI you can write things to the DHCP lease file, but whenever you want to add a new subnet, you need to write to the dhcpd.conf file, and whenever you modify dhcpd.conf you need to restart the DHCP server. That is not ideal. We have a VIP between dhcpd and the rest of the infrastructure, and whenever dhcpd restarted, the VIP thought the dhcpd service was down and would not send packets to the dhcpd server for a minute. Now, PXE actually retries the DHCP request multiple times, but even though the dhcpd restart is fast, it comes back in two or three seconds, the VIP would not send any packets to dhcpd because it thought dhcpd was down. So we were not actually serving DHCP to the machines that were booting; that was one of the causes of boot failures. A better approach is to move to the Kea DHCP server. This is a new DHCP server written by ISC for high availability. Any configuration change there does not need a restart, and it has a nice JSON API instead of OMAPI, where you have to deal with a binary protocol.
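To make the omshell-versus-library difference concrete, here is a minimal sketch of adding one host entry both ways: the old fork-omshell-per-operation shape and the pypureomapi shape we moved to. The key name and secret are placeholders for whatever is configured in dhcpd.conf, the omshell script is abbreviated (the HMAC key exchange is omitted), and the exact pypureomapi call signatures may vary slightly between versions.

```python
import subprocess
import pypureomapi

MAC = '00:de:ad:be:ef:01'
IP = '10.20.30.40'

def add_host_via_omshell(mac, ip):
    # Old shape: fork an omshell process for every operation and script it
    # over stdin. Each call pays for fork/exec plus pipe I/O, which is where
    # the kernel-space CPU time went during sync_state.
    script = (
        'server 127.0.0.1\n'
        'port 7911\n'
        'connect\n'
        'new host\n'
        'set name = "%s"\n'
        'set hardware-address = %s\n'
        'set hardware-type = 1\n'
        'set ip-address = %s\n'
        'create\n' % (mac.replace(':', ''), mac, ip))
    proc = subprocess.Popen(['omshell'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(script.encode())
    return out

def add_host_via_omapi(mac, ip):
    # New shape: speak the OMAPI protocol directly from Python over a single
    # connection; no forking, no shell, far less time spent in the kernel.
    omapi = pypureomapi.Omapi('127.0.0.1', 7911,
                              b'defomapi', b'base64-encoded-secret==')
    omapi.add_host(ip, mac)
```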
And fourth, let's talk about the density test. Before we onboarded a lot of machines into Ironic, we did something called a scale test: we did a nova boot of around 100 machines and watched how the system performed, whether it could do 100 parallel boots. It worked well. Then we also did a density test, which means we added a lot of machines to the OpenStack system: we imported around 50,000 nodes into Ironic and watched how the system performed.

At 24,000 nodes, while we were importing, the API servers started swapping. This is how things went: on top is the memory usage, on the bottom the swap usage. The orange, or saffron, is the used memory, and it went up over time. You can also see the swap usage: red is swap in use, green means no swap usage, and the swap usage went up as well. It all happened in week 13; 13 is a bad number, it's really true. This was our staging environment, where we were doing the density test, and the box had 24 gigs of RAM. Obviously the simplest way to solve this is to increase the amount of RAM in that machine. And the biggest consumer of memory was Neutron: Neutron was consuming around 1.4 gigs, and we had 24 API workers, which means it was just eating a lot of RAM. This was at 2,400 subnets and 43,000 ports. The easy fix was to reduce the number of API and RPC workers, so we reduced them from 24 to 10; instead of taking a lot of memory, Neutron would take around 10 gigs. Another, longer-term fix: we still have to investigate why it was using so much memory, but we want to isolate Neutron onto a separate server so that it does not adversely affect the other APIs running on the API servers. I think Neutron is causing problems upstream as well, especially with the gate, and their solution is also to reduce the number of API workers, which is not a scalable solution.

So, what we learned from running Ironic at scale. Do both a density test and a scale test before onboarding new machines. By scale test I mean we booted 100 machines, saw that 100 machines came up, thought everything was fine, started onboarding machines, and then saw issues. So if you want to add 50,000 nodes to your Ironic cluster, do a density test with 50,000 nodes and find out what happens. This is like a 101. If you are writing custom code, avoid spawning processes: if you have a task, especially a periodic task, never spawn processes; use native Python libraries as much as possible. Also pay attention to periodic tasks: the default interval for sync_power_state is something like 60 seconds, which does not do well for a 10,000 or 50,000 node cluster. So pay attention to periodic tasks and those configuration options. And be prepared to scale horizontally: your database could swap, your API nodes could end up swapping because of high memory usage, so always have additional machines in the subnet so that you can quickly scale up your infrastructure and the API clusters. Pay attention to the number of Ironic conductors, API workers, and RPC workers. And always don't forget to have one. Any questions?

Yes. So we assumed there were memory leaks. We looked at a lot of these machines; the uptime was very long, the machines had never been rebooted, and we had to reboot them. When we tried to reboot them with IPMI, these machines would take several seconds to respond. We are working with the vendor to figure out what the problem is, but we assume it is a memory leak, or the BMC just crashed; it is still being investigated. Yeah. So the only thing we could do was an mc reset cold, basically run the ipmitool mc reset cold command and pray to God that it works.
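For the wedged-BMC case, that remediation is a single out-of-band command; a tiny wrapper like the sketch below (not part of Ironic, with placeholder credentials) is roughly all there is to it.

```python
import subprocess

def reset_bmc(bmc_ip, user, password, timeout=60):
    """Cold-reset a wedged BMC; the last resort before someone pulls the plug."""
    # 'mc reset cold' reboots the management controller itself, not the host,
    # so the operating system on the box keeps running while the BMC restarts.
    return subprocess.call(
        ['ipmitool', '-I', 'lanplus', '-H', bmc_ip,
         '-U', user, '-P', password, 'mc', 'reset', 'cold'],
        timeout=timeout)

if __name__ == '__main__':
    # Example: try to revive one misbehaving BMC and report the exit status.
    rc = reset_bmc('10.99.30.40', 'admin', 'secret')
    print('ipmitool exited with status %d' % rc)
```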
Otherwise somebody has to physically go there, pull the plug and put it back in. It's bad. We also manage this by running an active process that scans the machines in the data center, so before the customer even tries to reboot a machine, we know it has a problem and we raise tickets for the data center operations folks to go and fix it. That is one way of coping with that problem. Thank you.

Can you talk a little about how you set this up and the problems you found around scaling and reliability? For example, how many nova-computes do you have for this kind of infrastructure?

Yes. So the question is about nova-compute and how it scales. In the upstream code you have the resource tracker, which runs and tracks all the resources. We implemented something like a claims API and entirely eliminated the resource tracker; this was based on an older spec that was proposed to address that problem. The new way to fix nova-compute is to run multiple nova-computes, which is supported as well: nova-compute now supports the same hash ring that the Ironic conductor uses. That is one thing to look at. So we eliminated the resource tracker, the nova-compute resource usage was much lower, and we had two nova-compute processes behind the leader election, so whenever one dies, another comes up through the leader election. That is how we solved it. Okay. Thank you. Thank you.

Hi. I think you were talking about scale. When you have more and more machines in there with the Neutron DHCP agents, have you ever encountered the problem where the DHCP address assignment is never refreshed, so that the VM you create actually has a problem getting an IP address as more and more machines get created?

Yes. First, in Ironic we are not talking about virtual machines; we are talking about bare metal, actual server hardware. With the omshell driver we were spawning multiple omshell processes, and forking a new process takes time. If the system was busy doing a lot of that and a boot came in and a new port got created, it could take some time for the port to actually be reflected on the DHCP server, because the agent was doing other things like sync_state. In that case we have seen the machine already powered on, starting to PXE boot, and never receiving DHCP. That is one. Another thing we have seen is with IP helpers: sometimes the IP helpers were missing on the switch. And a third thing is the HA side: you need to tell Neutron to send DHCP notifications to all the agents; there is a setting in neutron.conf for how many DHCP agents you want to notify about ports and networks. That is another thing to look at as well. But the root cause was basically omshell and forking a lot of processes; we moved to the native Python implementation and things started to improve.

Are you running a v6 or v4 or mixed network on the Neutron side? And also, based on your experience, what do you think in terms of the footprint you can control with Ironic, how far does it scale in a single deployment?

Yes. The first question is whether we support v4 and v6. We support both: Neutron ports can have v6 and v4 IP addresses, but internally we only support v4 DHCP; there is no v6 DHCP yet. And what was the second question again?
Based on your experience, what is your projection for how far you can scale robustly before you have to scale out completely horizontally? What does that footprint look like? Somewhere around 10,000 to 15,000 nodes, and even then we had problems. But we want to get to somewhere around 70 to 100k nodes in one cluster. We are getting there, and it's interesting. Yes. Thank you. Thank you. Any other questions? The next session is ready. Thank you.