I have one minute. Hey, good afternoon, everyone. How are you doing? Come on, I need some energy. I'm jet lagged, and if you don't give me some energy I'm going to fall asleep on this stage in five minutes. You've got to say "amazing" in a keynote, right?

My name is Simon Chung, I'm from Yahoo, and I'm here to tell you my life story over the last two years running the crazy bare metal clusters behind our production systems.

Just in case you don't know who Yahoo is, in some parts of the world maybe you don't: we focus on making the world's daily habits inspiring and entertaining, and we have over a billion users across desktop and mobile combined. We've made a bet to move all of our compute resources to OpenStack, whether VM or bare metal. Our goal is similar to Yahoo Japan's: put an API in front of all our compute resources so we can spin resources up and shut them down quickly, easily, and in a fully automated fashion.

Managing OpenStack bare metal hasn't been easy, because we're running really, really old software: we're still on the Grizzly release. We started with VMs long before bare metal, but as of May this year bare metal instances have actually eclipsed our VM instances. Bare metal has grown much faster, and that has led to serious scalability issues we never saw in our VM installation. Sorry for the graph with no scale; I can't give you specific numbers, but it's tens of thousands of VMs and tens of thousands of bare metal hosts. Not surprisingly, with that much bare metal we run 30-plus clusters in six regions globally. We were traditionally a bare metal shop to begin with, so we have hundreds of thousands of bare metal machines that we're still trying to get into OpenStack. For the time being we'll have far more bare metal than VMs, but we do have projects under way to move everything into VMs; wherever it's possible, we're getting developers to migrate their applications.

I'm not sure how many of you were in Vancouver for James Penick's talk six months ago; this is partly a follow-up, and I want to give an update on where we are with that. One challenge, because we're on Grizzly, is that imports kept taking longer and longer: one node took about seven minutes to import and make available, and at that rate, as I always joke, it would not be in my lifetime that all of Yahoo's inventory got into OpenStack. So we paused the migration: the software isn't scaling, we don't want to invest more in it, and we want to move to Ironic. The timeline is close now, so we're putting all our effort into getting Ironic going; we'll be running the Juno version, and we should move to it sometime this quarter. After we move to Ironic we'll resume migrating all that hardware into it.

Before I describe the challenges and learnings, let me walk through our bare metal deployment architecture.
In the OpenStack control plane we run Grizzly. We have multiple API nodes running nova-api, Keystone, nova-scheduler, and nova-network. For nova-network and nova-scheduler we only run a single active instance: we have scripts that work with ZooKeeper to decide which host is the primary and to start that one process only. We also have two MySQL databases running in dual-master, single-writer mode, and RabbitMQ in a cluster of two to three hosts. Nova-compute, which we call the bare metal controller (BMC) node, runs on multiple hosts as well, and we've put in a patch for HA: initially one BMC handled half of our nodes and another handled the other half, and when one went down we simply lost management of its half. With the patch, the surviving controller now fails over and picks those nodes up automatically.

As part of our OpenStack system we also have to deal with internal systems: DNS, which we update with host names, and our internal configuration management database (CMDB), where we store all our hardware information. The CMDB works with a proprietary imaging system that watches for a flag in the database and then images the host. The BMC compute node also talks to a power management API to reboot the box; the box reboots, PXE boots against our proprietary imaging service, images itself, and then reports back to the imaging service that it's done. That in turn tells the BMC controller the node has booted, and the instance goes active from there.

Next I want to describe the bare metal lifecycle we have at Yahoo. Users start by ordering the hardware they want through an ordering system. Once the hardware is purchased, data center ops rack and stack the hosts and add them to our internal CMDB. From there an inventory importer script talks to the ordering system and the CMDB, combines all that information, creates the OpenStack bare metal inventory, and sets the quota for the users. The users can then do a nova boot against OpenStack, which puts a host into the in-use state; they SSH into the host and use it for whatever they want. While it's in use, they have a couple of options. They can say "I don't want this host anymore", nova delete it, and it returns to the bare metal pool. The other thing we've implemented is break fix: if the hardware has hard drive or memory issues and they don't want it anymore, they run a nova break-fix and the host goes into a queue that gets processed later; I'll explain more about that in later slides. Because they get their quota back, they can nova boot a new host right away. After some time a host may be really old, the warranty has expired, it takes up too much power in the data center, and so on, so we have a retirement process: we take it out of OpenStack and another system retires the host.
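To make that lifecycle concrete, here is a tiny sketch of it as a state machine. The state and event names are just labels for this slide, not fields from our real CMDB or ordering system.

```python
# A tiny sketch of the bare metal lifecycle described above as a state machine.
# The state and event names are illustrative labels, not real CMDB fields.
LIFECYCLE = {
    ("ordered",      "rack_and_stack"):   "in_cmdb",
    ("in_cmdb",      "inventory_import"): "available",     # quota granted here
    ("available",    "nova_boot"):        "in_use",
    ("in_use",       "nova_delete"):      "available",
    ("in_use",       "nova_break_fix"):   "repair_queue",  # quota returned to user
    ("repair_queue", "repaired"):         "available",
    ("available",    "retire"):           "retired",
}

def next_state(state, event):
    """Return the next lifecycle state, or fail loudly on an invalid transition."""
    try:
        return LIFECYCLE[(state, event)]
    except KeyError:
        raise ValueError("no transition for %s on %s" % (state, event))

# Example: a node goes from available to in-use on nova boot.
assert next_state("available", "nova_boot") == "in_use"
```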
The other way hardware gets into the system is the horizontal migration I mentioned earlier: it takes existing inventory, doesn't re-image the hosts at all, and adds them straight into the in-use state with quota for the users.

Next, the challenges we've run into. First, we didn't have quota-per-availability-zone support. To explain what that means: when you nova boot, you normally just give a flavor. But the way our data centers are mapped out, we have backplanes and security zones, and hosts have either private or public IPs. We're also still running nova-network rather than Neutron, and every host is installed with a fixed IP, so these are essentially nodes sitting in the data center that have to be mapped correctly. Because we didn't have that support in Grizzly, users would nova boot a flavor thinking "great, I ordered machines with private IPs in this security zone", and instead get a host with a public IP in a different security zone, core instead of the DMZ, which is totally not what they expected. Some didn't even notice, because it was just that wrong. This also led to quota discrepancies. Say one user ordered 10 hosts with private IPs and another user ordered 10 hosts with public IPs. The first user boots everything up without caring and ends up with five private and five public hosts, and they happily run along. The user who ordered 10 public IPs, because they're hosting a service that needs to talk to the outside world, then asks: where are my hosts? I can only find five. They start complaining. We didn't realize this when we first launched, and when the quota discrepancies showed up it was a real headache.

The solution we implemented was to use host aggregates to define availability zones, apply quota per availability zone, and add a custom scheduler that can schedule hosts within an availability zone. I also spent about three months of my life combing through the database and all the inventory importer transactions, trying to reconstruct who ordered what and where they had booted, and in some cases asking customers, "can you please delete your host so I can give it back to the person who ordered it?" It was crazy. Many sleepless nights as well.

In the quota-per-availability-zone implementation, the aggregates we define have names like BP1 for backplane one, with single or multi IP and a security zone. When a user runs nova quota-show, what they get back is a flavor plus the availability zone, with an allocation (the quota they've been given) and a usage (how many hosts they have booted). When they nova boot a box, they must always give an availability zone; it's mandatory, so they only get the hosts they actually need. If they don't supply it, they get "no valid host".
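Conceptually, the per-availability-zone quota check looks something like the sketch below. The data structures and names are made up for illustration; this is not the code we actually patched into Grizzly.

```python
# Minimal sketch of quota-per-availability-zone enforcement, using made-up
# in-memory structures instead of the real Nova database tables.

class NoValidHost(Exception):
    pass

class QuotaExceeded(Exception):
    pass

# Allocation and usage are keyed by (project, flavor, availability_zone).
ALLOCATION = {("media-team", "baremetal.general", "bp1-private-core"): 10}
USAGE = {("media-team", "baremetal.general", "bp1-private-core"): 4}

def check_quota(project, flavor, az):
    """Reject the boot unless the project has headroom in that exact AZ."""
    if az is None:
        # The AZ is mandatory in our setup; without it the request cannot be
        # mapped to a backplane/security zone, so it fails with "no valid host".
        raise NoValidHost("availability zone is required")
    key = (project, flavor, az)
    allowed = ALLOCATION.get(key, 0)
    used = USAGE.get(key, 0)
    if used + 1 > allowed:
        raise QuotaExceeded("%s has %d/%d hosts in use for %s in %s"
                            % (project, used, allowed, flavor, az))
    USAGE[key] = used + 1

check_quota("media-team", "baremetal.general", "bp1-private-core")  # now 5/10 used
```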
Another issue we ran into was the resource tracker, a periodic background job that polls all the inventory, collects statistics, and updates the status of the hosts. As we passed a thousand hosts, the resource tracker got slower and slower to start. One day we realized that when we needed to restart the BMC node, say for an upgrade, the nova-compute process would not come up for a whole hour, because it sits in the background waiting for the resource tracker to scan all the hosts and inventory again. Maintenance became a nightmare: whenever you needed to restart, it just took forever. The solution at the time was simply to remove it, since we didn't need it; we implemented a more optimized direct DB call with SQL optimization, and now the compute process starts up immediately. Ironic also has this problem today, so there are blueprints we're tracking to hopefully get similar improvements there.

Next, nova quota-show was taking a long time. Normally when a user runs nova quota-show it returns straight away and tells them what quota they have available. For our users on some of the larger clusters it was approaching two minutes before it came back, and they were really, really annoyed. So again we put in a custom quota-show API that goes straight to the DB and does the aggregation there (I'll show a rough sketch of that idea in a minute), and we reduced it from two minutes down to about two and a half seconds. All of this work is going into our Grizzly branch, and we're not upstreaming it, because no one should be using Grizzly. We had to keep the system up and make it scale, so we're at the stage of hacking things onto old software just to keep going, but we're trying to apply all of these learnings to Ironic so that when we launch it we won't run into the same issues.

Another thing: nova boot was taking a long time. Someone would type nova boot and it would take six to seven minutes before the prompt came back saying the host was in the scheduling state, and at that point you still don't know whether it's actually booting. For someone booting three or four hosts, fine: wait a few minutes, a couple of hours later you're done. But then we had customers saying, "I want to boot 2,000 or 3,000 nodes in the next hour because peak traffic is coming", and we had to tell them that was not going to happen. The whole team spent an entire weekend booting hosts manually, bypassing OpenStack and going back to our previous imaging system. That was painful. We haven't found a solution for this one yet; we're scale testing Ironic right now and hopefully we won't hit the same problem. At this stage we've kind of given up on fixing it in Grizzly and moved on, but if someone has ideas, we're happy to hear them. I think part of it is also how we implemented the quota per AZ: there are so many combinations of flavors and availability zones to resolve that the boot command takes a long time.
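Here is that rough sketch of the direct-to-DB idea behind the faster quota-show. The connection string, table, and column names are assumptions for illustration; the real Grizzly schema and our patched API look different.

```python
# Sketch: compute per-flavor, per-AZ usage in one aggregated SQL query instead
# of walking instances one by one in Python. Schema and DSN are assumptions.
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://nova:secret@dbhost/nova")

USAGE_SQL = sa.text("""
    SELECT instance_type_id, availability_zone, COUNT(*) AS in_use
    FROM instances
    WHERE deleted = 0 AND project_id = :project_id
    GROUP BY instance_type_id, availability_zone
""")

def quota_usage(project_id):
    """Return {(flavor_id, az): in_use} for one project with a single query."""
    with engine.connect() as conn:
        rows = conn.execute(USAGE_SQL, {"project_id": project_id})
        return {(r.instance_type_id, r.availability_zone): r.in_use for r in rows}
```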
Another problem we ran into a lot was backend dependencies, the internal systems we rely on like DNS and the CMDB, and that led to a lot of RPC timeouts in OpenStack, with DNS and the CMDB ending up out of sync. The way the code was written, when a call fails it just stops there and doesn't clean up any of the dependent systems, so broken nodes ended up stuck and no longer available to our users. As a short-term fix we increased the RPC timeout, but we can't keep increasing it: each time we do, when there is a real problem the users get feedback even later, and already today you have to wait about two hours before you know whether a boot is going to end up in an error state. We've also improved the code so that on failure the DNS and CMDB changes stay in sync instead of drifting apart, and we audited the DB and corrected all the inconsistencies so the inventory is actually available to users.

The other thing that totally surprised us was that the operations team was supposed to just manage the platform. Suddenly, because we're the face of all compute resources at Yahoo, issues that used to go to much bigger data center ops teams around the world now all come to us as the first line through OpenStack, and we had to deal with a lot of issues we weren't prepared for. We built up quite a backlog because of that. When we first launched we simply didn't realize how much of that work there would be. So we've started training site ops: giving them access to OpenStack, making sure they know the basic operations so we can hand those tasks off to them, and making them more proficient with OpenStack overall. They're not just another party; they have to be an integral part of the team running OpenStack.

The next thing we ran into was the importer taking a long time, the same roughly seven-minutes-per-host problem as the migration. Each time we imported hosts we had to recalculate the quotas across the existing nodes and work out whether we had enough quota to assign to a user, and it took almost an hour to import just 10 or 20 hosts. In fact it didn't really depend on the number of hosts: the calculation took hours whether we imported one host or ten. The solution, again, was an API that goes directly to the DB and lets the database do all the calculations, which improved each batch of hosts coming in from the hardware ordering system from hours to literally minutes.

The next topic is boot failures. Hardware failure is a fact of life. If you have one computer at home you're lucky: it probably lasts a few years without breaking. But when you run hundreds of thousands of hosts, even a 1% failure rate is a lot of machines. The IPMI interface fails a lot: the host gets jammed, you lose control of it, and you can't remotely boot it, which translates into an error for the user, so they can't use their node and just keep getting errors. Sometimes a host gets installed incorrectly and the PXE boot order is wrong, so it keeps booting from the hard drive, never gets re-imaged, and the boot fails there again. Hard drives fail too, so imaging the host fails, and that causes more issues.
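To give a flavor of the kind of checks those failure modes call for, here is a rough per-node probe: is the node's IPMI interface still answering, and is PXE set as the next boot device? The host, credentials, and exact checks are placeholders, not our production tooling.

```python
# Rough sketch of a per-node health probe for the failure modes above.
# Host and credentials are placeholders; real checks also cover disks and imaging.
import subprocess

def ipmi(host, user, password, *args):
    """Run an ipmitool command against one node's management interface."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password]
    cmd.extend(args)
    return subprocess.check_output(cmd).decode()

def check_node(host, user="admin", password="secret"):
    try:
        power = ipmi(host, user, password, "power", "status")
    except subprocess.CalledProcessError:
        return "ipmi-unreachable"   # the "jammed" case: we lost remote control
    if "off" in power.lower():
        return "powered-off"
    # Force PXE for the next boot so a wrong boot order can't keep the node
    # booting from its old hard drive image instead of re-imaging.
    ipmi(host, user, password, "chassis", "bootdev", "pxe")
    return "ok"
```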
Our solution was to involve data center ops more and have them use the same tool chain, so they can use OpenStack to boot and verify those hosts. Basic issues that OpenStack can't reach, like a wrong PXE boot order, stop being a problem because they have direct access to the box and can fix it right there; otherwise the OpenStack team has to work with them remotely to get the host fixed. The other thing we do is collect and analyze all the failures: we try to find patterns, whether certain vendors or certain models have more issues, and then work with the vendors to bring those higher failure rates down.

With all these failures, we also came up with a synthetic boot test that uses Jenkins to schedule regular nova boots. We run nova image-list to make sure the cluster is still responding, nova boot a host, SSH into it and run uptime just to prove we can image it, log in, and use it, and then nova delete it to make sure we can return it to the pool. That gives us a trend over time of how the boots are doing and how long they take, and Jenkins is great for collecting the logs so we can analyze them afterwards.

Next is Reparo. I mentioned break fix earlier; Reparo is something our team came up with to make fixing hardware easy for users. The name comes from Harry Potter: a flick of the wand and everything is fixed. For the users it means they don't have to deal with data center ops: they run nova break-fix, the host disappears, they get their quota back, and they can boot a new host. We introduced a new nova command and API, nova break-fix, which takes the instance UUID and a comment describing the problem. We built it on TaskFlow, another OpenStack project, for the automation. It gives users a smoother experience dealing with hardware; they don't have to file tickets, it's just nova break-fix. It decouples the break fix process from the users so they can move on with their lives, and behind the scenes, asynchronously, we fix the host and return it to the pool. So the process goes: the user nova boots, finds an issue, runs nova break-fix, gets the quota back, nova boots again, and moves on.
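Below is a minimal sketch of what a break-fix flow built on TaskFlow can look like. The task names, arguments, and steps are illustrative assumptions; the real Reparo flow has more steps and error handling.

```python
# Minimal sketch of a break-fix flow on TaskFlow; task names and steps are
# illustrative, not the actual Reparo implementation.
import taskflow.engines
from taskflow import task
from taskflow.patterns import linear_flow


class ReleaseQuota(task.Task):
    def execute(self, node_id, project_id):
        print("returning quota for %s to project %s" % (node_id, project_id))


class FileRepairTicket(task.Task):
    def execute(self, node_id, reason):
        print("asking data center ops to repair %s: %s" % (node_id, reason))


class ReturnToPool(task.Task):
    def execute(self, node_id):
        print("node %s repaired, adding it back to the bare metal pool" % node_id)


# The user only runs "nova break-fix"; a flow like this runs asynchronously behind it.
flow = linear_flow.Flow("reparo-break-fix").add(
    ReleaseQuota(),
    FileRepairTicket(),
    ReturnToPool(),
)

taskflow.engines.run(flow, store={
    "node_id": "3f2c0b",        # instance UUID passed to nova break-fix (placeholder)
    "project_id": "media-team",
    "reason": "bad hard drive",
})
```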
Now for Ironic: the learnings and some of the future work. We did some initial scale testing with Ironic, benchmarking with Rally, and tested 1,000 nodes at 200 concurrent boots. We achieved a 97% success rate, with 2% node failures and 1% resource tracker issues. The node failures were hardware issues, so there was nothing to do but fix the hardware; the resource tracker issue is a scalability problem in Ironic that we had to fix. To get to that point from a vanilla Ironic installation we had to tune a few things. With Ironic we are using Neutron, so we changed the Neutron server API workers from 0 to 24, increased the Ironic conductor workers to 500, and updated the SQL min and max pool sizes for Ironic and Neutron to 10 and 500.

The other thing we ran into was that we had updated Keystone to authenticate against our own internal authentication system, and somehow the service accounts were using it too. Every communication between services, for example nova-compute to Ironic or Ironic to Neutron, would re-authenticate and get a new token, and that was hammering our internal system; things slowed down and we saw more timeouts. In the short term we've disabled that and pointed the service authentication straight at the database, so we bypass the external system. Longer term we want the services to cache and reuse tokens so we don't hammer those systems.

Some of the future work we're looking into: the Ironic API today is a single worker, and if you need to scale you need more of them, so we're looking at putting it behind Apache running under mod_wsgi so we can run more Ironic API workers. We also need the ability to run periodic tasks in parallel; today they're sequential. For example, the power management status polls and updates are blocking calls, so everything queues up behind them, and we found we would get RPC timeouts: as you boot more hosts, if one host takes a long time and doesn't time out on the power call, everything gets blocked, everything slows down, everything times out, and things start looking really bad. I'll show a tiny sketch of the parallel-polling idea in a minute. Our goal is to boot 1,000 hosts concurrently; today we're at 200 and trying to get to 1,000 or more. I also mentioned improving the resource tracker: in Ironic today, after we restart the process it takes about five to seven minutes for the resource tracker to finish before nova boot is available, and that's with 1,000 hosts, so adding more will obviously take longer.

Some blueprints of interest. Ironic multiple compute host support, which we hope will solve some of the scalability issues we have with the compute node. Switching the periodic tasks to the futures library, which I think has been approved; that will let the blocking calls run in parallel so independent boots aren't tied up behind a single sequential loop. Another one we're looking for is securing the communication between the Ironic Python Agent (IPA) and the conductor: today the node needs to talk to Ironic, which implies it has network access to your control plane, so if any host gets compromised it could exploit the control plane, possibly take it over, and re-image your entire infrastructure, which is something we definitely don't want to happen. And the last one of interest is manual cleaning. Ironic has automatic cleaning: after each nova boot or delete it cleans the host and makes sure it's back in a good state before the next user boots it. But there are things we want to do, like updating firmware, resetting the RAID configuration, or running burn-in tests on a node, that take much longer, and we don't want them to be part of automatic cleaning. We need a way to say: this set of hosts is new, or the time is up, so run manual cleaning on them.
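Here is the tiny sketch of the parallel-polling idea I mentioned, using the standard futures library. The poll_power_state function and the node list are stand-ins, not Ironic's actual conductor code.

```python
# Minimal sketch of running power status polls in parallel instead of
# sequentially, the same idea as the futures-library blueprint mentioned above.
from concurrent.futures import ThreadPoolExecutor, as_completed

NODES = ["node-%03d" % i for i in range(1000)]

def poll_power_state(node):
    # In a real conductor this would be an IPMI call; here it is a stub.
    return node, "power on"

def poll_all(nodes, workers=50):
    """Poll every node with a worker pool; one slow node no longer blocks the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(poll_power_state, n): n for n in nodes}
        for fut in as_completed(futures):
            node, state = fut.result()
            results[node] = state
    return results

if __name__ == "__main__":
    poll_all(NODES)
```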
Just a quick plug for my colleague's talk on Thursday: if you want to know how we use Neutron with Ironic, that's his talk, and that's James sitting right here.

And let me tell you a funny story about the thank-you slide. I wrote a slide with all the open source contributors and all the people who helped me through the sleepless nights, or mostly kept me from having them, and I was going to thank them publicly. In the end I had to delete it, because after speaking to my manager we realized it would become a hit list for you to hire them away, so we can't share it. But I want to thank them from the bottom of my heart: without our team back home, my life as an operator would not be anywhere near as easy. And hey, if you're an OpenStack ninja and want to help us solve these problems, I'd be happy to forward your details to our hiring manager as well. Last but not least, without all of you, the OpenStack contributor community, OpenStack would not be where it is today and we wouldn't be able to run our infrastructure on it. Thank you for listening to me ramble for the last 30 minutes.

One last thing: there's an app, and I was asked to remind you to use it to provide feedback. It helps me and, I think, helps the OpenStack Foundation, maybe to choose speakers, so please be nice; I don't want to get banned from future summits, because I really want to attend the next one. That's it, questions and answers. If you can use the mics at the front for questions, that would be great; we have about 10 minutes.

I think the spec was renamed from zapping to manual cleaning, or something like that, yeah. No more questions? Did you not understand what I talked about, was I really boring, or did I just cover it really well? Thank you.