Hello, everyone. Welcome to the talk about duct tape, bubblegum, and baling wire: the 12 steps of operating OpenStack. Before we get started, I just wanted to familiarize everyone with what a duct tape, bubblegum, and baling wire fix is. As you can see on this opening slide, there's a bit of duct tape applied to an airplane wing. It's not something you'd really like to see if you were getting on a plane, getting ready to fly across the Pacific Ocean. But evidently it's an important fix, a fix that needed to be made, and it works; it gets something done. It's a fix the guy's probably not too proud of, but it's something that's needed for things to function.

As we get into our talk today, we're going to set some ground rules. First we're going to talk about these 12 steps that we've adopted, and then we're going to talk about some of the different fixes we use to keep OpenStack up and running. In the second half, we're going to open the floor and look for audience participation. I'm sure as we get through this talk, a lot of things are going to come to mind, so I encourage you, if something strikes a nerve, to come up at the end and share your strange and interesting fixes.

With all that being said, we're going to first go through the 12 steps, the ground rules. You'll notice we've got just our first names up here: I'm Eric, and this is Matt, Mike, and Chris. We don't show any company affiliation right now; we're trying to create a safe place where we can all admit to these things that we do. We don't want to have shame for any of these things.

With that, the first step of our 12-step program: we admitted we were powerless over upstream, and that our lives had become unmanageable. If you're an operator, this is very familiar; this probably looks like a few of our desks, I think. The next step, and this is kind of why we're here: we came to believe that an ATC greater than ourselves could restore us to sanity. This speaks to us getting involved with this community, trying to share feedback with the development teams, with the cores, with the ATCs, the PTLs, and trying to shape the future of OpenStack and fix some of these problems we run into every day. The next step: we made a decision to turn our will and our lives over to the care of community involvement. That's the third step in our process; you can see that looks like one of us at our desk after going through an incident. The next step, and this is especially important for operators: we made a searching and fearless moral inventory of all of our hardware. If you work with data center operations folks, this is where we like to hang out and make sure everything is where it's supposed to be. The fifth step: we admitted to upstream, ourselves, and our customers the exact nature of our wrongs. Sometimes it's pretty obvious what our wrongs are when things don't go well, but it's important for us to be honest when things don't go correctly. The sixth step in our 12-step program: we were entirely ready to have upstream remove all these defects of character. Once again, this is part of why we're here this week: to get involved, to try to help shape the future of what OpenStack is going to be.
The seventh step: we humbly asked upstream to remove our shortcomings. The eighth: we made a list of all the customers we had harmed and became willing to make amends to them all. Sometimes that list can get pretty big when things don't go well. That's a phone book, yeah; if Neutron misbehaves, if Rabbit goes poorly, that's the list of customers right there. The ninth: we made direct amends to our customers wherever possible, except when to do so would harm them or injure others, or their VMs, or their tenant networking, or their storage, or any other thing that can go wrong and usually does. The tenth step: we continued to take inventory of where we were wrong and promptly admitted it, and tried to fix the things we knew were broken. The eleventh step: we sought through code review and bribery, which probably happened this week at parties or whatever, to improve our conscious contact with upstream as we understood it, praying only for the integration tests to pass and for the power to plus-two and merge things. Yeah, we'd like to not have to hit recheck on changes. And the twelfth and final step: having had a spiritual awakening as the result of these steps, we try to carry this message to other operators and practice these principles in all of our affairs.

So those are the twelve steps of being an OpenStack operator. We start with those to lay the groundwork and set the environment for how we're going to hold our meeting here. And with that, I'm going to pass it off to Matt to get into the meat of things.

Thanks, Eric. So now that Eric's defined the twelve steps, it's time for us to come together to make amends, to admit our shortcomings. It's time to talk through the duct tape we use to tie things together, the bubblegum we patch holes with, and the baling wire we use to bind OpenStack to our infrastructure. I'm going to start going through a few of these now.

The first thing we all need to admit is that we all have shameful cron jobs we run to keep OpenStack running. Glance cache cleaner and pruner: our hard drives are too small, and Glance kept filling them up, so we run glance-cache-cleaner and glance-cache-pruner every minute, on the minute, to ensure that Glance can't ever fill the hard drive. The second one is a Neutron "health check," in quotes: basically, to prevent L3 agent problems, every so often we just restart everything. Keeps Neutron happier. Horizon session cleanup: for a while, any time the load balancer checked Horizon, which is a lot, and we have a lot of load balancers, it put stuff in the database, and eventually the database fills up. So instead of fixing the problem, we just have a cron job to drop all the Horizon sessions out of the database. Finally, we have an open source time series database that's not super stable, so we have a 20-minute staggered cron job to restart it on every node in the cluster, one by one, so the cluster stays up and running. And hopefully while we're going through this, you can think of some of these things that you can come share with us at the end.

Okay, "just restart it." We all have to admit this is a solution to problems, especially if Google doesn't give us a better answer. So, you know, does Glance seem slow today? We probably just ought to restart it. Hey, 20 of our nova-compute checks just blew up and turned red? We should probably just restart it. I think Rabbit just died? We should probably restart everything. And we have tools to do this. And I know other people do too.
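To make the confession concrete, the cron layer for this kind of thing tends to look roughly like the sketch below. The schedules, service names, the cleanup script path, and the "tsdb" placeholder are all illustrative, not exactly what any one of us runs:

    # /etc/cron.d/duct-tape -- a sketch; names, paths, and times are illustrative
    # every minute, on the minute: keep the Glance image cache off the disk limit
    * * * * *    root  glance-cache-cleaner && glance-cache-pruner
    # the Neutron "health check": bounce the L3 agents before they wedge
    0 */6 * * *  root  service neutron-l3-agent restart
    # drop Horizon's piled-up sessions instead of fixing why they pile up
    # (horizon-session-cleanup is a stand-in for a local script)
    */30 * * * * root  /usr/local/bin/horizon-session-cleanup
    # restart the flaky time-series DB ("tsdb" is a placeholder); each node's
    # copy of this line uses a minute offset 20 apart, so cluster members
    # bounce one at a time and the cluster as a whole stays up
    40 * * * *   root  service tsdb restart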
And this is not Icehouse stuff. This is stuff we do in Kilo; everything listed here are things we still do in Kilo.

Before we knew better: we've all learned a lot over the past few years in OpenStack. Originally, if we had a problem with a floating IP, we'd restart OVS. Turns out this isn't always the best solution, especially when you have a lot of routers on a box: it takes forever to get the flows back, and you have everyone unhappy instead of one person unhappy. In our early days, if Rabbit seemed weird, we would just bounce that too, especially before heartbeat support; there was nothing really like that then, which kind of goes back to the earlier slide. This last one's one of my favorites: we ordered a bunch of hardware, it showed up, it got racked in the data center, and then we found out it couldn't boot the OS we wanted. Okay, that's it for my confessions. I'll hand off to Mike.

Hi, I'm Mike. I'm an OpenStack operator. So let's talk about tooling a little bit. If you've ever done any work on your house, or maybe on your car, something like that, you know there's nothing worse than trying to do a job when you don't have the right tools. These are some good examples. This guy's my favorite; he's really taken it to the next level. Thank you, Google image search.

But seriously, here are some things we need to do that we don't have the right tools for. Replaying events: we know that the events Nova sends out, or maybe Neutron, sometimes have a tendency to get lost. The agents may not be up at that time, or something times out. We really need a way to replay some of these things and get things back to a better state, and there's really nothing out there that lets us do that in a sane way today. Cleanup jobs: anybody have scripts to do cleanup? I mean, I know everybody does, whether it's orphaned VMs or orphaned QEMU processes. Somebody deletes their project out of Keystone, and we all know there's zero automatic cleanup of everything owned by that project; you've got to go back and fix all that stuff. Security groups in Neutron are another good example. I think we're all pretty intimately familiar with all the quota problems in Nova, and how the quotas almost always do not represent reality, and the things we have to do to reset those and make sure people can actually deploy VMs when they should be able to, even though Nova thinks they're out of quota.

Another one, a little bit more of an edge case, is this idea of being able to orchestrate things a little better. One example is having to evacuate a compute node: maybe it's got 20, 30, even more VMs running on it, and you want to live-migrate those all off so you can do some kind of maintenance. You probably don't really want to kick off 40 live migrations at once on one compute node; I know that's a real example for some folks here. We'd really like to be able to serialize that stuff, do it one at a time in a better-controlled fashion. And again, there's just really nothing out there that allows us to do those kinds of things today.

So what do we do? Who has a directory on their production servers that looks like this, with a bunch of random scripts? I mean, everybody should have their hands up. This is everybody, right? Yeah. I mean, literally, this is root's home directory on one of our production servers.
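One of those random scripts is invariably the serialized drain Mike just described. A minimal sketch against the Kilo-era novaclient might look like this; the flag spellings vary by client release, and the wait loop is our assumption about how you'd confirm each instance has actually moved:

    #!/bin/bash
    # drain-compute.sh HOST -- live-migrate every VM off HOST, one at a time
    HOST="$1"
    for vm in $(nova list --all-tenants --host "$HOST" --minimal \
                | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'); do
        echo "migrating $vm off $HOST"
        nova live-migration "$vm"
        # don't start the next one until this instance has actually left
        while nova show "$vm" | grep 'OS-EXT-SRV-ATTR:host' | grep -q "$HOST"; do
            sleep 10
        done
    done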
So you can see, I mean, these are things that needed to get done. We just had to get this thing out the door so we could move on with our lives. Or even better, I pulled a couple of things from our bash history. I like to call these the Rube Goldberg bash one-liners. Yeah. You're not alone, right?

So we all know how this goes, right? We always have this good intention of, well, when I have some time later this week or whatever, I'll clean these up a little bit and I'll upstream them so that other people can use them. And we actually do have an OpenStack operators project now that's supposed to be for exactly this kind of stuff, so we can stop reinventing the wheel with everybody doing the same things over and over again. I'd encourage you to take a look at it; the short name is just OSOps. But the reality is you never get there, right? You never have that time, because there's always something on fire, always something to be fixed. And that's just the reality we have to admit to ourselves.

A couple of the fun stories we'll call out here are the unexpected things that happen when you make a change you're not really planning on. This guy probably could have foreseen that coming. Should we just watch it about five times? And then I'll keep going.

No, so one thing we did was turn on force_config_drive in Nova to make sure that all VMs get a config drive. We use it to apply the networking config to all our VMs, so we wanted to make sure it would be there globally. So, you know, flip that switch on, should be great, whatever. Well, it turns out that for all the VMs that existed before that point, which didn't have a config drive, Nova now won't boot them, because it can't find the config drive device file on the hypervisor. That was a fun one. We had to go back and backfill blank config drives for everything out there so that we could actually boot all the VMs. Again, one of those things where you make some assumptions, you just don't expect it to happen, and then you've got to deal with it.

Kind of a similar thing with serial consoles in Nova as well. We turned those on so we could enable some out-of-band access into the VMs. But we ran into a strange race condition: if the console device had been cleaned up or deleted ahead of time, then Nova wouldn't be able to delete the VM, because it would try to delete that device itself, couldn't find it, and would just fail the whole operation. So now we've got a pile of accumulated cruft that we've got to go back with our cleanup scripts and reap. Just all these crazy things that happen that you don't necessarily think will occur.
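The config drive backfill, by the way, wasn't anything clever; it amounted to stamping an empty, correctly-labeled ISO wherever Nova expected one. A rough sketch, assuming the stock libvirt instance layout and that disk.config is where your Nova version looks:

    # backfill: give every pre-switch VM an empty config drive so Nova boots it
    mkdir -p /tmp/empty-drive
    for dir in /var/lib/nova/instances/*-*-*-*-*; do
        # "config-2" is the volume label cloud-init and Nova look for
        [ -e "$dir/disk.config" ] || \
            genisoimage -quiet -V config-2 -o "$dir/disk.config" /tmp/empty-drive
    done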
So, that's my spiel. I'll hand it over to Chris.

Hi, I'm Chris. Thank you. I've been operating OpenStack for about three years now. Sometimes the problem we run into is that we're doing something that isn't actually supported by the project. For us, that's Layer 3 networks: we run our network as a folded Clos design. Basically, the upshot is that Layer 2 only exists at a top-of-rack switch; we have multiple top-of-racks, and then we take a collection of those top-of-racks, those Layer 2 domains, and call it production, or dev, or test. In our particular network environment, we don't do Layer 2 everywhere; Layer 2 is constrained to a particular spot. So in certain cases it's a case of a missing primitive in the project to actually describe what we do. One of the cases is that we, and quite a few other large operators, need to have a constrained, segmented network, where a Layer 2 domain is only actually available to a particular set of hosts. For some of us, like I said, "any L2 anywhere" is a non-starter; this is almost an anti-Neutron policy. But the good news is that through the large deployment team I've been working with some Neutron cores to actually get a model that supports this type of thing put into Neutron.

The second thing is that once you have this kind of network, you can have subnets that you can route anywhere within it. Right now, any time you define a subnet, you have to define a Layer 2 network for it. That model doesn't quite map to this type of design, because these are just routes; you just route a subnet. There is no broadcast domain, there is no Layer 2 thing around it. So Neutron also doesn't support Layer 3 routed networks. There's another spec being worked on for that, and hopefully we can get something working between some of our larger deployments. But Neutron itself, right now, can't support the type of network that we have. So we have a bunch of hacks put in on the Neutron side and on the Nova side to abstract away the network information. Our end users basically aren't aware of what's going on underneath, but it creates a fair amount of pain for some of us who have to do this.

And sometimes it's really not OpenStack's fault; it's just that terrible error message. So, who here has seen this error? All right, for the others who haven't, maybe a little bit of learning. We did a Kilo upgrade, and about two days later we started seeing this error randomly across all our compute nodes in one environment; it was the second environment we had upgraded. Basically, it took us two days of troubleshooting, adding code, and debugging to figure out what the heck it means. The upshot is, if you're running nova-conductor, all the compute agents talk to nova-conductor for DB access, and conductor talks to the MySQL server. So basically the error is saying: I tried to contact conductor, something happened, and I wasn't able to complete my DB query. The issue ended up being that between nova-conductor and the database server we had a set of 12 links, and one link was taking 20% errors. If you average that across the entire traffic traversing that set of links, under 1% of traffic was affected, but it caused impact across our entire environment. And what's even worse, nova-conductor was logging exactly zero errors about having bad access to the database; it was successfully recovering. But that's when the "MySQL server has gone away" error actually happens: what it really means is that a query took nine seconds to complete from nova-conductor's point of view.
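Since then, the first reflex when conductor calls start timing out is to look for a quietly erroring link. The sysfs interface counters cover the host ends of the path (the switches in the middle need their own counters checked); the zero threshold here is arbitrary:

    # any NIC on this host quietly accumulating receive errors?
    for dev in /sys/class/net/*; do
        errs=$(cat "$dev/statistics/rx_errors")
        [ "$errs" -gt 0 ] && echo "$(basename "$dev"): $errs rx errors"
    done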
So, more terrible error messages. Hands up, who has seen "No valid host found"? If you're not raising your hand, you're lying to yourself, and you should see step one. There was an entire session dedicated to troubleshooting No valid host found; I think that actually was today. Basically, it happens for a variety of reasons. You can have capacity issues, configuration issues, transient issues inside your cloud, Rabbit issues, filter issues. That's actually what "there are not enough hosts available" is: if you screw up your filters and don't return any hosts, then you get that error, which is a little more descriptive, but still not actually telling you what the problem is.

But back to No valid host found. This is the error we show our end users. If you're a public cloud, maybe you don't want them seeing all the internals. But if you're an enterprise, you want your users to see it so they can help diagnose what's going on, because sometimes they put the wrong tag on a Glance image and it no longer matches a host, or they mistyped it; it's a Windows image, but they misspelled "windows," like they switched two letters, and now it no longer goes to the correct host and doesn't boot right, something like that. We need to have the option to let some of our end users see some of the error messages, because right now No valid host found, for all of us, ends up being a call or a ticket saying "your cloud is broken."

So with that, that's our basic intro. I hope that sparked some ideas in everyone. We wanted to open it up and try to gather more feedback, more horror stories, more crazy fixes, more anything you can think of. We'd also like to thank everybody for being here, and for the participation all the operators have with the upstream teams, the acceptance the upstream teams have shown in including us, and all that kind of good jazz. I'll leave the 12 steps up here in case anybody needs to refer to the rules. But with that, I'd open it up to the floor. Yeah, and if you want to step up to grab a mic, that'd be great, but if you want to just stand up and speak loudly, that's great too.

Yeah, so we have situations like that too, where we'll carry a local patch that upstream won't take, or maybe upstream will take it but it'll just take a long time to go through review, all the tests, and everything else in the feedback process. So for that we mirror the git repos that are upstream, and we've got a tool called git-upstream, which is also by OpenStack Infra, that helps us carry local patches and check whenever one gets merged upstream; then we close that patch down and don't carry it anymore, since it's been merged upstream. So we use a tool like that. We maintain our own git repos, and basically we try to follow stable branches, so we'll rebase our patches on top of those and build packages directly from the git repos.

Yeah, so that was one of the things we're working on: trying to get all these local patches that some of us are implementing, for business logic or to fix something, actually upstreamed. We had a pretty big list from about 12 or 15 companies, and we found a bunch of people were actually doing the same things, so we're trying to assign people to work on getting those upstream so pretty much everybody can take advantage of them, working with the PTLs and some cores in those projects. Not only so we can quit carrying patches, but so that when I tell somebody "oh yeah, we do this one thing" and they go "oh, I want to do that, where's your code?" I can actually get something out there so you guys can have it.
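For anyone who hasn't tried the carrying workflow, here's a rough sketch of the pattern being described. The branch naming is invented, and git-upstream's job is essentially to automate the rebase-and-drop step rather than you doing it by hand:

    # clone the upstream project and stack local patches on a stable branch
    git clone https://git.openstack.org/openstack/nova && cd nova
    git checkout -b local/kilo origin/stable/kilo
    git cherry-pick "$OUR_FIX_SHA"      # carry the local patch
    # on each sync: rebase onto the moving stable branch; patches that have
    # since merged upstream become empty and get dropped
    git fetch origin
    git rebase origin/stable/kilo

Packages then get built straight from the local branch, so the patch stack stays visible instead of being buried in a package spec.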
Yes, so some of us will do that, where we build our own packages, and as part of building the package we might inject a patch in there as well and carry it that way. I think there are probably a couple of different ways, and some of it depends on the situation you're in. We do it both ways: we put our stuff in a git repo, or for stuff that we take from other members of the community, people who have been willing to share patches with us, we'll carry those as patches on top of a package, applied by the tool we use; we use Anvil to build our packages. I know one of our informal rules, at least for our group, is that if we're going to carry a patch, we should at least put it up for review. That doesn't mean it's going to get accepted or merged or anything like that, but at least it says we've taken the step of reaching out to the community to say we think this is needed. Hopefully it gets merged and we don't have to carry it anymore and it just becomes a thing, but sometimes it takes a while. Yep.

Hi, my name is Kevin. Hi, Kevin. I've been operating OpenStack for, gosh, like five years now, I think, pretty much since the beginning; the first one I deployed was Bexar, from trunk, so that was a good time. And a lot of these problems have been fixed, so there is hope; the 12 steps work. I have a bunch of stories, so I'll start with one that was almost a disaster but didn't end up being one. We were using object storage to host images, and we had run out of room in our object storage cluster, and we'd run out of budget to buy more. At the time I worked for a company that happened to also have an internet provider business with set-top boxes in everybody's houses, and so they decided that a good idea might be to leverage the extra 80 GB hard drive in all of these set-top boxes and create a giant nationwide cluster. Kevin, congratulations, you win. Well, fortunately we shut that down, so that didn't actually happen; but it wasn't because they didn't think it would be a good idea technologically, they were just afraid of a lawsuit, I think. So that's a pretty good one. I can go on, but I'll let somebody else have a chance if anyone else wants to talk. The bar is set pretty high right now, so back to him.

I came in a little late; I'm not sure how many of you have seen the RPC timeout errors. Okay, so I can talk about two or three different RPC timeout errors. The first time it happened, while we were all figuring out why, it turned out RabbitMQ clustering was broken. If you're seeing Rabbit RPC timeout errors, rather than looking anywhere else, the first place to look is whether your Rabbit cluster is working as expected. This is because you might be sending a message to one of the servers while the listener is connected to the other one, and if they don't talk to each other, the caller will wait for, what, 300 seconds I believe is the default, and then time out.
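That first check is cheap enough to make a reflex. The output format varies by RabbitMQ version, but the shape is:

    # run on each Rabbit node: is the cluster actually one cluster?
    rabbitmqctl cluster_status
    # healthy output lists every node under running_nodes with an empty
    # partitions section; if the nodes disagree, a publisher on one node and
    # a consumer on another never meet, and callers eat the full RPC timeout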
The other one was Grizzly; this is a little older. Our devs swore that nova-compute would reconnect every time there was a Rabbit restart or a connection failure, but apparently it doesn't do that; it does recently, but like I said, I'm talking about Grizzly. So what happened is our playbook said that if you see any clustering issues, just go restart Rabbit, and it would do two things. First, it would wipe the queues; our script was written to wipe the queues, which means every transient message would be lost. The other thing is that on subsequent boots, whatever was broken would start working, but the other machines, which had been working, would break, because nova-compute wouldn't reconnect back to Rabbit. For sure, if you restart Rabbit, usually more than two things happen, and none of them are good. What are those two things? Well, you'll usually have to restart everything. Okay, right, yep. And transient data loss, if that's what you meant. Okay.

And this is the most recent one: again, we saw connection timeouts, but Rabbit-wise everything was working fine. This turned out to be a problem in our custom code; we had overridden nova-network to make our custom database calls, and somebody had written code to capture socket errors. Basically, we were leaking file descriptors: if you ran strace you would see there were no sockets available, that it was unable to open a new connection, but our code was capturing that and throwing a nice "connection timed out" error back. Thank you.

My name is Wei; I'm working for PayPal. We have operated an OpenStack cloud for, I think, four years. I'll just add something on the RabbitMQ issues: we have firewalls between the controller nodes and the hypervisors, and that makes the Rabbit connections even worse. Sometimes, if you have an idle connection, if you don't have a message in it for a very long time, the firewall drops your connection, but neither side knows that, so the connection still looks active. In that case, even if you reboot RabbitMQ it doesn't help, because nova-compute thinks the connection is still good; it's as if you have to reboot both. So currently, after we upgraded to Kilo, we enabled the heartbeat, and so far it seems to work. Yep, for sure, the Rabbit heartbeat stuff that was merged recently has definitely helped out with a lot of issues.

Hi, my name is Steve. Hi, Steve. Thank you. I've been running OpenStack since 2013, excuse me; it's been a while. And I want to tell you about one horror story that happened to me. One of our use cases was to run Cassandra on Ceph on OpenStack, and we deployed Cassandra in a test configuration with small drives, and it worked fine. And then the Cassandra team took over, and they started ramping up the I/O, and all the instances died, fell over, and sank into the swamp. We finally figured out, after many weeks, that we could get the same behavior by creating very large drives and running mkfs on them repeatedly; sometimes they would die. We kept going, trying to figure this out, and finally Red Hat comes back and says: how many file descriptors are you setting for your limit? Because of course Ceph opens a billion TCP sessions, which all require one file descriptor each. And when we looked at the number of threads, we were running about 4,000 threads per KVM process, which is a Ceph library thing, and each thread had an open TCP session to some OSD, because we had a couple of thousand of those. So that's what happened, and setting limits fixed all the problems, and it works great. But man, we couldn't figure that out for the longest time. So make sure to check your limits if you run into problems. Yeah, yeah, know your limits, that's right. Know your limits.
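Steve's fix boils down to one libvirt knob, assuming a libvirt new enough to have max_files in qemu.conf. A sketch, with an illustrative number and an assumed process name for the spot-check:

    # raise the per-qemu-process fd ceiling; with librbd, every thread can
    # hold a TCP session to an OSD, and each session costs a file descriptor
    echo 'max_files = 32768' >> /etc/libvirt/qemu.conf   # number is illustrative
    service libvirtd restart
    # spot-check a running VM (qemu process name varies by distro):
    grep 'open files' "/proc/$(pgrep -of qemu)/limits"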
My name is Igor. Hi, Igor. I've been running OpenStack clouds since 2011, at eBay, at PayPal, and at Symantec now, so I have quite a few horror stories, but I'm not going to try to share all of them today.

So, talking about limits, this is a recent one for us: anyone here run out of loop devices on the hypervisors? Loop devices. The default is eight, and if Nova for whatever reason fails to release a loop device, and that's happened to us a few times already, then any new provisioning on that node fails. And because that node at that point has the most resources available, everything always goes to the same damn node. That's intelligence, yeah. So we started increasing that on all of our hypervisors from the default eight to, say, 64, just to make it a little bit safer.

Now, back to connection issues: Glance. I like that one, actually; that one is interesting. We have a database cluster fronted by a load balancer. The load balancer, as usual, has long-lived connections, and then it drops the connection to the actual database. Glance doesn't know about that, so now it waits for, I think it's two minutes, or maybe I'm wrong, maybe it's ten, for that operation to fail before it can actually recover and start a new connection. Kind of not a problem, but an interesting one.

So, one more on connections; no, that one's not connections, it's Keystone. That one's actually my favorite, because I personally spent a few nights on it. A user complains about a latency issue with Keystone. So we go and check: everything seems to be normal, Keystone runs perfectly fine, everyone else has no problems, but this specific user sees spikes anywhere between, like, two and fifteen seconds. And then we try to figure out what happens to that specific user. Well, as we find out, it's the memcached lock: before it goes to memcached, it tries to take a lock, and it has this interesting algorithm for the backoff: random zero to one second, and fifteen retries. So it locks itself out, and interestingly, you end up with up to fifteen seconds of delay when you try to get a token in parallel from multiple threads. Enough, I think.

Tacking on to that Glance one: we've noticed a couple of times, when we have issues between where glance-registry is running and the database that glance-registry talks to, that sometimes, after that connection goes away and comes back, it will just throw 500 errors until you restart Glance. That's the only fix, right? Anybody else?
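Going back to Igor's loop devices for a second, the bump he described is a one-liner, assuming loop is built as a module; if it's compiled into the kernel, max_loop goes on the kernel command line instead:

    # raise the loop-device ceiling from the default 8
    echo 'options loop max_loop=64' > /etc/modprobe.d/loop.conf
    # after the module reloads (or a reboot), confirm:
    ls /dev/loop* | wc -l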
So, I don't know if anybody has ever used the Neutron "l3ha" tool, quote unquote, that was out there. Yeah. So that was actually a hack, and I'm just going to leave it at that; I wrote that, I'm sorry. That wasn't the one I was going to talk about, though.

So we had, and this is actually an OVS problem, we had a tenant go live, and they effectively wanted to create a CDN in OpenStack. Again, this is a different company. They needed unmetered IP addresses, so that effectively, if you downloaded something from your cell phone, it didn't count against your data plan's quota, and it took a year to get an IP address unmetered. And so, in their infinite wisdom, they got three IP addresses unmetered for a program where they wanted to send out something like 15 million firmware updates to Android phones. So they bring up three floating IPs, create three HAProxy instances, and proceed to point 15 million cell phones at these three addresses. Now, this was OVS 1.10, I think it was, and so it actually had a hard-coded flow limit of around 2,000. And as you probably know, with L3 every unique port and IP pair creates a new flow. So, you know, the first wave of people hits it, and oh my gosh, the L3 agent just falls over, and we're like, holy shit, what are we going to do with this thing? It's thrashing, the CPU is pegged; and OVS wasn't multi-threaded in its flow management at that point either, so the CPU is pegged at a hundred on one core. So what we ended up having to do, to make a long story short, was rip out the L3 agent software entirely. We threw hardware ASAs in front of it, set up those same IP addresses one-to-one, and then had to bridge that in, so we had to recreate their HAProxy instances with a second interface: you had traffic basically coming in direct, but the IPs were hosted on an ASA. So that was a lot of fun, and at least four weekends of my wife being very unhappy with me.

All right, that's 40 minutes. Thank you, everybody.
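An aside for anyone chasing the same symptom today: the flow-table pressure and the single pegged core both show up in a handful of standard OVS commands. "br-int" here is the stock Neutron integration bridge name, and the flow ceiling varies by OVS release:

    ovs-dpctl show                        # datapath stats: flow count, hit/miss
    ovs-ofctl dump-flows br-int | wc -l   # OpenFlow rules on the integration bridge
    top -H -p "$(pgrep -o ovs-vswitchd)"  # one thread at 100% = flow setup saturated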