All right. Hi, everyone — welcome to Tales from the Trenches: the good, the bad, and the ugly of OpenStack operations. The good news is that we are the only thing standing between you and the end of the summit.

Unfortunately, one of my colleagues and friends, John Dewey, was not able to make it; he was supposed to give this talk. So one of my current colleagues, Jesse Keating, is going to help me give it today.

I am not the John Dewey you were looking for — that's me. I'm Jesse Keating. For the past couple of years I've been a deployer of a public cloud; we like to call our stuff the Kraken, and we release it frequently. Before that I was a release engineer for Fedora for about seven years, I've been a contributor to Ansible, the orchestration engine, quite a bit, and to many other open source things besides.

And my name is Craig Tracey — as this slide points out, you may remember me from the failed keynote at the Portland summit. I've been involved in OpenStack for a little over three years now. I've been a contributor to OpenStack, to a whole bunch of ecosystem projects around OpenStack, and to other public cloud providers. Basically day-to-day, soup-to-nuts OpenStack.

One thing we discovered when we were coming up with the content for this talk is that each of us has a very different perspective. Right now I'm at a company called Blue Box, and the business Blue Box is in is selling private cloud as a service. That means we stand up small to mid-size OpenStack clouds for customers, completely dedicated to them. Jesse, on the other hand, is coming from Rackspace, a public cloud, which is an entirely different thing.

I'm also a Blue Box employee now, but a few weeks ago I was a Rackspace employee. At Rackspace we dealt with a small number of very, very large clouds. In fact they were all pretty much the same cloud — they all had the same auth endpoint, and you could launch things in one or the other — but essentially there were six mostly independent, very large public clouds.

Out of my own curiosity — we were talking about this just before the talk — how many people here are operators of an OpenStack cloud today? Okay. And how many of you are managing multiple clouds — more than one? How many environments do you independently deploy to that might be running different versions of OpenStack? More than ten? Okay. I think that adds a bit to the perspective of what people are doing.

So the first thing we want to talk about is installation. This really is tales from the trenches, and as an operator, installation is one of the things you're going to have to deal with. There's this famous quote — does anybody know who said it? It was me: installation is not a solved problem. Maybe a year ago I thought it was a solved problem. One of the projects I contributed to quite a bit was the Stackforge Chef cookbooks; I was a core contributor for quite a while and saw a ton of churn in those cookbooks, and that churn continues today. There are tools out there you can use — Chef, Puppet, Salt, Ansible — and our tool is Ansible-based; it's called Ursula.
There will be a link for it at the end. But one thing that made me really realize this is not solved is looking at the source base we have for Ursula today: it's 15,000 lines of code, there are over 1,500 commits and they keep coming, and last week I submitted a pull request that was 1,500 lines by itself. So this is definitely not a solved problem yet.

When I joined Rackspace I thought it was a pretty much solved problem too. They were fairly well established, and I was coming in to help with a lot of the automation — the more interesting things, as I thought of it at the time. But in reality, as we got better at doing our rollouts and our installs, we realized that a lot of the decisions we'd made about how we did our installs were having a profound impact on what we could do with our cloud later on. Before I left, we were in the process of investigating and revamping how we install and manage our OpenStack. So installation: by far not a solved problem.

I also think installation is a very personal thing — personal to your business and how you deliver it. Projects like the open-source Chef cookbooks, or even our own code, are very opinionated. We welcome contributors, and we'd be happy to make it as unopinionated as possible, but it still ends up opinionated. There's also a relaunched effort at beefing up the OpenStack playbooks for Ansible for creating an OpenStack environment. It's going to happen on Stackforge, and it's going to be a reference model for how to stand things up; we're going to try to make it as unopinionated as we can, at least to start with, but with the hooks in there to make it as opinionated as you want.

As Jesse and I talked more about our different experiences — again, me delivering lots of small clouds, him delivering one large cloud — one thing that became clear is that despite our differences in how we install and operate, especially around installation, there is some convergence in the community. People are starting to deliver things with artifacts. Containers seem to be a very popular thing right now; there's a whole slew of new projects around delivering OpenStack with containers.

If you look back over the past couple of years, everything sort of starts off as DevStack. It's this great thing: you drop it in, you have a working cloud, and everything's great — because it worked in DevStack. But once you take that into production, you realize you can't really manage it. Yet a lot of production deployments, particularly for smaller clouds, look a lot like DevStack: all of your services running on a single system, and if you want any sort of HA you just duplicate that over and over again. As we moved on, it became apparent that trying to run more than one service in the same environment was a hassle, particularly as each of the OpenStack services became a little more independent, more opinionated about its dependencies, and operating on its own schedule.
It really started to become apparent that each of these things is loosely coupled — which we'll get to in a moment — services, microservices really, that interact with each other at an API level. So the move away from a monolithic thing that you stamp out ten of, toward individual services that you put in as many places as is appropriate for each service, is starting to become clear, at least to us and to other people in the operator community.

Our job, again, is spitting out large numbers of small clouds and making sure they look the same all the time, so artifacts, for us, are key. Here are some things we've learned along the way. The first was that distro packages were not going to cut it for us. There are many, many times when we want to add our own changes to OpenStack — OpenStack is a constantly evolving thing — and we do not want to be beholden to the cadence of any distro. You could build your own packages, but as a guy who built his own packages for seven years, that's not a job you want. You want to be focused on getting your changes integrated and out into your cloud; you don't want to be dealing with the intricacies of how to build a deb today, or where your RPM dependency went, or where to put it, or how to update the apt cache and keep all of those things in sync. Those just aren't interesting problems. So what's the better option?

Okay, how about another poll of the people who are deploying OpenStack: are you using distro packages? Maybe raise a hand. Wow. How many of you have made customizations to those distro packages? Okay — pretty much everyone.

At the same time, for us, deploying from source is not really an option either. We have done it, and it has proven difficult. The problems you run into, right off the bat, are dependencies: the Python dependencies as well as the system-level dependencies you have are constantly moving. I cannot tell you the number of times we've been bitten by some upstream Python change, or by some system-level dependency that moved on where we weren't locked down enough. And for anyone accustomed to the way OpenStack is delivered, the requirements.txt files are hardly versioned — they say "hey, we want this package, greater than or equal to 1.4," and that doesn't really cut it in production.

Along with the problems of deploying from source — that's the model you see with a lot of new-age developers: throw up a Vagrant box, pull your source, and off you go — you'll often wind up with different things running in different places at different times. Craig has a really strong desire that all of his clouds look the same even though they're deployed at different times, and as the large-cloud deployer, I want all of the instances running my cloud to look the same, because I don't want my operators and admins having to figure out why this version of a library works here but some other version of another library on another node in the same cloud does not. We needed consistency across the board in what we were putting out into our environments.
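To make that "greater than or equal to 1.4" point concrete, here is a minimal sketch — my illustration, not any project's actual tooling — of turning a loosely versioned requirements file into exact pins captured from an environment you already trust. The file names are assumptions.

# Sketch: turn loosely versioned requirements (e.g. "oslo.config>=1.4") into
# exact pins captured from an environment you already trust.
# Assumes a requirements.txt in the current directory; adjust paths to taste.
import re
from importlib.metadata import version, PackageNotFoundError

def pin_requirements(src="requirements.txt", dst="requirements.pinned.txt"):
    pinned = []
    with open(src) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Take the distribution name: everything before any version
            # specifier, extras marker, or comment.
            name = re.split(r"[<>=!;#\[ ]", line, 1)[0]
            try:
                pinned.append(f"{name}=={version(name)}")
            except PackageNotFoundError:
                # Not installed in this environment -- keep the original line
                # so a human can decide what to do with it.
                pinned.append(line)
    with open(dst, "w") as f:
        f.write("\n".join(pinned) + "\n")

if __name__ == "__main__":
    pin_requirements()

In practice, pip freeze inside a throwaway virtualenv gets you most of the way there; the point is that the pin list, not the loose ranges, is what you ship.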
So, with that being a sore spot for me, I worked together with John Dewey and we started a project called giftwrap. The whole premise of giftwrap is to take OpenStack — which has always been delivered as an integrated release, not as a product or a set of packages — and turn it into packages you can consume. What giftwrap does is go off to Gerrit and basically decide what kinds of packages it has to build; it uses Gerrit, as well as DevStack, to determine its dependencies. Out the bottom lands either an RPM, a Debian package, or — working in test, but not checked in yet — a Docker container. That's the direction we're moving in for shipping our bits: we intend to ship all of our bits in Docker containers. Lay down a base OS with Ironic — because, again, we're using images, something we know — then lay Docker containers on top of it, again something we know, and something that's easy to upgrade.

This was a really good convergence point — kind of like you can see on the train map — where Craig and I realized we were solving the same problem. Within Rackspace we had an internal tool; the public name for it now is Striker, and it's going to be thrown over the wall soon, I hope. It was doing a lot of the same things in a slightly different way. It would pull down git from the upstream OpenStack repositories, use a tool — I'm drawing a blank on the name — that integrates the localized patches Rackspace might have into each of those repositories, examine the requirements files plus anything we've realized is a requirement or a specific version requirement for a particular project, and build individual virtual environments for each of our services. We hadn't gone down the path of building a package or a container yet; what we do is build a large tarball of all the individual virtual environments, toss in our configuration manifests to configure those services, wrap it all into one big tarball, and put that on our payload server. And one thing I didn't mention about giftwrap: giftwrap also uses the virtualenv technique to ensure that each service is isolated in terms of its Python dependencies.
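The shared idea in both tools is one virtualenv per service, built from pinned requirements, so Nova's dependencies can never step on Glance's. Here is a minimal sketch of that shape — the service names, source paths, and pin files are placeholders, not either tool's real layout.

# Sketch: build one isolated virtualenv per service from pinned requirements.
# Service names, source checkouts, and pin files are placeholders.
import subprocess
import sys
from pathlib import Path

SERVICES = {
    "nova":   {"src": "/opt/src/nova",   "pins": "/opt/pins/nova.txt"},
    "glance": {"src": "/opt/src/glance", "pins": "/opt/pins/glance.txt"},
}
VENV_ROOT = Path("/opt/openstack/venvs")

def build_venv(name, src, pins):
    venv = VENV_ROOT / name
    pip = venv / "bin" / "pip"
    # Create the virtualenv with the stdlib venv module.
    subprocess.check_call([sys.executable, "-m", "venv", str(venv)])
    # Install the exact, pre-pinned dependency set first...
    subprocess.check_call([str(pip), "install", "-r", pins])
    # ...then the service itself, without letting it drag in new versions.
    subprocess.check_call([str(pip), "install", "--no-deps", src])
    return venv

if __name__ == "__main__":
    for name, cfg in SERVICES.items():
        print("built", build_venv(name, cfg["src"], cfg["pins"]))

Tar up the venvs directory and you have something very close to the artifact both of us ended up shipping.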
So, let's talk about operations. A lot of you were installers — do you also consider yourselves long-term operators? How many operators have we got? We already asked that question; same hands. All right, cool — I have a short-term memory issue.

One thing that was critical for us in stamping these things out is how we got to our initial architecture. We announced our offering for the first time a year ago, in November of 2013 at the Hong Kong summit, and we went GA in May. One of the things that got us to where we needed to be almost immediately was unifying our architecture, meaning we could take any single one of our nodes and run any of the services on it. In many cases we're actually co-locating compute as well as controllers on the same nodes. That becomes problematic when you get into certain types of workloads and scale scenarios.

So for us, we made a very clear decision recently to start breaking these services out and making sure they are composable by themselves. Say you have a very heavy workload in terms of image writing: if you have a 200 GB image, every time you write that image through the same controller node that's serving up your API requests, you're going to take a hit. Having these services composable, so you can break them out onto different nodes if you have to, is a great thing. In my estimation, going with a unified architecture to begin with is great — it's a great vehicle for getting started — but it may not be where you want to land long term.

At the same time, you want to make sure you're not building snowflakes. Again, we're in the business of stamping out these stacks one by one; we don't want snowflakes we can't manage, just like we don't want to manage crazy dependency chains. The same goes for how we stand up the stacks: snowflakes are not allowed in what we build.

And with that comes some improvisation. One thing I always think about is that some of our stacks are so small that we're not including things like Swift, so we do things behind the covers to make it look like Swift is there. For instance, think about Glance: Glance can use Swift as its backend, and then you can have multiple Glance services all talking back to Swift for serving up images. Because we're a private cloud dedicated to one customer, they might not have the capacity for a Swift cluster. In that case we do things like rsyncing images across the different Glance nodes to make sure an image is always available for any Glance request.

My last recommendation around architecture is to always make sure you're separating your control plane and data plane operations. No matter how big or small you are, you will undoubtedly be asked to add more capacity to your cloud, and by separating the control and data plane paths, when you get a new node addition request you can operate on that single node instead of touching the entire stack.

This next one is a less pretty picture, with very small text, but it's more of a picture of what the Rackspace public cloud looks like. Craig has an interesting constraint in that his cloud has to exist entirely for his customer — he stamps out a whole control plane for each customer and it can't be co-located with anything else — which means they have to use some of that customer's capacity for the control plane. Rackspace had something of a big benefit here: we had an internal private cloud that we could use to run our public cloud. Essentially we had an infinite number of resources at our disposal for launching the control plane of our public cloud, and the only physical constraint was the number of hypervisors we were going to plug into it. So from day one — or at least from the day I got there — all of our services were separated out. Every single Nova API was its own virtual machine, its own instance on our internal cloud. The Nova cells, the console, the Glance API — if it was a service that ran out of a virtualenv, it was 99% likely to be its own VM.
We might size it down or size it up, but it was going to be an entire VM. We also broke things into cells, so we could treat a set of physical hardware somewhat on its own and break apart how much traffic was going where. So we had a lot of control plane sprawl no matter how small the cloud was, but it got us around a lot of the problems Craig was having, and we could easily target just the Nova APIs to do something, or just the nova-computes to do something.

But even this isn't quite the best, because for a very small cloud — say, a dev environment — we'd be consuming 15 or 20 VMs just to manage two hypervisors, and that's pretty heavy. So, again, not "we" anymore, but Rackspace was going down the road of trying to shrink this. We didn't want to go back to the world where services were stacked on top of services, because that gets really ugly on the system. What we did want to do was take these things that each consumed a full instance, shrink them down into containers, and stack those on each other — but separated out enough that they almost feel like different operating systems. So again: if you can afford it, if you can design for it, isolating each of your services down into the smallest thing you possibly can is really powerful for what you can then do with orchestration across your cloud.

All right, so let's talk about upgrades. Installation — that problem we solved earlier — is the easy part. Once you get it installed you then have to upgrade it, otherwise lots and lots of bugs pile up and features go missing. And it's a hard problem: the longer you wait, the harder the upgrade gets. You have to rent that forklift, lift everything up, drag it over here, set it down, and pay all the penalties along the way. So upgrading early and often is highly recommended — but upgrades can be hard, particularly when you're doing them at the same time your customers are using the cloud.

A lot of Craig's clouds are single-tenant; they have one user. When you're working with one user on a smaller scale, it's fairly easy to agree on certain times of day when you can go in and disrupt their services to do your upgrades. The Rackspace public cloud has something on the order of hundreds of thousands of users. I couldn't get ten of you to agree on where to have lunch together; there's no way I'm going to get a hundred thousand users to agree on a good time to upgrade the cloud. I couldn't even get that many people to agree on what time it is, because across the globe, time is somewhat meaningless. So we always have to operate as if there are live users — who very quickly turn into angry users when the cloud doesn't work.

That said, the goal is still near-zero downtime for upgrades. And to get to zero downtime, you have to start diving into things like: how do I run one Nova API with version A and another Nova API with version B in the same cloud? That gets really tricky when you dive into the way Nova works — the way Nova versions the RPC messages, but also changes the content within a message without any sort of versioning along with it. Good times.
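This is not Nova's actual RPC plumbing — just a toy sketch of the shape of the problem. When every payload carries an explicit version, a consumer can keep serving both old and new senders during the mixed-version window of a rolling upgrade; when the content changes without a version bump, there's nothing to key on. The message fields here are invented.

# Toy illustration (not Nova's real RPC layer): a consumer that accepts
# messages from both an old and a new producer because every payload carries
# an explicit version, and the handler downgrades gracefully.
def handle_instance_update(msg):
    version = msg.get("version", "1.0")   # old senders may omit it entirely
    payload = msg["payload"]
    if version.startswith("2."):
        # 2.x senders split the flavor into its own structure.
        flavor = payload["flavor"]["name"]
    else:
        # 1.x senders inlined a plain string field.
        flavor = payload.get("flavor_name", "unknown")
    print(f"instance {payload['uuid']} now uses flavor {flavor}")

# One cloud, two code levels talking at once -- exactly the mixed-version
# window you live in during a rolling upgrade.
handle_instance_update({"version": "1.0",
                        "payload": {"uuid": "abc", "flavor_name": "m1.small"}})
handle_instance_update({"version": "2.1",
                        "payload": {"uuid": "def",
                                    "flavor": {"name": "m1.large"}}})

The pain described above is exactly the case where the payload changed but the version did not, so there is nothing for the else branch to key on.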
We also have to deal with database migrations, and with when and where to restart our services. We had a really fun deployment about a year and a half ago: it was maybe a month's worth of change, and we didn't think it was that big a deal. It rolled through pre-production just fine. Then we go to deploy to one of our large production environments, and the database migration — I think it was the Nova database — starts, and goes, and goes, and goes, and goes some more. It eventually took about four and a half, maybe five hours to migrate our production database. And the way Nova works, while that migration is happening you can't really have any services running. So our cloud was just: nope, sorry, go away, for four-plus hours. That's a really long time to have an outage, and it hurt a lot. It started us down the path of actually checking how long migrations are going to take when changes are made upstream, and checking how long they take against a copy of our backups, so that come deploy time we have a much better idea of how much outage we're facing and whether we can adjust the migration ahead of time to make it hurt less.

The other takeaway here is: prune your data. OpenStack is notorious for soft deletes, so if you can, please prune your data before doing migrations. One of the big reasons that migration hurt is that we had a very large cloud that had been running for a year and a half or more at that point, with hundreds of thousands of customers doing whatever they do with a cloud — which is lots and lots of creates and lots and lots of deletes — and we didn't really have a policy for pruning the instances table. That thing was huge; it was billions of records, or something odd like that. If we had had a policy in place for when we can prune, how much we can prune, and where we do long-term storage of the stuff we pruned out, it would have been much less of a problem. There are a lot of headaches you can fix beforehand if you really plan around the fact that once you have it installed, you're going to have to upgrade it.
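To make the pruning idea concrete, here is a rough sketch of an archive-then-purge pass over soft-deleted instance rows. The connection details, retention window, and even the table and column names are assumptions — verify them against your own schema and release, and rehearse against a restored backup before going anywhere near production.

# Sketch: archive-then-purge soft-deleted instance rows older than a cutoff.
# Connection details, the retention window, and the schema names are
# assumptions -- verify against your own release, and rehearse on a restored
# backup before touching production.
import pymysql

RETENTION_DAYS = 90

def prune_instances():
    conn = pymysql.connect(host="db.example.com", user="nova",
                           password="secret", database="nova")
    try:
        with conn.cursor() as cur:
            # Copy the doomed rows somewhere you can keep long term.
            cur.execute(
                "INSERT INTO shadow_instances SELECT * FROM instances "
                "WHERE deleted != 0 AND deleted_at < NOW() - INTERVAL %s DAY",
                (RETENTION_DAYS,))
            cur.execute(
                "DELETE FROM instances "
                "WHERE deleted != 0 AND deleted_at < NOW() - INTERVAL %s DAY",
                (RETENTION_DAYS,))
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    prune_instances()

Later releases grew tooling for archiving soft-deleted rows; if your release has it, prefer that over hand-rolled SQL.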
One thing that also touches on operations is how your users interact with the cloud you've built. Having stamped out a lot of these small clouds, this is something that for a long time landed in my lap and now lands in our support team's lap. The kinds of questions I would get from users all the time — and I actually can't believe how big this one is — start with: how do I create images? A lot of people on a Red Hat-based release don't know how to create an image. They're coming from a legacy IT shop and they want you to usher them along. So show them tools like Packer or diskimage-builder, show them how to use them, and make it self-service for them.

Another one, and I think this is a frustration both from an operator's perspective and from a user's, is the CLIs. There's not a lot of consistency among the CLIs. For instance, I do a "nova list" to show instances, but a "glance image-list" to list images — why is it not "nova instance-list"? There are small things like that, and then things like quotas: if you've ever used quotas, they demand a tenant UUID, but they will accept a tenant name — and the tenant name does not work for adjusting a quota. Those are some of the customer-facing ones that get people angry on mailing lists, and there are a lot more users than there are operators.

Under the covers there are operator-facing CLI things that are just awful too. In the cloud I was helping deploy, we had five major databases we'd have to migrate during a deployment, and there were four different ways of migrating those databases. Every single database, every single service, was a special snowflake in our automation, which made it very difficult to do some of the fun things we wanted to do: when every single thing has to be an exception, it just adds more and more lines to your automation tools, and it gets harder to make any change.

At the same time, we have a lot of users, again coming from a legacy IT background, who want to use Horizon as their interface to OpenStack. Horizon will expose things they perhaps shouldn't see — they shouldn't be messing with networks, for instance — or it won't provide some functionality they want to have. I once had a customer we were doing a migration for, and I said, well, what we'll do is take those instances off this host and retarget them to a different host. He said, I can't do that — I've never seen that tab in Horizon. I said, let me show you the CLI, and we can actually move these instances over.

And then the other one we get all the time is this one: "No valid host." Who's seen "No valid host" before? And who's ever figured out, just from looking at that, exactly what the problem was? Really? Can we hire you? Because you must know your cloud very intimately. It takes debugging, and from a user's perspective it's a terrible experience. People will try to spin up — we have customers who use static IPs, or who target an availability zone that isn't present — and they get this one error that must wrap dozens of actual underlying errors. The way we deal with this is log aggregation through Logstash, exposed through Kibana, so a support person can look at it and say: oh, this request for this instance — this is exactly what happened.
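If you don't have the aggregation stack yet, even a poor man's version of the same idea helps: walk the service logs and pull every line mentioning a request or instance ID. The log locations below are assumptions — point it at wherever your deployment actually writes logs.

# Sketch: the poor man's Kibana -- search the service logs you care about for
# every line mentioning a request or instance ID. Paths are assumptions.
import glob
import sys

LOG_GLOBS = [
    "/var/log/nova/*.log",
    "/var/log/neutron/*.log",
    "/var/log/glance/*.log",
]

def trace(needle):
    hits = []
    for pattern in LOG_GLOBS:
        for path in glob.glob(pattern):
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if needle in line:
                        hits.append(f"{path}:{lineno}: {line.rstrip()}")
    return hits

if __name__ == "__main__":
    for hit in trace(sys.argv[1]):     # pass a request ID or instance UUID
        print(hit)

It won't stitch the Nova request ID to the Neutron one for you — that's the grep-down-the-rabbit-hole part we get to next — but it at least gets every service's view of one ID in a single place.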
But even at that, it's not really that great. We plan to extend it to users, so they can have a dashboard in Horizon, or in our own custom UIs, that shows them the lifecycle of a VM. But again, this is something we should fix in OpenStack itself. The Rackspace solution wasn't quite as elegant: we just shunt all of our logs to the same server, and grepping for an instance ID or a request ID is a thing that we do. Of course those request IDs are unique to a particular service: Nova has a request ID when you ask it to build a server, but it's going to ask Neutron for a network, and Neutron is going to have a different request ID. So you end up playing grep-down-the-rabbit-hole to find out what actually happened. There's no real easy way to instrument a process from the end user's request all the way through to fulfillment or error.

One thing OpenStack has taught me is that when these things fail, they fail dramatically — for certain definitions of fantastic, I guess. There was a talk earlier that described it as many hours of boredom punctuated by moments of sheer terror, which I think sums it up completely. Most of the time things are working really well, and then all of a sudden something happens and you go, "what?" — in this case, the train arrived at the station.

Something else that was pointed out in earlier sessions, and I think it's true, is that you need people who are both very broad and very deep. We're dealing with lots of loosely coupled technologies, and you need someone who can dive in and figure out where the problem is across the entire stack, which is typically a very hard thing to do.

So let's talk about one of the failures we had. This was one of the minor ones, but it's still fun. When we were on Havana — Neutron Havana specifically — we decided we were going to take a stable update: move forward on our stable branch, deploy Neutron, and things would be great because we'd pick up all these fixes. One thing that was not clear to me is that, apparently, in stable branches we change default configurations. What that meant for us is that when the defaults changed for Neutron, all the agents started flapping — and only because we were setting one of the options, not both. The two options are agent_down_time and report_interval. As an operator, I don't think we should ever change defaults in a stable branch, but this is what we do.

It's another great example of broad-but-deep. We have to have broad knowledge of how to deploy lots of independently moving services, how to manage packaging and upgrades and databases and all of that, but we also have to have deep knowledge of things inside Neutron — like what agent_down_time actually means, or what the agent check-in interval means, and what real impact it has if those two numbers diverge. How do you have that very specific knowledge and also the knowledge to do everything else above it?
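A check like the following sketch would have caught our flapping agents before they hit a customer. The usual guidance is that agent_down_time (how long the server waits before declaring an agent dead) should be a comfortable multiple of report_interval (how often agents check in). The file path, section names, and fallback defaults here are assumptions that vary a bit by release.

# Sketch: sanity-check the two options that bit us. Path, section names, and
# fallback defaults are assumptions; adjust to match your release.
import configparser

def check(neutron_conf="/etc/neutron/neutron.conf"):
    cfg = configparser.ConfigParser(interpolation=None, strict=False)
    if not cfg.read(neutron_conf):
        print(f"could not read {neutron_conf}")
        return
    # Server side: how long before an agent that hasn't checked in is "dead".
    down_time = cfg.getint("DEFAULT", "agent_down_time", fallback=75)
    # Agent side: how often agents check in. Section name varies by release.
    report = None
    for section in ("agent", "AGENT", "DEFAULT"):
        report = cfg.getint(section, "report_interval", fallback=None)
        if report is not None:
            break
    if report is None:
        report = 30  # assumed default; check your release's docs
    if down_time < 2 * report:
        print(f"WARNING: agent_down_time={down_time} is less than twice "
              f"report_interval={report}; expect agents to flap")
    else:
        print(f"OK: agent_down_time={down_time}, report_interval={report}")

if __name__ == "__main__":
    check()

Running something like this against the config you're about to ship, with the new release's defaults in mind, is cheap insurance against a stable-branch default quietly changing underneath you.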
And more Neutron — I'm not trying to pick on Neutron here, but: no matter what you're configuring in OpenStack, it's typically complex, and the lesson we've learned is to ship only the config you need, not the options that can get in the way, and — like Jesse said — to really understand the config options you are using. This is one that bit us really, really hard. We decided some time ago that we were going to move to VXLAN. We did all the work, we got it all working, and we went to deploy it to one of our first customers — and things went horribly, fantastically wrong.

There are two options in the Neutron config — and you can see this is now fixed; I pulled this from the latest source code, I believe. One was enable_tunneling, and the other was tunnel_types, where you could select VXLAN, GRE, whatever. We were in the midst of wanting to deploy VXLAN, so we went into the code and wrote tunnel_types = vxlan, and at the same time we set enable_tunneling = False. You would think that would do nothing. It did something much different than nothing. What it ended up being was 45 minutes of entire-data-center downtime for our hundreds of customers, and hours of downtime for that one particular customer. It ends up creating VXLAN tunnels and, in effect, looping the network over and over again.
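This is the kind of mismatch that is trivial to catch before it ships. Here is a rough sketch that flags the exact combination that bit us; the file path and the [ovs]/[agent] section names match the OVS agent configs of that era, but treat them as assumptions for your release.

# Sketch: catch tunnel_types asking for vxlan while enable_tunneling says
# False (or the reverse). Path and section names are assumptions.
import configparser

def check(path="/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini"):
    cfg = configparser.ConfigParser(interpolation=None, strict=False)
    if not cfg.read(path):
        print(f"could not read {path}")
        return
    enable = types = None
    for section in ("ovs", "OVS"):
        if enable is None:
            enable = cfg.getboolean(section, "enable_tunneling", fallback=None)
    for section in ("agent", "AGENT"):
        if types is None:
            types = cfg.get(section, "tunnel_types", fallback=None)
    enable = bool(enable)
    types = (types or "").strip()
    if types and not enable:
        print(f"WARNING: tunnel_types={types!r} but enable_tunneling is False")
    elif enable and not types:
        print("WARNING: enable_tunneling is True but no tunnel_types set")
    else:
        print(f"OK: enable_tunneling={enable}, tunnel_types={types!r}")

if __name__ == "__main__":
    check()

A check like this belongs in whatever gate your configs pass through before they reach a customer environment.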
All right, I'm not going to pick on Neutron any more; I'm going to pick on a few other things. In our wonderful world of scale, we ran into a couple of problems we had to figure out how to address. nova-compute is this wonderful thing that runs on all of your hypervisors, and it's kind of stateless but kind of stateful. It gets away with being restarted in that, if you stop it, it forgets all of its state, and when you start it, it wants to remember all of its state. The way it remembers is by going to the database and saying: tell me my state, and tell me the state of everything that happens to be sitting underneath me. Doing that on one nova-compute with five VMs underneath it is not a big deal. We were doing it on 7,000 hypervisors, many of which had hundreds of VMs underneath them. In our ever-going effort to do our deployments faster, we got better automation, which let us do this to more things at a time — and we ended up killing our database, because all the nova-computes would come up and say "tell me everything," and the database would just lose its mind. So that was fun, and we had to learn how to do a much better job of bringing up our services.
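"A much better job" mostly means batching. Here is a minimal sketch of the idea — bring nova-compute back in small slices with a pause between them, instead of all at once. The hostnames, batch size, and restart command are placeholders for whatever your environment actually uses.

# Sketch: restart nova-compute in small batches so a few thousand services
# don't all ask the database for their state in the same second.
# Hostnames, batch size, and the restart command are placeholders.
import subprocess
import time

def restart_in_batches(hosts, batch_size=20, pause=60):
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            # ssh out and bounce the service; swap in whatever your init
            # system actually uses.
            subprocess.check_call(
                ["ssh", host, "sudo", "service", "nova-compute", "restart"])
        print(f"restarted {len(batch)} computes, sleeping {pause}s")
        time.sleep(pause)

if __name__ == "__main__":
    with open("compute_hosts.txt") as f:
        restart_in_batches([line.strip() for line in f if line.strip()])

In practice this is what the batching knobs in your orchestration tool are for — Ansible's serial setting, for example; the point is that "restart everything" has to become "restart a slice, wait, repeat."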
We also had a lot of fun with Glance. Glance has a few interesting properties. First off, the Glance API acts as an intermediary between your image storage and the hypervisor that wants to use an image. Every time a hypervisor, through nova-compute, asks for an image, Glance figures out what the URL is, fetches it from that source, and passes it through — so your API ends up being an intermediary for the data coming across as well. Which means that if you're fetching a whole lot of images — say, for 10,000 hypervisors — you have to scale out your API nodes. For some interesting reasons, we ended up having to list out all of the API nodes in each nova-compute config file, so that if one failed it would find the next one, and the next one, but always with a specific address to get to. That meant that any time we needed to scale out our API and add a new one, we had to go touch configuration files across all the nova-computes. See problem A.

A little more fun with Glance: we thought it would be a good idea to rotate the passwords for some of our services, one of them being Swift — Swift being where we store our images. So we updated the Swift password, and then we updated the password in Glance. Everything looked great to start with, and then we started getting a few build errors for some of our older images. We couldn't figure it out, until we eventually discovered that Glance, in its infinite wisdom, was storing the path to each image in the database — and that path included the password. So you either get to trawl through your database and change the password on every image, or you just don't rotate your passwords. I think this is still an open bug: it was kind of fixed, and then somebody said, "but wait, what if I have multiple stores with different passwords?" — and we just sigh.

And then, finally, the other thing you have fun with at scale is the cost of introducing a new feature. In our effort to make our upgrades better, we wanted to make use of the conductor for compute. An external conductor shields the computes from database writes: instead of talking directly to the database, computes send a message request to the conductor, and the conductor talks to the database on their behalf. It also shields them from internal object version changes: if compute gets an object from, say, Nova version five, but that compute is running version four, it can ask an external conductor to please translate it into something it can understand and send it back, and as long as your conductor can speak both five and four, that works — a really useful feature for doing rolling upgrades.

But there's a bit of a cost. It takes all of your computes, which were independently writing to the database, and shunts them all through the same thing to write to the database. So you should probably have more than one conductor. You should probably have more than two. And when you're dealing with 600-some-odd hypervisors per cell, you should probably have three or four or more, we think. When I went to stand this up in production, after making sure it worked really well in pre-production, we ran into a problem I mentioned earlier: every service gets its own VM, and we needed to isolate these per cell, so that only the computes in a particular cell would talk to a particular conductor, because they all use the RabbitMQ that's per-cell. That meant I needed to spin up, at a bare minimum, two conductors per cell, which in one of our larger environments meant spinning up, I believe, 50 new instances in our internal cloud, with at least two CPUs each. Unfortunately it was one of the clouds that had been around for quite a long time and hadn't seen a capacity increase in a while — because, hey, cloud is infinite, right? Of the 50 I needed to create, I think I got four before I ran out of capacity. So now we have to make a significant, or at least somewhat significant, investment in the undercloud in order to roll out a new feature that saves us a few minutes here and there in our operations. As you scale up, you have to worry about how you will scale out the thing you use to run your cloud along with the capacity of the cloud itself.

Then I tried to come up with a couple of gotchas we've run into along the way. One of them was DNS resolution and hostnames: make sure DNS resolution and hostnames are always correct. Your stack will look like it's working, and it will behave like it's working, until it doesn't. In particular, we had one pretty way-out bug in our Ansible code that was quietly writing the wrong hostname into the /etc/hosts file. All kinds of things on the back end will be affected by that.
So be aware of that, and make sure it's right. We had our own fun with DNS. A while ago we decided we wanted to start using DNS, instead of raw IP addresses, for addressing all of our control plane — names are a little easier to say and a little easier to reason about. At the same time we were building up the automation for stamping out our new cells and environments, and part of that automation was: launch an instance in our undercloud, and once you get the IP address it was assigned, create a DNS record for it. Well, the fun thing about DNS A records is that you can have more than one of them — so sometimes we got more than one of them. Part of launching our instances is embedding an SSH key, and there wasn't a whole lot of other validation that I'm talking to the thing I want to talk to, beyond it accepting my key. So occasionally we ran into some really fun situations where our Ansible automation would try to connect to a system it thought was maybe a nova-compute: it looks up the DNS record, picks one of the addresses it got back, and logs in. Sometimes it wouldn't work, and that was okay — we always tolerate systems we can't connect to, because at the scale we were at there are always some you can't connect to. But sometimes it would connect, and everything would be great, until we started doing somewhat destructive things — and then we realized: oh, I'm not actually talking to a nova-compute, I'm talking to a database node. Boy, it's a really good thing we have database backups. After that we went down the path of doing audits to make sure we only ever have one A record per system, and that the A record points to the system we think it points to. It's really important that DNS is doing what you think it's doing.
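That audit is simple enough to sketch. This checks that every control-plane hostname resolves to exactly one address, and that it's the address your inventory says it should be; the inventory mapping here is a placeholder for whatever source of truth you actually use.

# Sketch: audit that each hostname has exactly one A record and that it
# matches the inventory. The inventory mapping is a placeholder.
import socket

INVENTORY = {
    "compute01.example.com": "10.0.0.11",
    "db01.example.com": "10.0.0.21",
}

def audit():
    for host, expected in INVENTORY.items():
        try:
            _, _, addrs = socket.gethostbyname_ex(host)
        except socket.gaierror as exc:
            print(f"{host}: does not resolve ({exc})")
            continue
        if len(addrs) != 1:
            print(f"{host}: {len(addrs)} A records {addrs} -- expected one")
        elif addrs[0] != expected:
            print(f"{host}: resolves to {addrs[0]}, inventory says {expected}")
        else:
            print(f"{host}: OK ({addrs[0]})")

if __name__ == "__main__":
    audit()

We run this kind of audit before anything destructive; one stray A record is all it takes to run "destructive things" against the wrong box.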
Another one we've run into is service ordering. Because we're not using distro packages, we're crafting our own upstart scripts, and if you don't have those right, you won't know until you reboot the host. In one particular case we were restarting the instances after a reboot before the networking came up, and as soon as you do that the instances are unreachable and probably have to be rebooted anyway. It's fixable, but it's not optimal.

Service ordering is also important not just when you're rebooting a machine that runs a bunch of services, but when you're orchestrating how you upgrade your services in a rolling manner. We recently introduced graceful shutdown of computes, and in doing that we wanted to address each of the services in a somewhat logical order. We thought it would be great if, on the way to doing a migration, we shut down all the API nodes first so that no new requests come in, and then shut down all the computes gracefully so they could finish whatever they were doing, and then we could get on with our day. So we shut down all the services, then we asked all the computes to gracefully shut down — and the computes took a really long time. We couldn't figure out why, until the light bulb went on and somebody said: hey, don't those computes that are doing things need the control plane to do them? So, yeah, we changed that order. Ordering how you bring your services down and up really does have an impact. As much as everybody likes to say services are magical — start them all up and they'll sort themselves out — it doesn't really work that way. Order is important.

The next two I think we've already talked about. Logs: I think everyone knows logging is a problem — in fact, there was a session on Monday about rethinking how we do logging — and I'd recommend using some kind of log aggregator at the very least. And actions are hard to trace across the stack; we talked about that before as well. RPC failures: this used to happen to me much more than it does recently, but it's very difficult to find the status of your services when RPC is not working correctly. One technique I've used there is just to use nova-manage, because you're going directly to the source of truth — nova-manage service list, or something like that.
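As a rough sketch of that technique: shell out to nova-manage service list and flag anything not reporting in. The exact output format has varied across releases — the classic output marks dead services with "XXX" — so treat the parsing as an assumption to adapt.

# Sketch: when RPC is misbehaving, go to the source of truth. Flags services
# not checking in; the "XXX" marker is the classic output format, so adapt
# the parsing to whatever your release prints.
import subprocess

def down_services():
    out = subprocess.check_output(
        ["nova-manage", "service", "list"], text=True)
    return [line for line in out.splitlines() if "XXX" in line]

if __name__ == "__main__":
    dead = down_services()
    if dead:
        print("services not checking in:")
        for line in dead:
            print("  " + line)
    else:
        print("all services reporting")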
One thing that bit me recently is database backups. Who here thinks backups are a good idea? Every hand should be in the air, right? So we're all going to have something like this. We had three weeks in a row where, on Friday at 9 p.m. my time in Boston, for this one customer, Keystone would just fall over on itself, and I had no idea what was going on. It turns out we had an errant cron job on there — it was our first customer and we hadn't cleaned it up — and we just had backups running on top of backups on top of backups. As soon as that happens, all bets are off. If Keystone's not happy, ain't nobody happy.

We had our own fun with backups; it was similar. We had an environment where — and I'm going to pick on it — Neutron wasn't able to deliver IP addresses in a reasonable amount of time, and nobody could figure out whether the environment or the service was the problem, because the problem seemed to follow our versions as we bumped them, yet we didn't experience it in some of our other environments. So we had a really hard time validating the environment and validating the new code, because sometimes it would fail and sometimes it would work. We couldn't figure it out until we started doing some real deep analysis — we didn't have a very good correlation system, which would have pointed this out quickly. What was going on is that we had a scheduled database backup that hit every hour, and in this particular environment we didn't have a slave database set up for that backup to hit, so the backups were hitting the production database. When the backups hit the production database, Neutron would get unhappy with its database connection, slow down, and stop giving out addresses. So any build happening in that hour — or in the 15 minutes after the hour that the backup took — would just have a bad time. Only one environment was missing the slave, that's the only environment where we could see it, and it only happened on the hour. Finding those kinds of things is really difficult. Since we all agree backups are good, you have to be mindful of what their impact is going to be — and mindful of not running four backups of the same database at once.

And the last one, which annoys me a bit, is database modification. There are many operations you undertake as you operate a cloud where you're going to have to modify the database. The one we ran into most recently: we deploy Cinder with a default backend, and we don't typically define multiple backends, but in this particular case we were adding an additional backend. As soon as you do that, there's no easy way to migrate to two backends instead of one. If you look at the Cinder database, the host will typically be the hostname, an at sign, and the name of the backend. What that meant for us is that as we added the new backend — sure, the new backend works, but we had to go back and retrofit all of the volumes that were already assigned to the previous backend. Not optimal. Fixable, certainly, but not optimal. So, who here thinks it's a really good idea to have lots of people with direct database write access to your cloud databases? Great — we're all on the same page.
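For what it's worth, the retrofit we're describing boils down to a single, scary UPDATE. This is a sketch, not a recommendation: all names are placeholders, the host-to-host@backend convention is as described above, and before running anything like it you should take a backup, stop the relevant services, and check whether your release already ships a cinder-manage command for this rather than reaching for raw SQL.

# Sketch, not a recommendation: retag existing volumes after adding a second
# Cinder backend, following the host -> host@backend convention described
# above. All names are placeholders; back up and verify your schema first.
import pymysql

OLD_HOST = "volume01"            # what the volumes are tagged with today
NEW_HOST = "volume01@lvmdriver"  # host@backend, matching cinder.conf

def retag_volumes():
    conn = pymysql.connect(host="db.example.com", user="cinder",
                           password="secret", database="cinder")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE volumes SET host = %s WHERE host = %s AND deleted = 0",
                (NEW_HOST, OLD_HOST))
            print(f"retagged {cur.rowcount} volumes")
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    retag_volumes()

Doing this by hand is exactly why the direct-database-access question gets the laugh it does: keep it rare, keep it scripted, and keep a backup.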
And these are the references — actually, the second link is wrong, so you can just tweet at me if you'd like the correct one; I only realized that just now.

As much as we've kind of crapped on OpenStack, I did mention to Jesse earlier, when we were putting this talk together, that I was having trouble coming up with really, really bad problems. When we wrote the abstract I figured that over the next couple of months we'd come up with more stories about things that had gone wrong, but for the most part I've been sleeping — so that's a good thing. And on the Rackspace side of things, a lot of our pain came from our success: we had problems from having a cloud that's really, really big, but we wouldn't have a really big cloud if it weren't a really awesome thing that people want to use. So even though we have a lot of problems, we're still grateful that we're working on pretty awesome stuff; we've just got to make it even more awesome.

About that wrong link — "just tweet at me," ha, that's an engineering solution right there. It was in a different repo; I planned to move it over and totally forgot, so I'll get that sorted, but it will be there.

And the obligatory we're-hiring slides: if you want to help us make OpenStack awesome for small or big clouds, come see us. This is the last session of the day — we're no longer standing between you and beer — but if you have questions, feel free to come on up and we'll talk them out.