All right, good almost-afternoon to everybody. My name is Chris Blumentritt. I'm a systems engineer with Rackspace, and today I'm going to talk about our path to less impactful deploys. We deploy relatively frequently and we stay pretty close to trunk: we deploy about once a month on average, and we're deploying to six regions.

So what do things look like now, or in the recent past? It was a pretty standard operation: we would call a maintenance window, deploy our code, and go from there. Taking a closer look at that: before our maintenance window, we actually stage all of our code out to the servers, and that code consists of Python virtual environments and the configuration management to make the change when we upgrade. Once we enter our maintenance window, we stop all the services, switch to the new version of the code, migrate the databases, and then start the services back up. We start the control plane first, and then we move on to the computes.

So why are we stopping all these services? Well, we've got a couple of things going on during the upgrade.
We're going to have services on the old version of the code and services running the new version, and when you have a mismatch between services, strange things happen: you'll see things randomly fail, and it's just a big problem. So we want to prevent those mismatches. There are also the database schema changes: if a service tries to record some data to the database mid-migration, you might run into an error there. And a side effect of all this is that when we stop the services, any tasks that are running are dropped, so that leads to failures.

Let's take our user Ada. If she boots a server directly before the maintenance starts, she's eventually going to come back to a status of ERROR. Her task is going to be interrupted, and that's bad news. During the maintenance window, when everything's down, she'll get "service unavailable" or "no connection" or something along those lines. As you might imagine, this makes people relatively irritated when it happens to them. So what can we do about that?
Well, there's the conductor service, which has been around for a while. I think initially it was only there to handle database access, but now it can also translate versions of the objects that are being passed around to do whatever tasks need to be done. So: translate objects between various services, and proxy database access. What this allows us to do is separate the upgrade of our control plane from our computes, and this is important because the computes are the major portion of the environment.

Let's take a look at translating the versions. Say you've updated your control plane and left the computes on an older version, so you have the control plane at version 1.3 and your computes at version 1.2. A request comes in and gets scheduled down through the scheduler and the cells and all that, down to the computes. The compute unpacks the message and sees that the object version is newer than it can handle, so it raises an exception and hands the object off to the external conductor. The conductor backports it down to a version the compute can handle and passes it back to the compute, and whatever task needs to happen, happens.

Proxying database connections is pretty standard: when the compute does whatever task it needs to do, it notifies the conductor when it's done, and the conductor updates the database accordingly. So this conductor has become a pretty important piece of the infrastructure pretty quickly.
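The version-translation flow described above can be sketched roughly like this. This is a toy illustration, not actual Nova code: the class names, the per-version field table, and the version strings are all invented for the example.

```python
class Conductor:
    """Backports objects to a version an older service can consume.

    A backport here just drops fields that were added after the target
    version; real Nova objects carry richer per-field version metadata,
    which this sketch approximates with a hypothetical table.
    """

    # Hypothetical table: which fields exist at each object version.
    FIELDS_BY_VERSION = {
        "1.2": {"id", "name", "state"},
        "1.3": {"id", "name", "state", "pinned_az"},  # field added in 1.3
    }

    def backport(self, obj, target_version):
        allowed = self.FIELDS_BY_VERSION[target_version]
        data = {k: v for k, v in obj["data"].items() if k in allowed}
        return {"version": target_version, "data": data}


class Compute:
    """A compute that only understands objects up to max_version."""

    def __init__(self, max_version, conductor):
        self.max_version = max_version
        self.conductor = conductor

    def handle(self, obj):
        if obj["version"] > self.max_version:
            # Too new for us: hand it to the conductor for backporting.
            obj = self.conductor.backport(obj, self.max_version)
        return obj  # proceed with the task using the (backported) object


# Control plane at 1.3 sends a 1.3 object to a 1.2 compute.
compute = Compute("1.2", Conductor())
new_obj = {"version": "1.3",
           "data": {"id": 7, "name": "ada-vm",
                    "state": "BUILD", "pinned_az": "dfw"}}
result = compute.handle(new_obj)
print(result["version"])  # the compute's own version after backporting
```

The key property is that the compute never has to understand the newer schema; anything it can't parse round-trips through the conductor first.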
So we've got to make sure that it's going to work. Conductors by default have one worker per CPU on whatever server they're running on, and you can scale the conductors out horizontally, adding more conductors as needed. You're going to want to watch the size of the queue the conductor consumes from: if messages are coming in faster than they can be consumed, you're going to run into some problems, and you should add some more conductors. And of course you're going to monitor your conductor CPU and memory, and all the standard things you do when monitoring servers.

One more piece we've added is the graceful shutdown of compute. Before, we were sending a kill signal to everything, so any tasks that were going were just dropped on the floor. Instead of doing that, let's use the init script and shut down gracefully for a change. This will allow some quick-running tasks to finish: booting an instance, resetting a password, and so on and so forth. But long tasks, like image snapshots or resizes, will still be left on the floor.

So with these pieces in place, let's re-orchestrate. And I guess I should pause here and talk about how we're doing everything, since we're using Puppet in a masterless fashion.
We need something to kick off everything, and we're using Ansible to do that. We stage the code like before, so ahead of our maintenance window it's all ready to go on the servers. Then, once we enter our maintenance window, it's time to stop just the control plane, and this is a subset of all the servers: we're going to be stopping tens of servers instead of thousands of servers, so this will happen a lot quicker. Then we switch to the new code, run our migrations as we need to, and start up the control plane. At this point we're receiving requests again and the computes are able to handle those requests, and we can now move on to a more graceful upgrade of all of our computes. So the control plane outage is a lot smaller, but the upgrade of the computes will take longer, maybe not a lot longer, because we're doing it more gracefully, allowing tasks to finish.

So now what are we looking at? Well, if Ada boots her instance directly before the maintenance starts, there's still an opportunity for failure there, but since the window is so short, there's a good chance of success. Again, if she makes the request during the control plane outage, she's still going to get a 503 or a connection refused. And if she makes the request during the compute upgrade, there's a high probability that it will go through, again because of the graceful handling of compute shutdowns. So we're better now: there's still some impact, but a lot less impact.

And that's where we are right now. We were hoping to have it done before the summit, but some things came up, so it's still in our testing environments. We're still on the road; the road hasn't ended yet. So let's look down the road: what can we look at next?
Well, instead of just applying this to computes, we can start rolling our other services into it. One reason why we haven't done the other services yet: I was talking to one of the developers, who said the API still doesn't quite do conductor right, so that's all I needed to hear. Once that gets cleaned up, we can apply this to the other services and allow them to handle newer versions than they know about.

Database migrations: depending on the migration, it can go really fast or take a long, long time. We've seen both. We usually see fast, but sometimes we see minutes, even tens of minutes on occasion. So what can we do about that? We can look at the expand/contract pattern of database migrations, wherein you have a script that expands the database up to what the new version can handle, but does not mangle the data or change the schema in a way the old version can't work with. So both the new version of the code you're upgrading to and the old version you're upgrading from can talk to the same database, and you won't have an outage there.

And then lastly, there are the interrupted tasks. I've seen a couple of things at previous design summits where people proposed talks on that. I believe it was NTT, and I can't remember who else, that proposed something for TaskFlow. That's working with some services, and the Nova services are on their radar to do next. So if we can have some sort of functionality to pick up tasks that are interrupted, there's less impact there. As far as the migrations go, we have some people starting to look at that within Rackspace, and hopefully we'll have something soon. So how might you do a control plane upgrade now?
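Here's a minimal sketch of the expand/contract idea, using an in-memory SQLite database as a stand-in and splitting one column into two. The table and column names are invented for illustration; a real deployment would run this through the project's migration tooling against the production database.

```python
import sqlite3

# In-memory stand-in for the production database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, host TEXT)")
db.execute("INSERT INTO instances (host) VALUES ('cell1.rack42')")

# --- Expand (run BEFORE the deploy): add the columns the new code
# wants and backfill them, but keep the old `host` column intact so
# the old code keeps working against the same schema.
db.execute("ALTER TABLE instances ADD COLUMN cell TEXT")
db.execute("ALTER TABLE instances ADD COLUMN node TEXT")
db.execute("""UPDATE instances
              SET cell = substr(host, 1, instr(host, '.') - 1),
                  node = substr(host, instr(host, '.') + 1)""")

# During the rolling upgrade, old code reads/writes `host` while new
# code reads/writes `cell`/`node`; no outage needed.
expanded = db.execute("SELECT host, cell, node FROM instances").fetchone()
print(expanded)  # ('cell1.rack42', 'cell1', 'rack42')

# --- Contract (run AFTER every service is on the new version): drop
# the legacy column. A table rebuild works on any SQLite version.
db.executescript("""
    CREATE TABLE instances2 (id INTEGER PRIMARY KEY, cell TEXT, node TEXT);
    INSERT INTO instances2 SELECT id, cell, node FROM instances;
    DROP TABLE instances;
    ALTER TABLE instances2 RENAME TO instances;
""")
columns = [row[1] for row in db.execute("PRAGMA table_info(instances)")]
print(columns)  # the legacy `host` column is gone
```

The point is the ordering: expand is backward-compatible and can run with services live, and the destructive contract step waits until nothing on the old version remains.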
Well, you can roll through the front-end services: the front-end APIs and the schedulers and whatnot. APIs are typically behind a load balancer (for us they're in a load balancer), so take one out of the load balancer, upgrade it, shove it back in the load balancer, and you're good to go, one at a time. You could also do some sort of A/B, old/new situation: your old services are running, you prepare some new services, and at deploy time you switch to those new services and you're good to go.

In this scenario, if Ada makes her request directly before the maintenance starts, the API will take it, and since we're stopping the services in the control plane in a rolling fashion, there will always be something listening, ready to do whatever work needs to be done, and she'll eventually get her active server. Now if she makes the request while the control plane is being upgraded, again, there's still something listening at all times, so that request will get handled: active server, hooray.

And in the third case, during the compute upgrade, if we can move to a place where we can pick up tasks that have been interrupted, then say her server is provisioning on a hypervisor and the compute service on it stops. It starts back up, sees that it's got a task going, and picks it up and continues. Or in the case of image snapshots, which can take a long time depending on the size of the image, it picks that image back up and continues on with the snapshot, and you're good to go. That makes for some pretty happy customers. And I don't even know if happy is the right word: if it just works, then you're content, right? It's all good. Either it works or you're mad.

So, the future. If you have all this stuff in place, where can you move next? And this is way down the road.
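The rolling load-balancer pattern above can be sketched like so. The `in_service` set is a toy stand-in for the load balancer pool, and `upgrade` is whatever actually deploys new code to a node; all the names are invented for the example.

```python
def rolling_upgrade(nodes, in_service, upgrade):
    """Upgrade API nodes one at a time behind a load balancer.

    Pull one node out of rotation, upgrade it, and put it back before
    touching the next, so something is always listening for requests.
    """
    for node in list(nodes):
        in_service.remove(node)              # drain: no new requests here
        assert in_service, "pool went empty mid-upgrade"
        upgrade(node)                        # deploy the new code
        in_service.add(node)                 # back into rotation


upgraded = []
pool = {"api1", "api2", "api3"}
rolling_upgrade(sorted(pool), set(pool), upgraded.append)
print(sorted(upgraded))  # every node upgraded, never all out at once
```

The assert documents the invariant that makes this zero-impact for Ada: at every instant at least one API node is still in the load balancer.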
This is when the mountain still looks pretty small: daytime deploys. If you can deploy without impacting your users, then you can deploy during the day and not stay up until three or four in the morning. Sounds pretty good; it's time to start thinking of ourselves. And then, if you're confident in your non-impacting deploys, no impact whatsoever, you can move towards continuous delivery.

To get to this point, you've already got many of the pieces for continuous delivery in place. We pull down the code from the upstream master, we merge in whatever Rackspace-specific stuff, and then we start testing. The unit tests run, then it hits a small environment to run some tests, and if everything is good, it moves on to a larger environment for more integration tests, so on and so forth. So if you're confident in that process, and your deploys don't impact users at all, then deploy all the time. That's, I guess, maybe not the end of the road, but it's a good place to be.

So with that, that's it for me. That went about seven minutes faster than when I practiced it, so there's plenty of time for some questions.

The question was: when are the database migrations happening when you're rolling through services? This is down the road, so it's speculative, but if you can expand your database so that old versions and new versions can handle writing to the database at the same time, then they can happen pretty quickly, whenever.

Is that just something that you have to manage on a migration-by-migration basis, to sort of audit in code review whether the new schema is going to be compatible with old code? Yes.

I had a question about the interruptible tasks: what are you guys seeing today whenever
you're upgrading the compute nodes and, you know, the libvirt layer? Are those API calls still being processed by the backend? The compute controller reboots and comes back up; is it not recognizing that this task is in progress?

You might see the task end up as ERROR, but the requested reboot still happened. So on one hand that's good; on the other hand, that's not good, because it might cause you to reboot a second time. One of the last times we did it, and I don't have the solid numbers, I think we did it on four or five thousand compute nodes, and we saw a hundred and seventy interrupted tasks, or a hundred and seventy hypervisors that had interrupted tasks. So it's not huge, but that's still going to annoy some people if it happens to them.

So the question was: is the graceful shutdown in upstream Nova, or is that something we had to do ourselves? And the answer to that is, I'm not sure. We wrote our own init scripts, and the init script that we wrote does handle it. We just didn't really utilize it, because before, when we were doing the deploys, we just wanted to stop everything as fast as we could: the faster we stop everything, the faster we can start everything. That was kind of the way it worked, and when the environments were a lot smaller, that was okay. One of the other guys on our team figured out a way to determine that. There are two little things there: the base init scripts for Nova, when they're stopping, have a very short timeframe that they allow for the stop.
So when we recreated the init scripts, one of the changes we made on top of the base ones was a rolling timeout window. We can actually set a value and allow it to shut down slowly, and as part of that, it actually queries RabbitMQ, looking into the queue, verifying that it's clearing, and trying to determine whether there's anything long-running or anything short and active, so we can clear it as needed. At this time, no, we have not released them. They're still in testing in our test environment, verifying that everything's working correctly for us to use the conductor to do the actual rolling restarts.

Hi, thank you. If I understood well, there is currently no way to avoid interrupting the service, at least during the upgrade of the database, and I was wondering if we could anticipate the duration of this upgrade, the duration of the database migration.

You can, and we do. The way we anticipate that is, a couple of times a day we'll take a copy of our production database, import it onto a server somewhere else, and then run the new migrations, the migrations that are in the new code, against it. So for each region that we have, we can see how long the migrations will take.

Thank you. We have a deployment with compute nodes... say that one more time... We have a deployment with compute nodes, and separately block nodes with volumes. Instances are connected to the volumes through iSCSI. After upgrading and rebooting, I must re-login the iSCSI sessions for the current instances. How do you do it? So how do we handle
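The rolling-timeout idea, polling the queue to decide when it's safe to stop a service, might look roughly like this. The actual RabbitMQ plumbing is assumed rather than shown (a real check could ask the RabbitMQ management API for the queue depth); here `queue_depth` is any callable returning the current depth.

```python
import time


def wait_for_queue_drain(queue_depth, timeout_s, poll_s=0.05):
    """Poll the message queue depth until it drains or a rolling
    timeout expires. Returns True if it drained (safe to stop the
    service), False if work is still outstanding, which suggests a
    long-running task that will be interrupted."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if queue_depth() == 0:
            return True
        time.sleep(poll_s)
    return False


# Simulated queue that drains after a few polls.
depths = iter([3, 2, 1, 0, 0])
drained = wait_for_queue_drain(lambda: next(depths), timeout_s=1.0)
print(drained)

# Simulated queue that never drains: the timeout wins.
busy = wait_for_queue_drain(lambda: 5, timeout_s=0.2)
print(busy)
```

The timeout value is the knob mentioned in the answer: set it long enough for short tasks to clear, short enough that a stuck snapshot doesn't stall the whole deploy.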
iSCSI connections? So, we're not rebooting our hypervisors; we're just updating the code that's running, so those connections should stay in place.

Why don't you reboot?

Because we don't have to. It's just a service that's running; we're not, you know, doing a kernel upgrade or something, which would be another mechanism. This is just about updating the OpenStack, or Nova, code. Okay, thank you.

We have a team for that. So there is a separate team for that, and what they'll do, well, they're pushing hard for no-downtime migrations of instances, which is currently being tested. The idea there is we would move an instance and then do whatever you need to do against that hypervisor. Now, in our particular environment we're also using Citrix XenServer, so if you're upgrading kernels on those, during a maintenance window you can take just that compute node down; it doesn't affect anything that's running on the hypervisor, and you can upgrade that stuff accordingly. So that's for when the hypervisor needs a kernel upgrade or something along those lines.

A couple of months ago or so, I've seen a lot of talk about that with Rackspace, about how to do that in an emergency fashion. Do you have any examples of conflicting Python modules during an upgrade? Like, oh, we upgraded Glance, but Keystone has a conflicting Python module, and dependency hell?

This guy right here, 330, he's talking... well, no, I can finish that out, though.
We're running stuff within Python virtual environments, with no site packages.

That's actually pretty clever.

So what we're doing right now: there are some Nova config options. There's a conductor section, and you can use local, true or false, and based on that, it will find the correct message bus to use.

Now, the second part of that had to do with migrations, or the database, in order to do the translations right, and that was part of the down-the-road thing: the expand/contract database. Before you do your deploy, if you can, expand your database out to where it'll know how to handle the new version. Let's take an example: maybe you're splitting a column into two columns. You would run the first part of your migration, which adds those two new columns, but your old column that has the data in it is still there. Now, if the developers want to be extra nice, they could write their code to look in both places for it. But if not, what you would have to do is expand out the database, and during the migration time both versions can talk to it. You might have a little mismatch; this is, you know, all speculative. And once you're finished with the migration, you would run a script to contract the database down. That would move the data from your one column into the two columns, in the example I gave, and then, once that's completed, remove the old column from the database. That's how you would get to a point where you can have two different versions of a service talking to a database.

All right, well... I feel like... all right, class, go to lunch early.