My name is Forrest Carpenter. I am one of the engineering managers at Piston Cloud Computing, and this talk, in case you wandered into the wrong room, is called "Upgrading in Place from Grizzly to Icehouse: A Cautionary Tale." There may be some non-zero number of you out there thinking, "This shouldn't even have been an issue. You just stay with whatever the latest stable release is and you never have to worry about this sort of thing." That is a religious war I don't necessarily want to get into, but there is a bit of an Easter egg later for those of you who hold that opinion, and I will point it out. In the case of Piston, we needed to accomplish this upgrade, and it's what almost all of our engineering team was doing from roughly February through July of this year. So this talk is basically going to cover what we had to do to make it work, what the pitfalls were, what lessons we learned, and that sort of thing.

Just to give you a bit of history: in April of 2013, Piston released 2.0, which was based on Folsom. This was shortly before I joined the company in May, at which point they had already upgraded the Piston code to Grizzly, so that in August of that year Piston 2.5 could come out. At that point we were actually current with what had been released upstream: Grizzly had come out in April, around the same time as Piston 2.0, and then 2.5 followed and brought us up to speed. However, as you can probably guess, Havana was on the horizon.

Originally we planned, with the next release after Grizzly, to continue keeping up with the upstream OpenStack releases. However, at the time we released 2.5, we decided there were some other features we really wanted to focus on: expanded disk pool support, dashboard improvements, multi-region support. We'd rather spend our resources getting those features right than just catching up to Havana. So rather than 3.0 being based on Havana, it was also based on Grizzly. We ended up releasing 3.0 in February of this year, 2014, by which time Havana was already a reality and was actually halfway through the release process to Icehouse. I'm also going to stop hitting myself in the microphone battery pack.

So 3.0 released in February, as I said, and we were already deep in the throes of planning the next release, and there was still that lingering ticket: "upgrade to Havana." Who should do that? Should we do it now? There is someone in the audience on whom I will lay all of the blame; he's waving his hand back there frantically. Mr. MacGown, who I think is the person who decided to cross out the word "Havana" and write the word "Icehouse" in there, since Icehouse was due to be released any moment now. So 3.whatever would not be Havana; we would skip right ahead and base it off of Icehouse.

So there's that ticket sitting in our system as we're planning the next 3.whatever-it-would-be release (3.5, in case you wanted to skip ahead), and that ticket got assigned to an engineer. That engineer's responsibility was to take a couple of days, size it, research it, figure out what was going to happen, and then come back so we could actually put it into our planning queue. That engineer, who will remain entirely anonymous for his own benefit, but who is a handsome devil,
as I said, was not around when we upgraded from Folsom to Grizzly and therefore had no first-hand experience doing this. Of course Christopher had that experience, and so did all of our other engineers, so I spoke to all of them, or I should say the anonymous engineer spoke to all of them, and reviewed all of the tickets in our system that pertained to that upgrade from Folsom to Grizzly. And after, you know, soul-searching and analysis, he came up with the estimate: one engineer, two weeks, Grizzly to Icehouse. That would be one engineer, two weeks, to have all the upstream unit tests, all the Piston unit tests, all of Tempest, and all of Piston's functional test suite working and passing, and to be able to upgrade a running, Piston 3.0, Grizzly-based cluster with production workloads to a Piston 3.x, Icehouse-based cluster. One engineer, two weeks. As you can see, the punishment for my crime is that now I'm here and I have to tell you all about it.

So let's start with what actually went well, what went easily, what actually went according to plan: configurations were simple to migrate. We actually partitioned them out; rather than one engineer in two weeks, we handed each service out to a different engineer: Nova, Cinder, Glance, Keystone, Horizon, Quantum. Each one migrated the configuration-generation portion of our orchestration software, no problem. That made us very optimistic, which was not necessarily the best thing for us to be, because we immediately ran into new problems.

Like I said, the point of a Piston release is not to be a monolithic thing that gets installed. I mean, yes, it does install on bare metal, and that's, you know, the point of Piston, but each of our releases is upgradable to the next; that's part of the selling story. So we can't just have an Icehouse release and then sort of make it upgradable afterward; we need to treat the upgrade as part of it from the start. So with our configurations in place we thought, okay, let's just build it and see how well this goes. And yet we couldn't even build the product. The two major reasons we couldn't build the product immediately were migrations and dependencies. The migrations of the configuration files I already talked about; the migrations I'm talking about now are database migrations, specifically the OpenStack services' MySQL databases. Essentially, what we found was that we were skipping a version: we were going Grizzly to Icehouse,
ignoring Havana. The way database migrations are mainly handled, release to release, is that you're expected to be able to upgrade from, say, Havana to Icehouse, or from Grizzly to Havana, or Folsom to Grizzly, but not to jump generations. Moreover, someone apparently decided that instead of keeping all of these chronologically incremental database migrations separated, it would be much, much easier to squash them down into one very large, monolithic database migration. Apparently this had something to do with speeding up unit tests. I don't know who exactly the culprit was on that one, but the point is that this streamlining compressed all of the migrations into one code path, and in the process of doing so they decided to optimize the logic and actually lost certain steps of it. So when we went back to try to tease out exactly the differences between Grizzly and Icehouse, we found that not only were we having to pull apart this monolithic migration just to get the last two steps of it, but that there were pieces missing in between. That process, which we expected might take a day or two, actually took an order of magnitude longer, and it was the first thing we ran into, the first red flag that went up and said this was going to take longer than that estimate of one engineer, two weeks.

Getting past migrations, the only other thing holding us up from actually creating a build, and deploying and testing it, was library dependencies. This is a bit of a cheap shot, but does anybody see the problem with this line? This was, as of stable Icehouse last summer, a valid dependency in Keystone. It's also a logical fallacy. Like I said, it's a cheap shot, but it was there, and it was one of the many dependency things we had to rationalize. We try not to pin dependency versions. We try to have all of our services, both the OpenStack services and the other services that we deploy in our clouds, play nicely together. Our hyper-converged architecture means that every single service is running on every single node in the cloud, so they all have to use the same runtime environment and they all need to use the same libraries. It's always an exercise for us to make sure that every single service is able to share dependencies and not collide with the others. In this case, Keystone collided with itself, in a sort of microcosm of exactly what we were trying to avoid. So rationalizing all of the dependencies, given that all of the OpenStack services were developed independently, was a time-consuming task, and one that is hopefully made easier now that we have things like the Oslo consolidation of common functionality.

This is the part where I refer back to those of you who think "we'll just stay with the current release and everything will be fine." There is currently, or at least there was as of last week before I got on a plane, a dependency in Keystone that requires a Juno library, or the unit tests don't work. That's Icehouse Keystone requiring a Juno library or the unit tests won't work. So even if you're staying current, dependencies are still, for some reason, a boondoggle that we constantly have to tease apart.
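To make the dependency collision concrete, here is a minimal sketch, not Piston's actual tooling, of the kind of check we ended up doing by hand: merge every pin a set of services declares for a package and ask whether any published version can satisfy all of them at once. The `packaging` library usage is real; the example pins and versions are made up for illustration.

```python
# A hedged sketch of a requirements sanity check; the pins and versions below are
# invented to illustrate "Keystone colliding with itself", not taken from Keystone.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

def satisfiable(pins, available):
    """pins: specifier strings collected across services, e.g. [">=0.9", "<0.7"].
    available: version strings actually published for the package."""
    combined = SpecifierSet(",".join(pins))
    return any(Version(v) in combined for v in available)

# No version can be both >= 0.9 and < 0.7 -- the logical fallacy in miniature.
print(satisfiable([">=0.9", "<0.7"], ["0.6.0", "0.7.1", "0.9.0"]))  # False
```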
Now, I mentioned the Oslo consolidation of features, and we absolutely applaud that design decision. We think it's fantastic. However, in yet another cheap shot, reminiscent of the Keystone one, the change from oslo.sphinx to oslosphinx cropped up right in the middle of what we were doing. Yes, it's easy, and it was fixed upstream, but I just wanted to point out one more thing we had to deal with: all of a sudden, oh look, this is going to break.

Oslo itself is actually pretty good, but Oslo messaging gave us the biggest problem in rationalizing these dependencies, and also just in getting the cluster to work, because once we could get Oslo messaging installed, we found that it didn't have features that the previous iteration, the pre-Oslo Nova messaging, did contain. Essentially... is everyone familiar with the Piston parable of the puppies and the cows? Okay, let me give you the brief version of it then. It was coined by one of our co-founders, Joshua McKenty. Essentially, you can treat your servers as puppies or as cows. If you treat them as puppies, you give them names, it takes several people to take care of them, and if they get sick it's extremely expensive. If you treat them like cows, you can have lots of them, they're numbered, if one of them gets sick you shoot it in the head, and a couple of people can take care of hundreds of them while drinking whiskey. That is the puppies-and-cows analogy, and that's how we tell our customers to treat their hardware. But we at Piston also treat our services and our software the same way.

With all of the services running on all of the machines, we need to be able to migrate services, move them around, and not particularly care where they are or what they're doing. We use RabbitMQ for a lot of the message handling, and we've got it everywhere, so if we need to do a node evacuation and migrate a service to another machine, we just kill Rabbit off. Any messages on that machine, whether they were in the local queue or just in flight, get dropped on the floor. It's our opinion that any network traffic handler of any kind should anticipate that certain things are just never going to make it to their destinations, and handle that gracefully. At the time we were doing our upgrade from Grizzly to Icehouse, Oslo messaging did not do that. The reason we ran into this, and had a hard time finding it, is that Nova in Grizzly actually did handle it fine, so we had been relying on that behavior from Nova in Grizzly and not expecting that it would regress in Oslo messaging. This, again, is sort of the grand prize of the conversation; this is the big deal. Oslo messaging took more developer hours to find it, fix it, and try to get the fix upstreamed (which we didn't end up doing) than any other single thing we had to do going from Grizzly to Icehouse.
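Our actual fix lived inside Oslo messaging itself, but as a hedged sketch of the defensive posture we're describing, here is roughly what "plan for messages that never arrive" looks like from the caller's side. It uses the modern oslo.messaging namespace; the topic name and retry policy are illustrative, not Piston's code.

```python
# A minimal sketch, assuming the modern oslo.messaging namespace: treat an in-flight
# RPC call as something that can simply vanish when a RabbitMQ node is killed, and
# retry (or degrade gracefully) instead of hanging. Topic and attempt count are made up.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_rpc_transport(cfg.CONF)
target = oslo_messaging.Target(topic="piston-demo")
client = oslo_messaging.RPCClient(transport, target, timeout=10)

def call_with_retries(ctxt, method, attempts=3, **kwargs):
    """RPC call that anticipates the broker disappearing mid-flight."""
    for attempt in range(1, attempts + 1):
        try:
            return client.call(ctxt, method, **kwargs)
        except oslo_messaging.MessagingTimeout:
            if attempt == attempts:
                raise  # the caller decides what to do; the point is we planned for loss
```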
This was the big boondoggle. We actually did try to get the fix moved back into Oslo messaging, and the argument over whether or not our fix was valid took so long that the deadline for patches to hit Icehouse expired during the conversation, so we couldn't quite get it in there. This was it; this was the big deal.

Now, I mentioned the oslo.sphinx-to-oslosphinx naming thing in conjunction with this, and people may or may not have heard the saying, I think it was Phil Karlton who said it, that there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. We didn't have any cache invalidation problems, but we did have our own naming problems; the oslo.sphinx one was one of them. But if you heard me earlier list off the six OpenStack services that we upgraded, you may have noticed that the last one I said was Quantum. When we released Piston 3.0, based on Grizzly, it included a service called Quantum. Quantum was renamed to Neutron during the Havana development cycle, and we didn't necessarily absorb that change right away, for a couple of reasons: one, our release schedule is just not rapid and iterative enough to accommodate that, and two, it was working for us, so, you know, why change it? Eventually, when 3.0 did end up shipping, after we had to massage certain things, especially behaviors in Horizon, we ended up with the Quantum service, the Neutron client, and some special error handling in Horizon, because there were quantumclient exceptions and neutronclient exceptions depending upon which code path things were going through, and no, they didn't subclass each other.

So what did we do, what did we need, how do we get the Neutron service installed? We punted that down the road and left it to the "upgrade to Havana" ticket, which is now the "upgrade to Icehouse" ticket, which is, again, one engineer, two weeks. So Quantum needs to become Neutron; we have to tease out all of those exception differences that we found in Horizon; we need to put the new service in place; and we need to remove the old Quantum service. All fairly straightforward, nothing too tricky there. One thing we did learn from this, though, that we could apply to future development, was that our entire QA and CI system expected there to be a service called Quantum. It did not know anything about Neutron, and moreover it expected that Quantum service to be there all the time. So when we switched from Quantum to Neutron, we had to add some flexibility into those two systems, so they could accommodate clouds that are this old and have Quantum as well as clouds that are newer and have Neutron. That was just something we hadn't anticipated, so it took more time. At this point we had gone way beyond one engineer and two weeks, but I will continue to beat myself up over that for probably the rest of my career, and Chris approves of that.
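For flavor, here is a hedged illustration of the dual error handling I just described in Horizon. The wrapper function is made up, but the two exception classes match the clients of that era, and, as I said, they don't share a base class.

```python
# A hedged sketch, not Horizon's actual code: both client exception hierarchies have
# to be caught explicitly because neither subclasses the other.
from neutronclient.common import exceptions as neutron_exc
from quantumclient.common import exceptions as quantum_exc

def list_networks_safely(client):
    try:
        return client.list_networks()
    except (neutron_exc.NeutronClientException,
            quantum_exc.QuantumClientException):
        # Treat both as the same failure regardless of which code path we came from.
        return {"networks": []}
```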
So that gives us the Neutron service, but what do we do about the Neutron network? In 3.0, with our Quantum service, the default network plugin we distributed to our customers, who were of course free to use whatever plugin they wanted, was Open vSwitch. The reason we ended up with Open vSwitch is because, again, to belabor the point, every service runs on every node; that's how a Piston cloud works. We needed a network service that could be distributed in that model, and during the Folsom-to-Grizzly development cycle we had read a blueprint upstream about multi-host Open vSwitch, and it sounded perfect to us. We liked it; we liked it quite a bit. Unfortunately, the decision was made to punt that work off to the Havana release cycle, and we were trying to release a product based on Grizzly. So we took the blueprint for multi-host OVS and cobbled together our own version of it. It worked, it functioned, it met the specifications, and our assumption was that once Havana landed and we upgraded to Havana (because that ticket still existed), we could take the upstream multi-host OVS, swap it in, get rid of our piece, and everything would be copacetic. So our Grizzly-based 3.0 product went out with multi-host OVS.

Unfortunately, the work to actually make multi-host OVS real in that Havana timeframe didn't line up with where we needed it to be when we were developing 3.5 and, again, skipping over Havana and going to Icehouse. We did a lot of investigation, contemplated just forward-porting all of this existing, non-standard, non-canon OVS code, and ultimately decided to scrap it. This was where we could take advantage of one beneficial side effect of something we had at the time and have continued to have: for over a year now, one developer has been more or less entirely focused on OpenContrail contribution. And OpenContrail could, I want to say easily, but OpenContrail could replace OVS as our default networking plugin for Neutron. We had a dedicated resource on it, we already had community involvement, and we had worked very closely with Juniper on a lot of things in that regard, so we were pretty confident we could make it work.

There were some idiosyncrasies. When we had initially done our Contrail development work for our 3.0 product, which was Grizzly-based, OpenContrail was focused on Havana, so we had actually done a lot of backporting of Havana code to work in our Grizzly system. Now we were looking at potentially forward-porting certain things into Icehouse, because Contrail wasn't yet ready for Icehouse. In fact, they told us they wouldn't be ready until July. At this point, this one-engineer, two-week endeavor had gotten to about the April-May timeframe, and we wanted to release in June, so July for OpenContrail meant it wasn't going to work any better than Open vSwitch would. However, we had the relationship with Juniper, we had our dedicated resource, and we had made the decision to get rid of OVS, so we decided to stick with Contrail. As it turned out, the timing didn't matter at all, because if any of you have followed along with the Piston release cycle, 3.5 actually shipped in September, just two months ago, which is after July for those of you who are calendar-challenged, so we didn't have to worry about that timing problem at all. Why did it take until September? Because of everything I've already spoken about, and a couple more things that I haven't covered yet.

So now we've gotten Quantum to Neutron, and we have five more OpenStack services. We've gotten through the migrations and the dependencies. All we have to do is get those five services up and running, and we will have upgraded Grizzly to Icehouse. As you can probably tell, we are in the home stretch here, but we're not out of the woods; I'm mixing metaphors like a professional. Five more OpenStack services. Glance: Glance is my favorite OpenStack service, and I'll tell you exactly why, though you can probably guess. We had no problem whatsoever upgrading Grizzly to Icehouse with Glance. I gave this slide a cat specifically for Mr. Mark Washenberger, who is my personal hero in this regard.
Cinder. Cinder was almost as good as Glance; we had exactly one problem with Cinder, and frankly it was our own fault. The logic of the Piston cloud boot system is such that, when it is bringing a cluster up and turning it into a cloud, it makes a whole bunch of assumptions about which things should happen in what order. This is based on our experience, but it's also based on some cavalier choices, and in pre-3.5 land we had made the decision to configure Cinder before we brought up the RBDs it was aware of. Grizzly Cinder was extremely forgiving about this: it would just accept its configuration and run, and then as soon as we needed something and it was actually available, everything worked. Icehouse Cinder is much more strict; it validates its configuration immediately, and since we didn't have our RBDs up, Cinder fell on the floor. So Cinder made us be a little more honest in our cloud boot process: now we provision our disks, and then we bring up Cinder.
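The fix in our boot orchestration boils down to an ordering check. Here is a hedged sketch of that idea, not our orchestration code; the pool name and the service unit name are assumptions for illustration.

```python
# A minimal sketch, assuming a Ceph RBD backend: confirm the pool that cinder.conf
# points at exists before starting cinder-volume, instead of letting Icehouse Cinder's
# startup validation fall on the floor. Pool and unit names are illustrative.
import subprocess
import time

def rbd_pool_ready(pool):
    # `rbd ls` exits non-zero if the pool does not exist yet.
    return subprocess.run(["rbd", "ls", "--pool", pool],
                          capture_output=True).returncode == 0

def start_cinder_volume(pool="volumes", timeout=300):
    deadline = time.time() + timeout
    while not rbd_pool_ready(pool):
        if time.time() > deadline:
            raise RuntimeError("RBD pool %r never appeared; not starting cinder-volume" % pool)
        time.sleep(5)
    subprocess.run(["systemctl", "start", "openstack-cinder-volume"], check=True)
```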
Horizon was also pretty functionally complete, pretty hands-off as far as any changes we needed to make or any difficulties we had. In 3.0, our Grizzly release, we had done an extensive amount of work on the dashboard itself. We tend to believe that a non-responsive web dashboard in 2014 is kind of ridiculous; responsive design is de rigueur, it should just be there. So we had done a lot of work on the dashboard to make it appealing to the eye, responsive to the user, and more intuitive as far as the flow of actions was concerned. When it came time to go Grizzly to Icehouse, mostly what we had to do was just bring it up to speed. The benefit of the fact that Horizon was easy while everything else was really hard was that we have a couple of really uncompromising UX people at Piston, and they got to spend a lot of that time really fine-tuning what they had done in 3.0, to make the Piston 3.5 dashboard, in my entirely biased and utterly subjective opinion, pretty much the best-looking dashboard in OpenStack, and really what all OpenStack dashboards should look like.

Keystone. There's an engineer back in San Francisco, a friend of mine. About a year and a half ago he got a job with Piston right out of college, and his very first assignment was to make multi-region with a federated back end work in Grizzly Keystone; and oh, by the way, you've got about a month to do it, and he had never touched anything authentication-related before. I think he still hates me for it, though I think he hates Chris more. He dove right in: he read every blueprint he could find, he basically lived in the IRC channels and lurked everywhere, and when Termie would stop by our office he would pick his brain on various topics. He went from zero to expert in authentication and authorization; it was amazing to watch. Ultimately, what he did for Grizzly and our 3.0 product was put together a federated back end based on those blueprints and the feedback he'd gotten, filling in the gaps of things that weren't already there in Grizzly. And when he was done with that he said, "Never make me touch Keystone again, please, or I will die." Guess who got assigned fixing Keystone for our Grizzly-to-Icehouse upgrade. Essentially, we had to tell him: we need you to take everything you just did, tear it all out, and rebuild it for Icehouse. We ended up making some different design decisions the second time around in how to handle federation. We actually sent him to Atlanta to the design summit, and he followed the entire Keystone track and came back thinking, "I know exactly how we want to do federation," and then we had some design discussions and we had to break his heart. He soldiered through it, and his reward for finishing was an even nastier assignment after that, which is not part of this discussion, but he's a trooper. Keystone finally did get put together, and our multi-region, multi-driver back end worked out just fine.

Which brings us to Nova, and Nova is obviously the service without which there are no clouds, right? Nova is the heart and soul of this whole thing, and obviously near and dear to folks at Piston and to all of us here. Nova was generally well behaved for us in and of itself. The thing we ran into with Nova is maybe Piston-specific, or at least specific to a subset of OpenStack use cases of which Piston is one: all of our migrations and all of our VM operations are done with live production workloads on running clusters, with customer data in play, and we need these things to work automatically, without downtime. That includes migrating VMs from machine to machine, and it includes things like what we were trying to do with this upgrade, where we were actually trying to freeze VMs with libvirt, upgrade QEMU in place, and then restart the VMs. Except now QEMU was offering a different ROM version than the one the VM had registered in its XML. The fact that apparently neither Nova nor anything else could really edit that XML on a live, running VM, combined with the fact that some of that XML was generated from the system configuration, and if the system configuration changed and decided it wanted to give different information it couldn't update the XML either, made for a very large knot of XML craziness that we needed to untangle to get these VMs to behave the way we wanted. We really wanted to get the upgraded QEMU in place, and that was the subject of quite a lot of discussion and quite a lot of hair-pulling; people pulling their own hair, not each other's. To figure out how to make it work, we ultimately had to do what I've already said we prefer not to do. We don't like to pin dependencies; we don't like to tie ourselves to specific versions of things. But in the case of Nova, what we ended up needing to do was this: we can freeze the VM, we can upgrade QEMU, but we need to pin the ROM versions within QEMU to the ones we already have. That way, when the VM comes back up, it doesn't fall on the floor.
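To show the shape of the problem, here is a hedged, read-only sketch (not Nova code) that just enumerates what machine type each running guest was started with, which is the value we ended up pinning the upgraded QEMU to. The connection URI and flag are the stock libvirt-python ones.

```python
# A minimal sketch using libvirt-python: list the machine type each live guest has
# baked into its domain XML. Nothing here modifies a running VM -- that's the point.
import xml.etree.ElementTree as ET
import libvirt

conn = libvirt.open("qemu:///system")
for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
    root = ET.fromstring(dom.XMLDesc(0))
    machine = root.find("./os/type").get("machine")   # e.g. "pc-i440fx-1.5"
    print(dom.name(), machine)
conn.close()
```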
What does that mean? It means we're punting that problem down the road, just as we had previously done with the Grizzly-to-Havana and Grizzly-to-Icehouse decisions: we'll worry about that when Havana lands; we'll worry about that when Icehouse lands. Well, this is an issue where we're going to worry about the ROM versions when we worry about the ROM versions. For now we'll pin them, and everything else about Nova will work: we'll get the new QEMU, we'll get the new libvirt, and everything else.

Which brings me to my very last slide. I said at the beginning: one engineer, two weeks. That was my estimate. And when it comes time to deal with this problem that we have punted down the road: one engineer (me), two weeks, on vacation. That'll be it.

Thank you very much. That was our upgrade-from-Grizzly-to-Icehouse odyssey. I hope it was enlightening, with a great amount of schadenfreude for you all. Thank you for coming.