Okay, we'll go ahead and get started. Thanks everyone for coming today. Today we're going to be talking about backport-or-upgrade decisions and unchaining OpenStack releases. Please hold your questions till the end; we'll have about 5 or 10 minutes for questions at the end of the presentation. For those of you who had been looking at the schedule, Chen Mainaik was supposed to be presenting with us, but unfortunately he couldn't make it today. He was very instrumental in coming up with the slides and the ideas that we're going to present. So, I am Brad Pokorny. I'm a principal software engineer with Symantec. I primarily work on Horizon, but I also work on some of the other OpenStack components. And this is my colleague, Ved Lad.

Hi everyone, I'm Ved. I'm also a principal software engineer at Symantec, and I work primarily on the platform-as-a-service solutions, but I've also worked on Keystone, Nova, and Horizon in the past. And I'll be doing the first half of our presentation today. Here's our agenda. I'm first going to talk about keeping up with OpenStack releases and how our approach at Symantec has evolved over time. Then I'll describe some Keystone use cases and how we went about solving them through upgrades. Brad will then cover some interesting Nova and Horizon use cases, and finally he'll recap our lessons learned.

Keeping up with OpenStack can sometimes feel like running on a treadmill that's going at 100 miles per hour, especially if your organization plans to upgrade every time there's a new release, which is every six months in OpenStack's case. Add to that the fact that bug fixes and stability support are only available for the two releases behind the latest, and things can feel a bit crazy. Now, when we first started our cloud at Symantec, we were aware of these facts, but we weren't quite aware of the cost of the upgrade itself. So let me walk you through the story of our cloud to give you a better idea of what I'm talking about. It all started with a big bang in the summer of 2014, when we deployed our first Havana production cluster in one of our data centers. The H here in the diagram stands for Havana. At that time, even though there was a newer release of OpenStack available, which was Icehouse, we decided to stick with Havana because we thought it would just be more stable. And as you can see there, we didn't have Horizon back then, because we wanted our customers, who are internal Symantec business units and developers, to work with just the APIs and the CLI clients. We later realized that wasn't such a good idea. Going forward from here, we had two main options: either we forked the Havana code base and maintained our own repo with our own features and our own bug fixes, or we synced up with the community code base by upgrading frequently to keep up, and by keeping our custom code changes to a minimum and keeping them portable. We went with the latter option because we wanted to get newer bug fixes and features automatically with each upgrade. We also decided to stay two releases behind the latest, two releases behind trunk, for stability reasons. So this meant upgrading every six months, whenever there was a new release, which usually happened around the time of each OpenStack summit. In our case, that was just six months after we went live. So six months passed by, and it was time to upgrade every single service to Icehouse.
The way we went about this was that we had one software engineer from our team assigned to each service, with the task of developing a detailed upgrade plan: listing all the steps needed, such as taking the API down, executing database migration scripts, and so on, and accounting for different kinds of scenarios such as failure and rollback. We then had a couple of infrastructure engineers compile all these plans and execute them in a coordinated fashion. A little before that, we also got our first Horizon dashboard, which our customers absolutely loved. No-brainer, right? So after a fairly tedious upgrade process, we were successful, and we reaped the benefits of a new release. If you want more details on how we actually did the upgrade and how we executed the plan within 10 minutes, I encourage you to attend a related talk that's happening on Thursday in this very same room at 11 a.m., by our very own engineers, Preeti and Gabriel.

Moving on: even though the upgrade was successful, we realized that there are some hard costs involved, mostly in the form of time spent or misused. First, there's the engineering time spent on coming up with the upgrade plan itself. If we had about seven engineers working 40 hours a week, and assuming it takes a sprint, around two weeks, to come up with the plan, that's about 560 engineering hours gone in a short time frame. That's time you'd probably rather keep in reserve to handle emergency bug fixes or instabilities in the cloud itself. The next cost was the time spent making our custom code changes compatible with the newer release, and we also spent a significant amount of time testing. Now, we at Symantec believe that if you move to a newer version, you want to give your customers at least the same experience or a better one; you don't want to go back on that. So we spent a significant amount of time testing the user experience, along with executing Tempest tests and manual workflow-based testing. We also spent a lot of time writing Puppet scripts to be compatible with the newer release, and we updated our deployment scripts as well. Most importantly, we realized that there will be a scheduled downtime associated with your upgrade. And this can be critical if, at the beginning of the year, your company or organization sets an availability goal, such as: for 2015, I want the availability of my cloud to be 99.999%. Well, if it's five nines, that only allows for about five minutes of downtime for the entire year.

So this experience taught us a lot, and going forward, we decided to rethink our strategy. We decided that we would upgrade or backport on a per-component basis, only when we really needed to. And we would be on the lookout, keeping our eyes and ears open, whether through IRC, Launchpad, or attending design sessions, for changes in the community code that could be of immediate benefit to us. So we've come up with this flow chart that summarizes the decision-making process we have followed to date. Starting from the top left, we always ask ourselves, for the feature or fix that we want: is there an immediate need? If the answer is no, then we try our best to get the feature into the community code base by working with the OpenStack community. If there is an immediate need, we check whether there is a solution in a newer version of the service. If there isn't, then we ask ourselves: can we make our change modular and portable?
If we can't, then we're forced to make invasive changes to our code base, and during upgrades we'll have to spend a lot of time porting them. So we try to make our custom code as portable as possible, which makes it much easier to carry between upgrades. Now, going back to the second diamond out there: if the solution does exist in a newer version, then we follow a completely different strategy, and I'll cover those paths in the following slides.

So, as I mentioned before, our strategy was to upgrade only when necessary and to be on the lookout for solutions in newer releases. Our cloud user base grew pretty fast, and we started onboarding complete business units onto our cloud, and our identity information started getting large and pretty disorganized. We had started out using the very basic DevStack model, where all the users and projects are stored under one default domain, and the users consist of both cloud users and service users. For those who don't know, service users are created per OpenStack service, for token validation purposes. The problem was that in our version of Keystone, we could only assign one identity backend. This meant storing the service users together with the corp users in one backend, and that can create problems depending on what your corporate policy is, or for audit and compliance reasons. We also wanted to isolate different business units' projects by keeping them in different domains. So we kept up with the latest developments in the Keystone world and found a solution one release ahead of the one we were on. In Juno, there was a completed feature called domain-specific backends (it's now called domain-specific config) that allowed a separate identity backend per domain. We figured this was a really neat way to organize users and business units by domain. The corp users could be stored in one domain with an LDAP backend, and the service users could be stored in MySQL. This was exactly what we wanted. If only we had Juno.

So we went back to the flow chart. There was an immediate need. We identified that a solution existed in the next release. And the change was pretty stable, and I mean this in two ways: first, it did not introduce any new bugs or instabilities into the system; second, there weren't many uncompleted tasks related to this feature in Launchpad. So the next crucial thing we wanted to check was: was it API-compatible? And to tell you more about this, I've dedicated a slide just to it. We know that OpenStack is a service-oriented architecture, with different services talking to each other using REST APIs. So when we upgrade one specific service, we want to make sure that its API interface works and responds the same way as it did before the upgrade. In our case, it was completely possible that Keystone had abandoned the old v2.0 API in favor of v3, and that could have left our cloud with no way to validate tokens. If you look at the graph here, you see who talks to whom. Horizon talks to every single service without anyone talking back, and you can see Keystone is exactly the opposite. And I recommend you make graphs like this for your own systems, because they're really helpful for interaction testing. So after testing, we determined that it was API-compatible, and we went ahead and upgraded straight to Juno, which was the latest release at that time; this was during February.
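To make the domain-specific backend setup concrete, here is a minimal sketch of what the Juno-era, file-based configuration looks like; the domain name, LDAP settings, and values shown are illustrative placeholders, not our actual deployment.

```ini
# /etc/keystone/keystone.conf: turn on per-domain identity backends
[identity]
domain_specific_drivers_enabled = True
domain_config_dir = /etc/keystone/domains

# /etc/keystone/domains/keystone.CorpDomain.conf: a hypothetical corp
# domain backed by LDAP. Domains without their own file (including the
# one holding the service users) fall back to the default SQL backend.
[identity]
driver = keystone.identity.backends.ldap.Identity

[ldap]
url = ldap://ldap.example.com
suffix = dc=example,dc=com
```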
Now, the problem with using the domain-specific backend feature I just talked about was that every time we updated domain information, we had to change the Keystone config files and restart Keystone. Imagine onboarding business units onto your cloud pretty frequently, and imagine doing this across a lot of data centers; it can get pretty tedious. So, again, we looked at the latest developments in the Keystone world, and we found a solution. In Kilo, they added an API that you could call to store domain-specific config and persist it in the database. There was no need to restart Keystone, and the config, being stored in the database, could be easily migrated between upgrades. This was exactly what we were looking for. Here's the Launchpad link for the whole domain-specific backend feature, and the Gerrit link is there as well. So we followed the same exact decision flow as we did for the Juno upgrade, and we went straight to Kilo, which is the latest release as of now. We ended up with a mix and match of different releases, working just as well as before, if not better. But one important thing we got out of this is that this approach made us more agile, and we saved a lot on the number of engineering hours lost during an upgrade in a short timeframe.
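As a rough illustration of that Kilo API, here is a sketch of storing a domain's identity config in the database; the endpoint host, admin token, domain ID, and LDAP values are all made-up placeholders.

```python
import requests

# Sketch: store a domain's identity backend settings in Keystone's database
# via the Kilo domain-config API, so no config file edit or restart is needed.
KEYSTONE = "http://keystone.example.com:5000/v3"  # placeholder endpoint
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN"}         # needs an admin-scoped token
domain_id = "0123456789abcdef"                    # placeholder domain ID

config = {
    "config": {
        "identity": {"driver": "ldap"},
        "ldap": {
            "url": "ldap://ldap.example.com",
            "suffix": "dc=example,dc=com",
        },
    }
}

# PUT creates the stored config for the domain; PATCH updates it later,
# and GET/DELETE work as you'd expect.
resp = requests.put(
    "%s/domains/%s/config" % (KEYSTONE, domain_id),
    headers=HEADERS,
    json=config,
)
print(resp.status_code)  # expect 201 on create
```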
Now I'm going to hand it over to Brad, who's going to talk about some really interesting Nova use cases.

Thanks, Ved. So, Ved has talked to you about some of the decisions we've made around upgrades, how we've decided what was best for us, and then some specifics on Keystone. I'm now going to take us through some specific use cases for Nova, and then for Horizon, and then go through some of the lessons we've learned along the way. For upgrading Nova: anyone who runs a large cloud and has been through some Nova upgrades probably thinks about upgrading Nova and goes, "not again, Nova." Things get complicated because of the dependencies Nova has. It has a lot of moving parts, and in a large environment you'll have a lot of compute nodes to upgrade, as well as your controller nodes. When you're upgrading compute nodes, you have to be very careful not to lose your existing customers' VMs, because that's a risk when you're changing the dependencies underneath what those VMs are running on top of. Nova also has dependencies on Neutron and on RabbitMQ, so, for example, you need to be careful about flushing the Rabbit queues as you go through the upgrade; otherwise you could have part of a transaction handled by one release of Nova and the second half by another release, and you can run into a lot of trouble that way. So, due to the complexity of Nova, when making upgrade decisions about it, you need to take into account the high cost of the upgrade, with a lot of planning and a lot of risk involved, and so the decision is a little bit different with Nova compared to some of the other OpenStack components.

These were some of the Symantec use cases we had for Nova, and at the time we were looking at these, we were running Icehouse Nova. We had some requirements for VM naming conventions; others for availability distribution scheduling, where we schedule VMs differently based on the project they're part of and the applications those VMs run; and we also had requirements for what we call class of service, where based on the class of service in the project data, we treat VMs differently. And finally, we had some config drive modifications to make for the way that VMs are named, specifically for DNS domains. At this point, I just wanted to get a show of hands: looking at these use cases, for people who have used Nova in a production environment, how many in the audience have had to solve some of these use cases on their own? So it looks like a few out there. These were our use cases, but they're also fairly common across the industry in the way you handle VMs.

When we looked at solving these use cases, we first looked at whether we could upgrade Nova to get these things. Looking into it, we actually found that Nova is very extensible in a lot of the right ways, so you can often get what you need out of Nova without having to actually upgrade. All of these use cases are actually supported in Icehouse, and some of them before Icehouse, with the hooks and scheduler frameworks. The hooks and scheduler frameworks make it very easy to modularize some of your customizations in Nova and still be able to port them between releases. This slide shows a technical example of using the hooks framework; there's a sketch of it below as well. You need to modify setup.py to tell it where your hook class is, then define your hook class and implement what happens when the hook gets invoked, and finally decorate the methods that you want to invoke the hook classes when those methods are called. All of those previous features we implemented just with the hooks framework and the scheduler framework.
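Here is a minimal sketch of what that looks like; the package and module names are made up, while the nova.hooks entry-point group and the pre/post convention come from Nova's hooks framework as it existed around Icehouse.

```python
# A minimal sketch of the Nova hooks framework (Icehouse era).

# --- setup.py of your customization package ---
#
# from setuptools import setup
# setup(
#     name='our-nova-hooks',  # hypothetical package name
#     py_modules=['naming_hooks'],
#     entry_points={
#         'nova.hooks': [
#             # 'create_instance' is an existing hook point in Nova's
#             # compute API; our class gets invoked around it.
#             'create_instance=naming_hooks:NamingHook',
#         ],
#     },
# )

# --- naming_hooks.py ---
class NamingHook(object):
    def pre(self, *args, **kwargs):
        # Runs before the hooked method, e.g. to enforce a VM naming
        # convention before the create request proceeds.
        pass

    def post(self, rv, *args, **kwargs):
        # Runs after the hooked method; rv is its return value.
        pass

# Inside Nova itself, a hook point is created by decorating a method:
#
#     from nova import hooks
#
#     @hooks.add_hook("create_instance")
#     def create(self, context, *args, **kwargs):
#         ...
#
# so you only add the decorator yourself if you need a brand-new hook point.
```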
So when we looked at this for Nova, we considered our use case, and we again had an immediate need for a feature. It wasn't contained in the base Nova code, but through some research we found that Nova provides this modular way of doing what we needed. So we went through and implemented what we needed, and that allowed us to solve our use cases for our customers, as far as VM naming and how VMs are handled, while still being able to easily port those functions between releases. So now, when we upgrade Nova later on, we'll just be using the same hooks and scheduler modifications we were using previously. After those decisions about Nova, we decided to stay on Icehouse, using the hooks and scheduler frameworks. And as Ved mentioned, you have to be careful about API compatibility between releases, but through validation we found that Icehouse Nova would work properly with Juno Keystone and even with Kilo Keystone. So at this point in our evolution, we were running Kilo Keystone and Icehouse for everything else, and things were working properly with each other. For the near future, we do have an upgrade planned: we'll probably upgrade Nova to Kilo, mainly due to the desire to stay closer to where the community is going. For your own use case, you might not need that; you can fall further behind, of course, but that's where we are at this point. One of the later things that might also convince us to upgrade Nova further is the Gantt scheduler, so once that's ready, we'd upgrade for that as well.

Next, I'm going to talk about some of our Horizon use cases, and in this case, it's a study in working on trunk, or even a little bit ahead of trunk. We use Keystone v3 in our environment, and Ved has already talked about domains and projects in Keystone v3, so our users need to be able to manage membership on their domains and their projects. In order to do that in Keystone v3, you need domain-scoped tokens to modify these things. As of the Juno release, Horizon did not support working with domain-scoped tokens, so for the Kilo release, we were working along with the community, making the changes to get domain-scoped token support into Horizon. It was very important that our users be able to use these changes in Horizon, since that makes things much easier than trying to use the CLIs for everything. So we were implementing these things in the community, in Kilo. Unfortunately, it turned out to be a lot of changes, and pretty complex to get in, so it didn't make the Kilo release. But our users still needed this support, so we ended up using the unmerged changes from the community, applying them for what we needed in our environment. Also on this slide are a few links on what we did and how we did it for pulling in the domain-scoped token support.
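For a flavor of what a domain-scoped token is, here is a minimal sketch of requesting one from the Keystone v3 API; the endpoint, user, password, and domain name are made-up placeholders.

```python
import requests

# Sketch: request a domain-scoped token from Keystone v3.
body = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "name": "alice",                      # placeholder user
                    "domain": {"name": "CorpDomain"},     # placeholder domain
                    "password": "secret",
                }
            },
        },
        # Domain scope (rather than project scope) is what lets you manage
        # users, groups, and role assignments on the domain itself.
        "scope": {"domain": {"name": "CorpDomain"}},
    }
}

resp = requests.post(
    "http://keystone.example.com:5000/v3/auth/tokens", json=body
)
token_id = resp.headers["X-Subject-Token"]  # the token comes back in a header
```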
So for our decisions about Horizon in this case: again, we had an immediate need for the feature. The feature was planned for the next release of Horizon, in Kilo. But when we got to actually needing the changes, we considered them not stable, since they weren't merged yet. So we made some custom stability modifications for our own purposes and validated those in our own environment. With those modifications, we were able to backport: we started with Kilo Horizon and backported those changes along with our own modifications, and we were able to do what our users needed using just those unmerged patches from the community. That gives our users what they need for managing projects and domains with our Keystone v3 install. So at this point we ended up going to Kilo Horizon, but actually we're a little bit ahead, since we also pulled in those domain-scoped token patches. As you can see here, we've now got a few different releases of things going on in our environment, and through planning and testing things out as we pulled them in, we were able to validate that everything would work properly, so now we have a few releases running in harmony in our environment. The next thing we have planned, and this is the near future for us, is upgrading Glance from Icehouse to Kilo. Similarly to what we were talking about with Horizon, there are unmerged changes for Glance community images that we need, so we'll be doing a similar thing here: pulling in unmerged changes, validating them, making any stability modifications we need, and then upgrading.

So we've taken you through some of the use cases we've had and how we've made some decisions, and next I wanted to talk about some of the overall lessons we've learned. One of the major things we found is to isolate all your components, and this is easiest if you've got your components running in VMs or Docker or some other setup where you can be very flexible about where you're running hosts. You can see that previously we had Nova, Keystone, and Glance all running on the same host, but that made it very difficult to upgrade those components to different releases: when you're modifying dependencies underneath those services, you can get yourself into a lot of trouble breaking dependencies between them. So we switched to running Nova, Keystone, and Glance in separate VMs. This is what decouples the releases and makes it possible to upgrade each one separately.

We've also worked with unmerged changes from the community, and in this case we found that it's critical for your developers to be developing the patches that you need, no matter who is working on those changes in the community, and, as part of that, to continuously review as well. This not only ensures that your developers understand the code before you need to pull it in, but it also benefits the community as a whole: putting back fixes that you found as those patches are being developed, and also providing comments in the reviews of those patch sets as they go on. When working this way, there's a continuous back and forth between pulling and testing the community repo changes, testing them in your internal repo, and then contributing back to the community, both in patch sets and in reviews, as you go along; there's an example of pulling a patch set below. Efficient internal CI/CD also helps a lot: as these unmerged changes are being worked on, there will be new patch sets coming in, and it'll be much easier for you to get some validation on whether they work properly if you can fold them into your own CI/CD and automate the testing of those changes. Continuous internal testing is also very important: keep pulling the patches as they come through, even if it's just manual testing to validate that they work in your environment, and follow along with the patch sets as you work them in the community. Ideally this would be automated testing, something like Tempest, where you can do some integration testing in your own environment, but even manual testing is very helpful. And then plan to sync back up with the community later: if you're making modifications for your own stability in your environment, try to keep those changes as close to the community as possible, so that when those patch sets do merge in the community, ideally you can just upgrade to the latest community release and get what you need automatically.

Another thing we found is that you should avoid unnecessary upgrades as much as possible. In a business environment, your use cases are what drive your need for upgrades, rather than always staying with the community or always having to be on the latest release. So when you're making upgrade decisions, you need to balance the benefit you get from those upgrades against the cost of going through them. Often backporting can be cheaper in the short term, but again, you have to be careful when backporting that you at least stay close to the community with what you're doing going forward, so you don't stray too far in the long term. And then, based on your own needs, weigh the cost of falling far behind the community. When working with community releases from the past, the community only supports two stable branches, so you need to at least have your own branches in your own environment to be able to keep developing on a previous release, and to have enough talented developers to support that previous release, because you won't be able to go back to the community if you're too far behind to pull the previous release of the code. So we've talked about how we've looked at decisions and made them, some of the use cases we've gone through for whether to upgrade in certain scenarios, and some of what we've learned over time. We found that by planning and validating whether we can upgrade different components to different releases at different times, we can run with different releases of the different components in the environment.
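As a concrete example of pulling an unmerged patch set, here is roughly what that looks like against OpenStack's Gerrit; the change number and patch set number are placeholders, not the actual Horizon reviews we used.

```sh
# Fetch an in-review patch set from Gerrit and apply it to a local branch.
# The refs/changes path (21/141021/25) is a made-up placeholder.
git fetch https://review.openstack.org/openstack/horizon refs/changes/21/141021/25
git cherry-pick FETCH_HEAD
# Then run internal CI (e.g., Tempest) against the result before promoting it.
```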
And so it's not always a case of having to be on the treadmill; based on your use cases, you can decouple the components, run with them, and upgrade only what you need, only at the times that you need to. So with that, it's time to break the chains on these releases. Thank you. And we have some time now for questions; if you do have a question, please step up to the microphone so everybody can hear properly.

Hi, I actually had a question about upgrading several versions at once. We had a big issue when we upgraded directly from Grizzly to Icehouse, for example; some of the component database migrations didn't take unless you actually ran a version in between. Have you got any ideas on how to solve these problems with your larger upgrades, when jumping several versions?

Yeah, that's an interesting problem, trying to jump several versions at a time. We haven't actually done that in our environment yet; every time we upgraded, we went at least one release at a time. I'm not familiar with a way to do the database migration from one release while jumping over two of them. Probably the safest thing would be just one at a time, at least for the database migrations.

Yeah, but that makes the upgrade much more work, because you actually have to start the services on an intermediate version every time.

Yeah, right, that's true.

We ran into similar issues, yeah. Thank you.

Interesting. So is this model of upgrades embraced by the vendors who provide support, for example Red Hat or anybody? For example, if any issues come up after the upgrade and I go to the vendor, the first thing the vendor says is, okay, here's where I want you to be; get in line with that, and only then can I provide support. Is this embraced by those vendors or not?

I don't know of any vendors who are supporting this. In our case, we run our own private cloud, so we have our own support through our own developers and our own infrastructure people, and for our environment, as long as we're familiar enough with what we're doing, we can run on different levels. It's maybe a bit harder if you're trying to stay completely with Red Hat support or another vendor who would want you to be completely on one release for everything.

Nice presentation. So certain components can be upgraded in advance of others; the Keystone example works pretty well, and I think Horizon also has a stated goal of being backwards compatible, so you can run a newer version of Horizon against older versions of the other stuff. But for those other components, like Icehouse Nova, Cinder, Glance, things like that, have you actually talked to those development communities to figure out if they could start publishing backwards-compatibility release notes, like "Icehouse Nova is compatible with Kilo Keystone," that type of thing? And what would make it easier for you and the community to figure out the matrix of things that are backwards or forwards compatible with each other?

For our own environment, I mentioned that we're currently working on a Glance upgrade to Kilo, and we've done some testing with that; of course, you do these things in a test environment beforehand to validate that the APIs you need work in your environment with the different releases you have. I don't know of any efforts going on in the community to have a matrix of which releases work with each other. That really goes back to API compatibility and which releases can be spread across each other.
For instance, if you were to use some specific features in newer versions, like Fernet tokens in Kilo, would Icehouse Nova support Fernet tokens, or would you have to upgrade the Keystone middleware token code that's in Nova to the Kilo version, so that now you have a hybrid of everything?

That's something that's based on your own use case: what you need from each service, and whether the APIs for those use cases are compatible between releases.

Very good talk. Quick question: you've discussed API compatibility testing. Can you give more details on that? Is that mainly Tempest-based, or how did you test that the different versions of Horizon were compatible?

It would be best to automate the testing. In our environment, we did it in a dev/test environment that's very similar to what we have in production, so we're able to install into that environment and play around with things. And again, that goes back to what your use cases are: your customers may not need certain functions from Keystone or Glance, but based on exactly what you need, those are the things that would have to be validated to work together.

Right, and we also incorporated workflow testing; for example, we test Nova through Horizon. So whatever component we're upgrading, we test it through the other services, and that ensures they work as they did before.

That makes sense. In relation to the previous question about Nova: I actually heard in Paris that they're going to support one version of backwards compatibility, even during the upgrade window, so you can have mixed versions on the computes. Nova might be newer and the computes might still be on the older version, and that should work just for the duration of the upgrade. So this is supported, okay? That's what I heard in Paris. My question is actually about the model you mentioned, where you have the OpenStack services running inside VMs. Do you, for example, do a kind of in-place upgrade, which is basically creating a new VM, installing the new version of the OpenStack service, and throwing away the old one?

That's actually a very good question for our related talk that's happening on Thursday at 11 a.m. But go ahead.

Because we're looking exactly in this direction, right? We don't really care about the old version. We basically create another VM, install the new version of the OpenStack service, whatever it is, and just switch HAProxy over to the new VM. Then we can keep the old version around, which gives us a really easy rollback in case of any problems. And storage is cheap, so we can keep it for, I don't know, half a year or a year; nobody cares.

Yeah, and that's one of the easy ways to make sure that things work properly. Again, the talk on Thursday will get into the specifics of that.

Okay, thank you. Thank you very much. If you have any further questions, just talk to the presenters outside. Thank you.