Good morning everyone, and welcome to our session on upgrading OpenStack from Juno to Mitaka. This is an experience report on what we did for our cloud, the Open Telekom Cloud, a joint venture together with Huawei: a public cloud run in Germany. We recently upgraded from Juno to Mitaka, and we want to tell you a little bit about what we did here. But first, some words about ourselves.

Okay, welcome everyone. I'm Dennis Gu, cloud chief architect at Huawei Technologies. Over the past year, and especially over the past few months, we worked closely with T-Systems to accomplish the online version upgrade from Juno to Mitaka, covering nearly 2,000 servers spanning two cities. Along these lines we want to share our experiences and the methodologies for achieving the service continuity that is expected by all the critical enterprise applications on top of this cloud.

Good, my name is Sebastian Wenner. I'm one of the architects behind the Open Telekom Cloud, working on this for two and a half years now, building our public cloud together with Huawei and taking part in the major pieces of the infrastructure and all the automation around it.

Good. Let me show you a very short video to begin with, to show we are not alone here, if that thing starts. This is actually a commercial that was run by eBay on German television, but as I saw it, it perfectly fit what we wanted to achieve here. Basically it's a pit stop at a hundred miles per hour: changing the tires on the car you are in while driving, and trying to achieve as much continuity and availability for the end customer as possible without actually affecting your service. I think they had the perfect timing for changing their tires, and we had the perfect timing for changing our OpenStack versions. So kudos to eBay for making that very nice commercial that fitted so well into my presentation here.

Good, so let's get started. What are we talking about today? First, I'll give you a bit of context, so bear with me if we have one or two marketing slides just to set the overall scene. Then the starting point: where did we start, a bit more than one year ago, in 2016, when we put our public cloud live. Why did we go directly to Mitaka, and what is the update strategy behind it? Then some thoughts on how to avoid, or rather minimize, downtime, which I think is the better phrasing, and finally customer communication and the lessons we learned. Yes, as I said... actually, it's three marketing slides.
So, sure, you will survive it. Where did we come from? T-Systems is part of Deutsche Telekom; most probably you are more familiar with T-Mobile, which is the mobile business part. In the past we did classic outsourcing and scale-up cloud, our private cloud. Recently we also moved into hybrid cloud with our vCloud offering. But what was missing from that whole puzzle was the public cloud part, and that's where Open Telekom Cloud and our partnership here with Huawei came in: we put together a public cloud for the European market that really scales and is secure. We are offering something that fills the gap between the big players you have in the market, like Amazon, like Google, like Azure, but running on an open platform, OpenStack. We do it in a secure way, protected by data privacy and the German legislation behind it: no third-party access from non-European countries to the administrative back end. It is affordable, we are running even below Amazon pricing, and it is open, based on OpenStack, with all the APIs you need to run a scalable cloud application on top of it.

Where did we start? We launched at CeBIT 2016, one of the big IT fairs in Germany. It was really a minimum viable product, I would say: a solid basis with just an infrastructure-as-a-service offering, plus all the surrounding landscapes you need to run it, rather limited in the set of features, but a rock-solid basis we could start with. What we had there in 2016 at CeBIT was the Juno release. If you compare that to the official upstream release schedule, by the time we went live with it, Juno was already end of life, at least from the official upstream point of view. But we are partnering with Huawei, so we still have the great opportunity of a fully patched and managed distribution that receives security fixes even after that time.

Moving ahead, what we did with OTC 2.0, the Open Telekom Cloud release 2 that we launched more or less one year later for CeBIT 2017, was to move to Mitaka. As you can see here, we are closing the gap to the official release schedule. Having this half-year release cycle in a service-provider environment, where you really have to focus on reliability, stability and a stable API basis for your customers, is always a trade-off between releasing something that is fully tested and fully compatible and keeping up with the schedule. We are working hard to narrow that gap, so that with the upcoming releases we will move much closer to the official roadmap of the OpenStack community.

Good. Preparations: what did we need to do? As you saw on that puzzle slide a few slides ago, we are not the only ones in that huge Deutsche Telekom environment, and not the only cloud run by our company. The supporting systems and the landscape around us are common, shared systems: identity management, onboarding of customers, the shop, the billing chain behind it.
These are all services that surround our island, as I would call it, of Open Telekom Cloud, and we have to build many bridges to these environments: to make sure that customers get onboarded, that they can get onto the cloud, that we can collect charge data records and, at the end of the day, print a bill to the customer, so that my salary gets paid. That is the important part, and it also imposes a lot of complexity. If we are doing upgrades, we have to ensure compatibility with the systems before and the systems afterwards, so the dependency on third-party services is something we cannot neglect in anything we do. Then there is coordination with the vendor: we get update packages and input, and we need to coordinate what has to be done in what order. And there is communicating to the customer: it was not completely downtime-free, so the video was lying a bit; we did not do everything at full pace. At some point there was an infrastructure reboot that we could not avoid, and did not want to avoid, and also hardware and OS upgrades and all these things happening around it needed to be coordinated. Bottom line: project planning, project planning, project planning. You need a fairly good idea of what you want to do, and you need to plan well ahead. That's what we did, and we want to give you an insight into how we did it. I mean, either you have Chuck Norris, and then he will just do it on his own, on the fly, while driving that car and changing the tires in one step; but if you do not have Chuck, then you need a vendor or a supplier to support you, and that's what we have with Huawei. Dennis, do you want to tell a bit more about the details of what happened under the hood?

Okay, so I'm quite honored to play the role of Chuck, to face the challenges and fix the problems. Basically, the major fundamental methodologies we prepared for the seamless upgrade from Juno to Mitaka, given the substantial software changes between the two major versions, are these. First of all, the delivery of a complete package of upgrade automation tools, what we call FusionUpdate. This is a tooling set composed of an agentless, Ansible-based automation framework, extended with a series of functional extensions and automation scripts to enable the OpenStack over-the-top service upgrades, plus an agent-based OpenStack and host/hypervisor upgrade system, which we call FusionDeploy, to facilitate the upgrade steps that are closer to the fine-grained control of the operating-system layer and hypervisor configuration options. Then, of course, there is regression testing, conducted in the so-called mirror environment, a replica with the very same configuration as the production one, so that we can run all the comprehensive gray-upgrade tests and functional tests in the mirror environment before a change is finally shifted and activated in production. And there are the so-called security updates with hotfix technology, which makes sure that the data-plane and hypervisor-level upgrades do not bring any service discontinuity during the whole upgrade procedure.
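To make the mirror-environment gate a bit more concrete, here is a minimal sketch of the idea: apply the upgrade to the mirror environment, run the automated regression suite there, and promote the packages to production only if everything passes. The playbook names and inventory paths are illustrative placeholders, not the actual FusionUpdate tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch of a mirror-environment promotion gate (illustrative only)."""
import subprocess
import sys

# Hypothetical playbooks: one applies the upgrade to the mirror environment,
# one runs the automated gray-upgrade / functional regression suite there.
STAGES = [
    ["ansible-playbook", "-i", "inventory/mirror", "upgrade_controllers.yml"],
    ["ansible-playbook", "-i", "inventory/mirror", "regression_suite.yml"],
]

def run_stage(cmd):
    print("running:", " ".join(cmd))
    return subprocess.run(cmd).returncode == 0

def main():
    for cmd in STAGES:
        if not run_stage(cmd):
            # Any failure in the mirror environment blocks promotion.
            print("gate failed - package is NOT promoted to production")
            sys.exit(1)
    print("all mirror checks passed - package may be promoted to production")

if __name__ == "__main__":
    main()
```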
Avoiding reboots this way is also very crucial in comparison to normal patching mechanisms, which usually require rebooting the systems and therefore bring disruptive, negative experiences to the tenants and customers. And finally, the interesting part of what we bring to the OTC cloud is cascading OpenStack, which spreads across two cities, one in Biere and one in Magdeburg. Each city has a standalone OpenStack deployment instance, while a cascading layer sits on top of the two standalone OpenStacks to enable better scalability as well as unified API exposure across the two availability zones. Even during the whole upgrade procedure, the two independent availability-zone OpenStack deployments can be upgraded fully decoupled, since the cascading layer talks to the cascaded, native OpenStack by means of standardized RESTful APIs.

Regarding the fundamental methodology for accomplishing the challenging task of seamlessly upgrading a cloud environment of over 2,000 physical servers within two days, or more specifically two nights: we needed to carefully plan the decoupled upgrade steps, separating them into three horizontal layers. The first is the service-plane layer, made up of the set of service consoles and the IaaS and PaaS services sitting on top of OpenStack. The second is the OpenStack control layer itself, covering both the cascading and the cascaded OpenStack, together with the relevant OpenStack agents running on the distributed compute nodes. The third is the data plane, which we consider the most critical in terms of ensuring service continuity from the tenant perspective. Since all the tenants' real-time, mission-critical workloads are running on top of these data-plane hypervisors, the software-defined storage and the software routers and distributed switches, we need to guarantee that this part of the upgrade is conducted in a very seamless way. In other words, upgrades of the data plane should happen as rarely as possible, and if an upgrade is inevitable, we need to guarantee that we leverage advanced technologies like hotfix or hot replacement rather than simple, cold-reboot-based patching, so that the whole data plane keeps running without needing a reboot to complete the upgrade. We also need to leverage clustering mechanisms: since the distributed compute and storage nodes span hundreds or even thousands of nodes, we need to divide the data plane into several clusters, or fault domains.
We need to conduct the rolling upgrade of the data plane cluster by cluster rather than all at once. This is the basic methodology we followed. On the first night we accomplished all the control-plane and service-plane component upgrades without affecting the data plane, so all the tenant workloads kept running seamlessly throughout the first night's upgrade of the relevant service components as well as the OpenStack controllers and agents. On the following two nights we accomplished the rolling upgrades, cluster by cluster. Normally we organize the compute and distributed storage nodes into clusters of typically 50 to 200 servers as one rolling-upgrade batch, and work through them gradually.

Regarding regression testing, we think it is also very crucial, since we believe internal test verification or integration verification alone is not enough. In the real production environment we are confronted with the further challenge of being integrated with the adjacent BSS subsystems as well as the VPN and public networking connectivity and configuration, and all these adjacent systems that connect into the online systems of the public cloud need to be fully verified against the full software configuration stack. With that in mind, after passing the quality-gate criteria from the development environment, we first conduct a series of service deployment and gray-upgrade tests, as well as fully automated functional tests of the complete functionality, in the so-called pre-production environment; that environment has the very same comprehensive configuration as the production one. Only if everything completes successfully, verified by monitoring the relevant logs and even by manual inspection, do we proceed with the next step of rolling-upgrading the real production environment and bringing it fully active and back on board. If something goes wrong during the automated tests for production and is not recoverable, we roll back the production environment to the previous version. That rollback can even be conducted per decoupled microservice rather than for everything as one whole package, so that the rollback procedure and its impact are minimized as far as possible.

With regard to the complexity of repeated manual configuration and low-efficiency actions, we introduced Ansible to enable automated configuration and onboarding of the applications as well as the components and infrastructure, especially on the targeted dedicated hosts or host groups. On top of the fundamental Ansible engine we extended a series of commonly reusable functionalities — package management, configuration and environment management, account and node management, operation log management — in addition to the functions already available in the Ansible framework, so that they can be invoked by the playbooks of the orchestration scripts that provide the higher-layer sequential workflow editing for the daily upgrade procedure.
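As a rough illustration of the batch-by-batch rolling upgrade described above, here is a minimal Python sketch. The node names, the batch size and the upgrade_batch() helper are illustrative assumptions, not the production FusionUpdate implementation; in practice the batches are also aligned with fault domains.

```python
#!/usr/bin/env python3
"""Minimal sketch of a fault-domain-aware rolling upgrade in batches."""
from itertools import islice

BATCH_SIZE = 50  # the talk mentions batches of roughly 50 to 200 servers


def batches(nodes, size):
    """Split the node list into consecutive batches of at most `size` hosts."""
    it = iter(nodes)
    while chunk := list(islice(it, size)):
        yield chunk


def upgrade_batch(batch):
    """Placeholder for the real per-batch upgrade (e.g. an Ansible playbook
    limited to exactly these hosts) followed by its health checks."""
    print(f"upgrading {len(batch)} nodes: {batch[0]} .. {batch[-1]}")
    return True  # pretend the batch upgraded and passed verification


def rolling_upgrade(nodes):
    for batch in batches(nodes, BATCH_SIZE):
        if not upgrade_batch(batch):
            # Stop immediately: the remaining clusters keep serving tenants
            # on the old version until the failure is analysed.
            raise RuntimeError("batch failed, aborting rolling upgrade")


if __name__ == "__main__":
    compute_nodes = [f"compute-{i:04d}" for i in range(1, 1001)]
    rolling_upgrade(compute_nodes)
```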
In terms of minimizing service interruptions during the whole upgrade procedure, we basically categorized the cases. The first case is minor version changes within the OpenStack controllers and the relevant agents, with the prerequisite that the RPC interfaces stay compatible, or even unchanged, between all the relevant components of the upgraded services. The basic methodology to guarantee a smooth service-layer upgrade here is to put HAProxy in front of the API servers, with multiple instances of each API server. Take an example: we select just one instance of the API server to be upgraded from version one to version two, and HAProxy guarantees that this instance is isolated from traffic while the remaining API instances keep serving requests, until the newly upgraded version becomes active. By these means a gradual, gray or rolling upgrade is enabled in the control layer, minimizing the provisioning impact during the service-component upgrade.

Then there are the cases with major version changes, scenarios where there is no possibility of leveraging, for example, Nova's minor-version compatibility and self-negotiation mechanisms. In those cases only a replacement upgrade is possible: we need to stop the service instances and then restart and boot the newly upgraded versions, with around 30 to 40 minutes of service interruption in total for these replacement-based major version changes. Actually, going from Juno to Mitaka, most of the components fall into this major-version-change procedure, characterized by incompatible RPC versions of the relevant connectors, schedulers and authentication modules.

Okay, next: the most mission-critical and service-continuity-sensitive part, the hypervisor and data-plane upgrades. Of course, the approach commonly used in virtualization and private cloud environments is live migration. But live migration still has some limitations. It first requires redundant available resources, and there are many limitations when doing live migration across different flavors of VM instances. You will also see some failures: as we observed in our live environment, the live-migration failure rate can be three to four percent, which translates into roughly a 96 to 97 percent average success rate. In particular, in order to guarantee that success rate — since we are using Xen hypervisors in the incumbent environment with the PV driver — we need to fine-tune and troubleshoot our PV driver, rather than the open-source pvops driver. So, as we suggested, live migration will not be the primary or first choice for the hypervisor and data-plane upgrades.
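For the cases where live migration is used, for example to fully drain a host for high-end tenants, the roughly three-to-four-percent per-attempt failure rate means the driver script has to retry and fall back. Here is a minimal sketch of that logic, assuming the openstacksdk compute proxy exposes live_migrate_server() and a clouds.yaml entry named "otc"; all names are illustrative, and a production script would poll migration state rather than just the server status.

```python
#!/usr/bin/env python3
"""Sketch: drain one compute node via live migration, with retries."""
import openstack

RETRIES = 2


def drain_host(conn, hypervisor_hostname):
    servers = list(conn.compute.servers(all_projects=True,
                                        host=hypervisor_hostname))
    for server in servers:
        for attempt in range(1, RETRIES + 1):
            try:
                # Let the scheduler pick the target host; block_migration
                # depends on whether the instance sits on shared storage.
                conn.compute.live_migrate_server(server, host=None,
                                                 block_migration=False)
                conn.compute.wait_for_server(server, status="ACTIVE",
                                             wait=1800)
                break
            except Exception as exc:  # sketch only: real code narrows this
                print(f"{server.name}: attempt {attempt} failed: {exc}")
        else:
            # Out of retries: flag for a fallback (VM HA or local reboot).
            print(f"{server.name}: live migration failed, flag for fallback")


if __name__ == "__main__":
    conn = openstack.connect(cloud="otc")
    drain_host(conn, "compute-0001")
```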
Our suggestion is rather a hotfix solution, enabling so-called function replacement: an instruction jump from the original code segment into the replaced hot-function segment, and then a jump back to the original instruction stream. This is the general mechanism we propose to ensure seamless data-plane upgrades throughout the whole procedure, and this hotfix technology successfully reduced the necessary reboots by up to 80 percent. Here is a pros-and-cons analysis of the hotfix technology in terms of its prerequisites and limitations: it is, of course, limited to function changes rather than whole-process changes, and it only covers code-segment changes. On the future roadmap we will support whole-process replacement in user space, so that hotfixes can be applied not only to security patches or minor changes in the kernel and within functions, but to a broader set of cases, enabling even smoother on-the-fly replacement in the data plane.

Of course, there are many cases that cannot yet be fully covered by hotfixes. Being aware of these situations, we still need other mechanisms to minimize service interruptions during the data-plane upgrade: live migration, shutting down the virtual machine locally and rebooting it, or, based on shared storage, VM HA to restart the virtual machine instances without rebooting the whole host OS and hypervisor. In comparison to a local reboot of the host OS and the VMs, VM HA enables a shorter service interruption; its applicability is also much broader, and the interruption SLA is better than a local VM shutdown and reboot. So that is the primary recommendation for situations where a hotfix is not applicable. Only where there are special concerns for high-end customers, where a nearly zero-interruption service level is required for some high-end enterprise workloads, do we suggest introducing live migration, as an optional rather than a ubiquitous solution for downtime minimization.

Finally, regarding minimizing service interruption in the data-persistence layer, especially the software-defined distributed storage: in the Huawei-powered public cloud solution we use a DHT (distributed hash table), with fully distributed client-side storage agents interconnected with the distributed storage back end. In order to guarantee continuous data access to the shared storage for all the VM clusters, we also introduced a similar hotfix mechanism in the client-side distributed agents, so that access to the back-end storage is not interrupted during the whole upgrade procedure. With regard to the persistent data volumes themselves, with this DHT mechanism and its distributed, decentralized architecture, every single virtual machine volume is partitioned into multiple copies across different fault domains, that is, isolated fault-domain server clusters. Once we conduct the rolling upgrade of the back-end storage server clusters batch by batch, cluster by cluster rather than all at one time, we can guarantee that the multiple copies corresponding to the same partition of a volume are never rebooted or unavailable at the same time, so that the service continuity of the persistent data layer is fully guaranteed.
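The replica-safety argument above can be expressed as a very small check: before a batch of storage nodes is taken down, verify that no partition would lose more than one of its copies at once. The partition-map format below is an illustrative assumption, not the real DHT back end's API.

```python
#!/usr/bin/env python3
"""Minimal sketch of the replica-safety check behind the storage rolling upgrade."""

# partition id -> set of storage nodes holding a copy of that partition
PARTITION_MAP = {
    "p-0001": {"stor-01", "stor-12", "stor-23"},
    "p-0002": {"stor-02", "stor-13", "stor-24"},
}


def batch_is_safe(batch, partition_map):
    """A batch is safe if, for every partition, at most one replica lives on
    the nodes that are about to be upgraded (and thus briefly unavailable)."""
    batch = set(batch)
    for partition, replicas in partition_map.items():
        overlap = replicas & batch
        if len(overlap) > 1:
            print(f"unsafe: {partition} would lose {len(overlap)} copies at once")
            return False
    return True


if __name__ == "__main__":
    print(batch_is_safe(["stor-01", "stor-02"], PARTITION_MAP))  # True
    print(batch_is_safe(["stor-01", "stor-12"], PARTITION_MAP))  # False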
Good. Then let's talk a bit about customer communication. Not as technically detailed as that, but no less important. Talking to the customer is really one of the big things you need to do if you want to be successful here. Telling customers what is happening enables them to understand what you are doing and what they need to take into account to keep running their business successfully. One part of it is proactive communication: we are planning to upgrade the service at this time, and there will be an interruption, expected to be around 30 to 40 minutes, in that time frame; make sure that all your VMs are running in a safe state and that you are distributed across the availability zones, so that you have a scaled application that can cope with the downtime that may come. On the other hand, doing all these preparations did not prevent us from running into some problems and bugs, and telling the customer what happened and why it happened is an equally important part of it. And you know all the channels, like email, blogs and social media, to communicate with your customers. That is really the important part.

Talking about problems, let me tell you a bit about the problems we faced during these upgrades. One point: scaling is really an important factor. As you saw before, we tested this in our staging environments before actually putting it into production, and still we ran into a scaling problem. We had a race condition between Neutron port creation and VM creation: VMs got created faster than Neutron could create their ports, so machines came up with no network. And since we run cloud-init in these virtual machines, guess what happened: it thought, hey, I can't talk to my metadata service, I must be a newly provisioned machine, and it did what it was designed for — it ran cloud-init again. So we ended up with a new UUID and a new SSH host key. Look at that from a customer perspective: I want to SSH into my machine and there is a different SSH host key. That is a disaster. But we told the customers: look, this is the problem, this is why we ran into it, and if you face it this one single time, consider it safe, because it is a problem on our side. We fixed it for the second availability zone, so we did not run into that problem again. But really, that is the point: talk openly to your customers.
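The race described above was fixed on the provider side, but as an illustration of the pattern, a tenant can also reduce exposure to this class of problem by creating the Neutron port explicitly and booting the server only once the port exists, instead of letting Nova create it implicitly. This is a hedged sketch only: network, image, flavor and keypair names are placeholders, and it assumes openstacksdk with a clouds.yaml entry named "otc".

```python
#!/usr/bin/env python3
"""Sketch: boot a server on a pre-created Neutron port."""
import openstack


def boot_with_precreated_port(conn):
    network = conn.network.find_network("my-private-net")
    port = conn.network.create_port(network_id=network.id, name="web-01-port")
    # The port object is returned only after Neutron has created it, so the
    # server cannot come up "faster than its port" as in the race above.
    server = conn.compute.create_server(
        name="web-01",
        image_id=conn.compute.find_image("ubuntu-16.04").id,
        flavor_id=conn.compute.find_flavor("s1.medium").id,
        networks=[{"port": port.id}],
        key_name="my-keypair",
    )
    return conn.compute.wait_for_server(server)


if __name__ == "__main__":
    boot_with_precreated_port(openstack.connect(cloud="otc"))
```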
I mean, Murphy is really a bad... whatever; he always gets it right. That was the one thing we ran into. As you saw yesterday in the keynote, the demo gods got their sacrifice there too. So that was the one thing that happened; it also hit us, but talk about it and fix it.

One other thing: we discovered a last-minute bug, also in Neutron, where a resource cleanup would clean up more than it should. It could accidentally create a wrong vNIC, or too many vNICs, across customers: one customer deletes a machine and it could affect another customer. That really made us do a full stop. It was on the evening before we wanted to do the rollout that night. Going back to that whole delivery chain: it is about detecting a bug and being brave enough to say no, I will not do the upgrade, even though I had already told the customers there would be an upgrade tonight, and telling them, look, it will happen in a week, once we have done a root-cause fix for that bug, pushed it back upstream, retested it in the full regression chain, and can roll it out with confidence that this will not happen.

Some further words on what we discovered during the upgrade. Besides the fundamental issues identified while preparing the upgrades, we also identified a series of concurrency and race problems that exist in the Mitaka community code. Let me take a few examples. The typical configuration of our public cloud environment is an installed base of over 1,000 physical servers within each availability zone, a scale that is seldom encountered in private cloud scenarios. One example is missing database indexes, for instance on the instance UUID of the Nova instance metadata and on the port ID in the security-group-related tables, and missing indexes for multidimensional queries; as a result, batch-query performance degrades dramatically when we run hundreds of thousands of VM queries on top of an installation base of tens of thousands of virtual machines, especially in concurrent rebooting and provisioning scenarios. Another example is the host manager, which raised deleted-instance information on startup. The host manager is a feature introduced in the Kilo release to offload the DB queries for filter scheduling; we identified a bug in this host-manager optimization and fixed it to filter out the deleted-instance information and preserve performance. A similar situation was identified with the DVR mechanism, where L3 agents were being scheduled and unbound on each compute node — behavior that should only apply to centralized routing, not DVR routing. And there are many other examples, such as recalculating NUMA topology when consuming the actual instance on the selected host, to reduce the race window for NUMA and other hardware-specific resource scheduling.
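To illustrate the kind of fix applied for the slow batch queries, here is a toy example of adding a database index on a frequently filtered column. The table, column and connection details are placeholders; the real fixes were made in the Nova and Neutron schemas of our distribution, not in this toy table.

```python
#!/usr/bin/env python3
"""Illustrative sketch: add an index on a heavily filtered column."""
from sqlalchemy import Column, Index, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class InstanceMetadata(Base):
    __tablename__ = "instance_metadata"
    id = Column(Integer, primary_key=True)
    instance_uuid = Column(String(36))
    key = Column(String(255))
    value = Column(String(255))


# Without an index on instance_uuid, batch queries touching tens of
# thousands of instances degrade to repeated full-table scans.
Index("ix_instance_metadata_instance_uuid", InstanceMetadata.instance_uuid)

if __name__ == "__main__":
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)  # creates the table and the index
    print("schema with index created")
```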
Good. So, closing with a few words: what did Mitaka bring us? Really a lot of new features — bare metal, DNS as a service, Heat orchestration — a lot of things that improved our overall portfolio. If you compare that to what we saw at the beginning, we really stacked up a lot of new features that are now available on the Open Telekom Cloud. We are down at the Marketplace booth; if you want to talk to us and have further questions about our offering, just stop by the booth and we will be happy to explain more details to you. So thanks for your attention. If there are further questions, I think Dennis will be around for a few minutes; I have to run to my next talk. So if you also want to learn a bit about security and doing cross-border business with OpenStack in the US, in the European Union and in Germany, just follow me to the next talk. Thank you.