Okay, let's start. Let me introduce my colleague Vladimir Kuklin, a principal deployment engineer and Fuel technical lead at Mirantis, and Sergey Galvatyuk, our senior deployment engineer at Mirantis. Today I would like to tell you how we fought for OpenStack high availability. Apparently this battle is still going on, but we have some good results, so I would like to share the details of how we achieved this while working on Mirantis OpenStack and the Fuel project. All right guys, sorry, some technical issues with the slides... Okay.

So, before we start talking about OpenStack and its components individually, let's define high availability and what kind of high availability this talk is about. High availability, as you all know, is the characteristic of a system to retain availability of service even when it is exposed to failures. In this talk we cover only a single failure of a single component at a single point in time. We do not cover force majeure events, uncommon physical destruction, or deterioration of hardware.

Here is a classical diagram of OpenStack. What is the main problem with it? The main problem is that you need high availability for every component. In reality you should have at least two copies of each component, but I would suggest having at least three to avoid the split-brain scenarios that may arise. Every component should either have its own healing mechanism or be put under arbiter control such as Pacemaker or ZooKeeper. We are not covering data center high availability in this talk; we are talking only about OpenStack.

So what do we need to provide HA for OpenStack? First of all, we need highly available network connectivity; in our project we support almost all types of bonding, including Linux and Open vSwitch bonds. For the database we chose MySQL with Galera, and we achieved really good results by optimizing its management with Pacemaker. AMQP is also a crucial component of the OpenStack architecture: the death of one controller should not affect service availability at any time, and in order to achieve this we had to significantly alter the RabbitMQ deployment workflow along with the OpenStack messaging code. We also had some issues with memcached: it did not handle the death of particular memcached instances well, so we had to do some work on that. For storage availability one can choose proprietary solutions like EMC or NetApp, or fault-tolerant software-defined storage; we concentrated our efforts on Ceph, and it is working pretty well. API services should be load balanced and should redirect requests to alive nodes when something goes wrong. For Neutron you need an HA solution that migrates all the entities from dead agents to alive agents, and for Heat and Ceilometer you also need to ensure that agents can fail over safely from one controller to another.

So, Galera. Why Galera? Sometimes I hear from people that Galera is too complex and that DRBD is an easy solution that is good enough for their reliability needs. However, you cannot really scale out with DRBD, as you have only two instances. So what about master-slave replication? Master-slave replication is a really good, well-known technology, and there is no big difference between Galera and master-slave replication in terms of OpenStack reliability.
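Just to make the Galera part a bit more tangible: the quickest way to see whether a node is healthy and whether the cluster really has all three members is to look at the wsrep status variables. A minimal check could look like this — the credentials here are placeholders, not what Fuel actually configures:

    # Placeholder monitoring user; a healthy node reports wsrep_cluster_size = 3,
    # wsrep_cluster_status = Primary and wsrep_local_state_comment = Synced.
    mysql -u clustercheck -p'secret' -e "SHOW STATUS LIKE 'wsrep%'" \
        | grep -E 'wsrep_cluster_size|wsrep_cluster_status|wsrep_local_state_comment'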
So, going back, here you can see the classical diagram we have in Fuel. All services communicate with the database via HAProxy, and HAProxy high availability is based on a virtual IP which moves from one HAProxy to another in case of problems. However, you can see that all communication goes through a single HAProxy. And, as was stated yesterday by Peter Boros from Percona and Jay Pipes from Mirantis, OpenStack services have some problems here, as they use SELECT ... FOR UPDATE statements and do not have a retry mechanism for failed transactions.

So, MySQL improvements. We focused on MySQL and upgraded it to 5.6.11, which helped us resolve many of the stability issues we had, and we use the latest Galera module. For state snapshot transfer we moved from mysqldump to xtrabackup and got really good results, as we can now synchronize really large databases during state snapshot transfer or incremental state transfer. For HAProxy we added simple checks via xinetd against the database to determine whether a node is in the Synced, Donor, or Desynced state; when a node is in the Donor or Desynced state, HAProxy stops sending traffic to that server.

We also put MySQL/Galera under Pacemaker control. However, our previous OCF script was not mature enough, had some technical issues, and did not reassemble the cluster under many conditions. So we wrote a completely new script that resolves many of those issues and does not require any human interaction to assemble the Galera cluster. How does it work? The general idea is to find the most recent data among all nodes: for MySQL we take the global transaction ID of each node and save it in the Pacemaker cluster information base, and in case of problems we can also get this data from the grastate file. Once we have all this data, we can easily start the primary component from the node with the most recent data. We also implemented the OCF script as a clone resource, so we do not need to create any new primitives when we add controllers.

Next slide — RabbitMQ. When we started working on RabbitMQ cluster resilience, we found that RabbitMQ cluster behavior is not very obvious or clear; for instance, when you restart RabbitMQ nodes the cluster becomes very vulnerable to race conditions, and there is not much you can do about it. So what we did was write an OCF script — it is really complex and has quite a lot of logic — that utilizes Pacemaker master/slave resources and the notification mechanism. On this diagram you can see what it does. When we start the servers, we just start the beam virtual machine and do nothing else. Then, when Pacemaker elects a master node — for instance, let it be node one — we put this information into the Pacemaker database and start the RabbitMQ application on it. Pacemaker then sends notifications to all the other nodes, and they join the master node and also start the application. The status function in this script constantly checks beam, RabbitMQ itself, and whether all the nodes are clustered with that particular master. And if something goes wrong — for example, the RabbitMQ application cannot start because of these race conditions and keeps trying to connect to nonexistent cluster state — we just reset the node completely.

Then, speaking of Oslo messaging: initially we had HAProxy balancing requests for RabbitMQ, but that was not a scalable solution.
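Just to make the RabbitMQ part a bit more concrete: what the notification handler ends up driving on a joining node is essentially the standard RabbitMQ clustering sequence. This is a rough simplification of the actual OCF script, and the master node name here is only an example:

    rabbitmqctl stop_app                      # stop the RabbitMQ application, beam keeps running
    rabbitmqctl join_cluster rabbit@node-1    # join the master elected by Pacemaker
    rabbitmqctl start_app                     # start the application again
    rabbitmqctl cluster_status                # roughly what the monitor action keeps checking

And when the monitor decides a node is hopeless, it comes down to stop_app plus reset and a fresh join — that is the "reset it completely" I mentioned.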
So we were very happy when Oslo messaging was merged and we could use several hosts in the rabbit_hosts parameter, but the problem was that it did not handle connection failures: there was no heartbeat support for AMQP, connections were hanging for a long time, and nothing was happening — consumers were not receiving messages, producers were not publishing messages. We took the first community heartbeat implementation and tried to use it; it was broken. We tried to use short kernel TCP keepalives to kill connections a minute after they become idle; this also did not help. But the good news is that our guys finally worked out a fix. It is already in Mirantis OpenStack and it is on review upstream, so if you think you need Oslo messaging with heartbeat support, please vote for it.

There is nothing very interesting to say about high availability of the OpenStack API services: we use the pretty standard approach with active/active backends behind HAProxy, with some parameters tuned to digest production load, and we have pretty good results. However, we put some services under Pacemaker control, such as the Heat and Ceilometer agents, and we moved HAProxy into a separate network namespace, using the standard mechanism with a veth pair, to eliminate hanging connections.

So, Keystone. We strongly believe that temporary data should be kept in a key-value store, so we had previously switched Keystone to use memcached instead of MySQL, but the high availability of that solution was not really good enough, and we decided to improve it. We had used the standard python-memcached library, which is broken, and we tried to switch to the pylibmc library, which is really good; however, it is written in C and it is not, let's say, eventlet-friendly. So we decided to write a new driver, and our developer wrote one with a pool of connections to the memcached servers and some logic to detect dead servers. However, we still have some problems with memcached, as the logic for choosing a memcached server for sharding is still broken; we have a fix for it which is on review.

Okay, speaking of Neutron. Well, there is nothing interesting about the Neutron server, which sits behind HAProxy, but the devil is in the details when you try to manage Neutron agents safely. First of all, they leave a lot of artifacts behind — IP addresses, namespaces, running processes and so on — so in our OCF scripts for Neutron agents we needed to clean all of this up in the start and stop actions. The second thing you need to do is reschedule entities: routers for L3 agents, networks for DHCP agents, and eventually load balancers for the LBaaS service, which does not provide any way to reschedule them right now. We were using the Neutron API right in the OCF script, because there are API methods to do this, but our guys are working on an API handler and a rescheduling mechanism for Neutron itself. There is actually some automatic rescheduling code already, but we tested it and it had problems with AMQP failover — it was not working in case of failover, which is a pity.

So, Ceph. In Fuel we use the pretty common layout for Ceph, with monitors on the controllers and one OSD per disk on the Ceph nodes. Our developers worked hard to implement live migration, and we added some code to OpenStack to support it; we also added some code for the object and image storage backends.
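Going back to the Neutron rescheduling for a second: with the API it boils down to moving entities off the dead agent onto a live one. A rough sketch with the neutron CLI — the agent and router IDs here are just placeholders — looks like this:

    neutron agent-list                                    # spot the dead L3 agent and pick a live one
    neutron router-list-on-l3-agent <dead-agent-id>       # routers stuck on the dead agent
    neutron l3-agent-router-remove <dead-agent-id> <router-id>
    neutron l3-agent-router-add <live-agent-id> <router-id>
    # the same idea applies to networks on DHCP agents
    # (dhcp-agent-network-remove / dhcp-agent-network-add)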
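And as for Ceph, the usual quick sanity checks on a controller are enough to see that the monitors have quorum and every disk shows up as an OSD, something like:

    ceph -s               # overall cluster status and monitor quorum
    ceph osd tree         # should show one OSD per disk on the Ceph nodes
    ceph health detail    # more detail if anything is degraded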
So right now Ceph works pretty fine and all the code is merged into OpenStack. And now it comes to deployment. When you want to deploy all the aforementioned stuff in a cluster, it is not really easy to do, and we had a lot of things to write for our deployment workflow. These were the Corosync and Pacemaker Puppet manifests originally created by Puppet Labs, which we needed to fork and extend; HAProxy with configuration directory support, to make it work in a granular manner, along with manifests for the Puppet OpenStack modules; and custom OCF scripts for the virtual IPs and for HAProxy, to run them in a separate namespace and avoid the problems mentioned earlier.

So, three slides about deployment. What did we need to do? We wanted constraint support in Puppet because we needed it: in order to run master/slave and clone resources, we had to write code to support them. And in order not to modify all the Puppet manifests for OpenStack deployment, we also needed to provide a Pacemaker service provider for Puppet. What it does right now is parse the local resource manager part of the Pacemaker database, wait for status changes, and respect the timeouts and monitor commands sent by Pacemaker. What is really good is that to use it you just need to change the service provider and add two or three stanzas to make vanilla Puppet manifests work with OpenStack HA. Also, in order to support really complicated OCF scripts like the ones for RabbitMQ and Galera, we needed to switch away from the default upstream shadow approach — where you generate a copy of the Pacemaker configuration database and then apply it back — because it was overwriting attributes that are used by the OCF scripts. So we moved to Pacemaker XML patches, and that is working pretty well.

As soon as we got it deployable, we moved on to stability. The problem was that our service provider was triggering actions globally, and that was affecting deployment stability. So we switched to a Pacemaker asymmetric cluster, where everything is stopped by default, and when we start services we enable them by putting a location constraint with score zero. Then, in order to make all these actions truly local, we also modified the status method of the provider to make it dependent on the resource type: for clones we check whether the resource is started locally, and for primitive resources we check whether it is started somewhere in the cluster. We also had to tune the SQL driver timeouts and all the other stuff, and we needed to introduce short kernel TCP keepalives, because we still need to kill hanging connections as fast as possible, not after two hours.

Speaking of HA scalability: after we won the stability fight, we moved on to another one. We wanted to easily scale our cluster with more controllers, and this is what we faced. Most of our customers do not have multicast configured in their data centers — I don't know why, it is 2014, but it is a fact — so we had to switch to unicast by default. Because of this we have to restart Corosync each time we want to add a new controller, and in order to do this we altered the Corosync manifest to put Pacemaker into maintenance mode, because this is the only way to do it with version 0 of the Pacemaker plugin. We also had to limit parallel controller deployment so as not to exhaust the Galera donor nodes.

So, testing. Testing doesn't come at an easy cost.
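To give a flavor of what a single destructive scenario means in practice, it boils down to something like this — simplified, with an illustrative node name and waiting time:

    virsh destroy node-2    # hard-kill one controller VM
    sleep 60                # give Pacemaker, Galera and RabbitMQ time to converge
    crm status              # resources should be running on the surviving controllers
    # after that, the HA health checks are run against the environment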
So we implemented a special library, which is a tiny piece of libvirt-based orchestration, and we perform our destructive tests on a daily basis using it. We spin up a virtual environment with all the services we need and start running the destructive tests. It is integrated with our CI, so we get results as soon as there are any problems with HA. We wrote our own framework, Fuel OSTF, and we also run Tempest and Rally.

Okay, so speaking of results — what we currently support from the HA point of view. We handle a single failure at a time, whether it is a controller reset (and if you reset all the controllers, they reassemble automatically), a network partition, or a failure of an AMQP node, a database node, or a particular service. All these use cases are covered. So, Sergey, this is your part.

Okay, so current plans and challenges. We need to implement multi-writer support for Galera. As you know, many OpenStack services still use SELECT ... FOR UPDATE or do not retry failed transactions, so we are working with our developers to push this code upstream and allow writes to all Galera nodes instead of one. And the same goes for testing these changes — we are developing tests for them, because we are not really sure that they work completely fine right now. We are also going to introduce fencing support for Pacemaker, using all the types of fencing devices. We also want to implement centralized, event-driven failover handling using common monitoring systems. Memcached still needs some polishing. We also want to concentrate on research into using ZeroMQ instead of AMQP, or maybe a combination of AMQP and ZeroMQ, for messaging. And also, as we already have a working HA test framework that can test your installation completely, we are going to introduce our CI gate into the OpenStack Jenkins to see whether any particular commit breaks an HA installation — as soon as we provide the ability to deploy multi-node HA environments from particular OpenStack commits, which I think will come in less than a month.

So if you are interested, you can always check out our project and its wiki page. The actual deployment is done by the fuel-library subproject, and we periodically write down all the HA fixes into Etherpad documents, so you can check them out and see whether there is something useful for you. And you can always contact us in the Fuel IRC channel or via the OpenStack mailing lists. Finally, I would like to thank all of the guys who worked hard to make this HA story work: Aleksandr Didenko, Bogdan Dobrelya, Roman Podolyaka, Yuriy Taraday, Serge Millikan, Stan Lagun, Sergey Vasilenko, Matthew Mosesohn, Dmitry Ilyin, Dmitry Borodaenko, Andrew Woodward and Cyr Lapova. I don't see here Tatyana Leontovich, Igor Katko, Artem Pachynko and Andriy Stazinski. And thank all of you for your attention. Thank you.

So if you have questions, just ask — we have 14 minutes. Raise your hand. Could you please pass the microphone? I think this session is recorded. Yes. Right. First of all, guys, Mirantis Fuel is awesome. Good job, guys. Thank you. My question is around the roles that you provide, especially your controller nodes: all your API services, database and networking, everything runs on the controller nodes. Have you got any plans to split that controller role? The answer is yes. We are going to...
It is a limitation of our deployment architecture right now, but I think in half a year we will be able to deploy any particular configuration and combination of roles you want on any node. Thank you. I've got another question, on your logging: all the logs go to your Fuel master. Have you got any plans to split that logging and send it somewhere else? Actually, on the Fuel master we have an option to send all logs to a remote server, so you can use that even right now. But yes, we are going to redesign our logging architecture, and the guys from fuel-library are working on that.

Yes, sir. You mentioned Ceph live migration — how did you guys manage to do that? I think the best person to explain that is Andrew Woodward; he actually worked on this part. I'll catch you later. You can tell it right now if you want.

You mentioned something about the Pacemaker catalog and some way you can track which services start and in which order using Pacemaker — so you read information from the Pacemaker database somehow to track this? You mean the Pacemaker Puppet service provider? Yes. We are just parsing the XML from Pacemaker, parsing the local resource manager part. Eventually, it is completely based on the code that is in the policy engine, which shows you all the information about which state the services are in — crm and pcs show the same information. It is rewritten in Ruby and it just does the same stuff. So, basically, to start the services in a certain order you parse the Pacemaker XML file? We parse it to derive the resource's properties and status. In order to start a service, we put the Pacemaker cluster into asymmetric mode and then add a zero-score location constraint if the resource is not started anywhere; then we unban the node, Pacemaker starts the service, and we wait for it to start or stop — whatever you want to do — and see whether it is running. It behaves just like a regular service, so we only need to alter a tiny piece of the Puppet manifests. All right, thanks. However, there is still some tricky logic, as you cannot tell Pacemaker to reload just one service — it would reload it on all nodes. So we put constraints: we ban one node, reload the service there, and then add it back to the cluster, just to avoid such issues hitting all controllers at once. All right. Okay, thanks.

More questions? We still have about five minutes for questions. I think Fabio is surprised. Yeah. Actually, yes, we have plans, and we are discussing with the Puppet Labs community adding this upstream. But you can actually use our OCF scripts even right now — it's in our code, it's open source. Yes, the answer is yes: we are trying to be as open source as possible. Sergey, it is awesome. It just went slipping. Okay. Thank you, guys.