Yeah, so it seems like we're about on time, and I believe we can begin. So thank you all for joining me today for our first early session of the day, and thank you for joining me after last night's parties. Actually, I was quite surprised that people came after the parties; usually people take their time with their hangovers in the morning. So thank you all for joining.

So I'm Arthur Berezin, Senior Technical Product Manager for OpenStack, based out of Israel. Today I'm going to talk about high availability for OpenStack: basically, the architecture of how to build a highly available OpenStack environment, what we should concentrate on, and what we should resolve while building it. I'm going to focus on the enabling services we can use to build a highly available OpenStack environment and to make sure all the services survive failures. I'll discuss the shared services, like MariaDB and the message queue, which we obviously have to use and have to make fault tolerant. I'll cover the various OpenStack services, or at least the basic ones, as we don't have too much time today. And I'll end by covering some of the topologies and what we should think about when we build such an environment.

So before we begin, who of you has an iPhone in their pocket? All right. And Android users? All right, much more. So I'm hoping you're all familiar with the QR code function. You can use it now instead of taking photos of the slide deck: feel free to take a shot of the QR code and just download the deck from SlideShare. I'm going to share it at the end of the session as well, so if you don't have a chance to do it now, you'll be able to do it then.

So this is the environment we're trying to build, right? We're building our top-notch environment with high-performing compute nodes and high-performing controller nodes, and everything should always be running, without tolerating any failures in the environment. If something slips by just a millisecond, we've lost the race. But as we all know, reality is not always that simple. Failures do happen, and we have to account for them when something fails. With OpenStack, obviously, if one of the controlling services fails, it doesn't mean the actual virtual machines stop functioning, right? The environment probably keeps on running, depending on the failure, obviously. But it is extremely important to make sure our controller services are fault tolerant, because things always happen while the environment runs, and we need to make sure we have a spare wheel that keeps the car going after a failure occurs.

So we are trying to build active-active services. We don't want to wait for failovers to happen, because failovers take time: it takes time to start a service on another node, for example, or to bring the first node back up. So we would like to make everything as active-active as possible. Obviously, there are some limitations in OpenStack today that we're still trying to solve; it will take us another cycle or two to solve them, and there's a lot of very good progress on those services as well.
And again, if nothing else, if a service itself does not support an active-active configuration, we can still fall back to active-passive, but we would like to do that as little as possible. And we would like to create a scale-out environment, right? We want to make sure our environment can serve as many requests and as many users as possible. And this is the reality we've all seen in this diagram: OpenStack is a fairly complex system constructed out of many services, and we need to make sure each and every one of them is highly available.

So first of all, I'm going to cover some of the enabling technologies that we use to build a highly available environment. The first is Pacemaker. Pacemaker is a cluster resource manager. Basically, it monitors a bunch of resources, and once it detects something happening to one of its resources, it makes sure the resource comes back to life, either by restarting the service or by fencing the node; we're going to cover that in just a bit.

So Pacemaker supports various types of resources. One of them is the virtual IP, which is basically a floating IP address that can move from one host to another: the virtual IP resides on one host at a time, giving us a single IP address for all the nodes that run the service. Pacemaker also supports monitoring of systemd services. So we don't run the services under systemd directly; rather, Pacemaker monitors those services and controls them, so we don't have to enable the systemd services themselves. Pacemaker takes care of that. Pacemaker also supports cloned services, which basically means an active-active configuration: when a Pacemaker resource is configured as a clone, the service runs actively on all the hosts the cluster is configured on. And another very important function of Pacemaker is the "shoot the other node in the head" (STONITH) function. When something weird is going on and Pacemaker is not able to reach a service or restart it, for example when a failure occurs while restarting a service, we have to make sure we have full control, right? We don't want to create any split-brain scenarios, or any scenario where a service is out of control. So we should be able to control the host itself using its power management, most commonly the IPMI device of that host.

So for OpenStack, we use Pacemaker by creating a virtual IP address for each and every API endpoint. The users address the virtual IP address, so once a failure occurs, you don't have to reconfigure anything, and you don't have to let other services know of a new IP address: you just use a single IP address that moves between the nodes. We also use the systemd clone resource functionality, which means, again, Pacemaker monitors the services themselves and controls the systemd resources. And we configure the shoot-the-other-node-in-the-head functionality, in case something funky happens in the environment. Pacemaker is managed through a service that runs on each host, called pcsd, and for fencing it uses the IPMI or remote management device of the host.
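To make that concrete, here's a minimal sketch of what those three pieces might look like with the pcs command-line tool. The resource names, addresses, and credentials are all hypothetical placeholders, and exact fence-agent parameters vary between versions:

```sh
# Virtual IP that floats between the controllers (placeholder address):
pcs resource create vip-public ocf:heartbeat:IPaddr2 \
    ip=192.0.2.10 cidr_netmask=24 op monitor interval=30s

# A cloned (active-active) resource wrapping an existing systemd unit;
# Pacemaker, not systemd, now starts, stops, and monitors the service:
pcs resource create horizon systemd:httpd --clone

# STONITH fencing through the node's IPMI device (credentials are
# placeholders), so a misbehaving node can be power-cycled:
pcs stonith create fence-ctrl1 fence_ipmilan \
    pcmk_host_list="controller1" ipaddr=192.0.2.101 \
    login=admin passwd=secret
```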
Another technology that we use is HAProxy, which does load balancing. HAProxy is obviously a very popular web load balancer and proxy. It's capable of handling HTTP and TCP requests, and it does that very well; this is why it's extremely popular for web applications. It also does health-check monitoring: it makes sure that the addresses it balances to are alive, and if something is wrong, HAProxy stops sending requests to nodes that are not responding. And obviously it does load distribution. We can use various load-distribution policies with OpenStack, and it really depends on the service. I'm going to cover some of the services and their policies, and by the end of the session I'm going to point you to the complete reference architecture where you can see the recommended configuration for each of the services. Some of them use round robin: when we have completely stateless services, we don't really care whether the session goes to the same node every time a user makes a request, so HAProxy just does round robin across all the nodes it's configured to use. Or we can use stick tables to make sure that a session that started using a specific node keeps using that node for its next requests. HAProxy also gives us API isolation, so we don't address the APIs directly. And it does failure detection: if one of the nodes fails, HAProxy stops distributing load to that node.

So let's go over the life cycle of such an environment using those components, taking Horizon as an example. A user tries to connect to the Horizon web UI; it could be Horizon or any other stateless service. Obviously this is a single point of failure, which we would like to address first. So we can put HAProxy in front of it, and HAProxy distributes the load between the three nodes running Horizon. Now, this still leaves us with single points of failure, because HAProxy by itself is a single point of failure now, and each of the Horizon components could also fail, and we should make sure those services are restarted once something funny is going on. So we build an HAProxy and Pacemaker cluster, a cloned cluster, around the Horizon services, which takes care of any failures of HAProxy and of the Horizon services. If something is going on, Pacemaker basically takes care of that and restarts the service, and if it fails to restart the service itself, it restarts the node. For HAProxy, we can also create a Pacemaker cluster for HAProxy itself, so if something happens to HAProxy and it fails, Pacemaker can restart the service. And we can also use a virtual IP for the service. So basically, the flow for this environment goes to the specific node that currently holds the virtual IP, and that node load balances the incoming request and sends it over to one of the Horizon instances.
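A sketch of what that Horizon front end might look like in haproxy.cfg; the addresses and node names are hypothetical, and the bind address is the Pacemaker-managed virtual IP:

```
listen horizon
    bind 192.0.2.10:80                  # the virtual IP from Pacemaker
    balance roundrobin                  # stateless service, plain round robin
    option httpchk                      # HTTP health checks, so dead nodes
                                        # are pulled out of rotation
    server controller1 192.0.2.11:80 check inter 2000 rise 2 fall 5
    server controller2 192.0.2.12:80 check inter 2000 rise 2 fall 5
    server controller3 192.0.2.13:80 check inter 2000 rise 2 fall 5
```

For a session-sensitive service you would swap the balance policy or add a stick table, as described above.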
So for the shared services: first of all, for the database, we can use Galera with MariaDB. What Galera does is multi-master replication for the database; it does row-based replication, and it basically allows us to have multiple active nodes running the database service, where accessing any of the nodes gives consistent data across all of them. It also supports node auto-joining, recovery, conflict resolution, and so on: lots of really cool functionality. Basically, this is literally what we need to make the database highly available, and specifically for OpenStack, this is a really good fit. It also supports the native MariaDB client APIs, which means we don't have to introduce any changes to the environment itself: we can just point a service at the IP address of a node, and the replication happens underneath.

The other functionality we use is RabbitMQ mirrored queues, a clustering functionality of RabbitMQ, which basically replicates the message queues of each of the services across multiple nodes. This is functionality of RabbitMQ that we can leverage to make sure the message bus itself is highly available.

So, on to the services themselves. Keystone is a rather simple service. It runs under Apache httpd, obviously multi-threaded, and it accesses all the identity resources: it accesses the database for assignments, and it can also store the identities in the database itself. So again, we use Pacemaker and HAProxy to make Keystone itself highly available. It's a rather simple setup: we have the virtual IP running on one of the nodes, the node that holds the virtual IP load balances the API calls for Keystone, and the Keystone service accesses the environment. The important thing to note here is that each of the hosts should have identical SSL certificates; otherwise, you're going to see some weird behaviors. So when we're configuring this environment, we have to make sure we copy the SSL certificates to all of the nodes. We also need to note that the caching of each of the services is local, so it would obviously be a bit more efficient to have the calls from a given client land on the same node every time.

Glance is also a rather simple service. It has two Linux services: one is glance-api, which takes care of all the API calls and also accesses the storage, and the other is glance-registry, which communicates with glance-api over HTTP and takes care of registering all of Glance's interactions in the database. So here, just as we did with Keystone, but this time for both services, we have two virtual IP addresses: one for glance-api, the other for glance-registry, and both of them are load balanced across the nodes. So in terms of flow, we have one node running the load balancing for the API calls and another running the load balancing for the registry calls.

Cinder is a bit more complex, but also not that complex; we have one specific issue with it. It has four services. The first is cinder-api, which takes care of all the API calls. We have cinder-scheduler, which is responsible for placing volumes: once I send a call to create a new volume, the scheduler decides on which backend to create that specific volume, based on filters and weights, obviously. And cinder-volume itself makes all the storage calls: it uses the driver to access the storage device we're using via the management path, whether it's distributed storage or traditional storage, it doesn't really matter. And then the API passes the data path over to Nova, so the VMs access the storage directly. The thing to note here is that cinder-volume is still not capable of running active-active, due to some potential race conditions: for example, if you make the same volume-creation call for multiple volumes, there could be some conflicts while the environment runs. So it is currently recommended to run cinder-volume in an active-passive configuration.
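In pcs terms, active-passive is simply the default: you create the resource without cloning it, and the cluster runs exactly one instance, failing it over on node failure. A minimal sketch, assuming the systemd unit names used by Red Hat-style packaging (verify them on your distribution):

```sh
# One instance cluster-wide; no --clone, so Pacemaker keeps a single
# active copy and fails it over to another node if the host dies:
pcs resource create cinder-volume systemd:openstack-cinder-volume

# Compare with a stateless service, which we clone to run everywhere:
pcs resource create glance-api systemd:openstack-glance-api --clone
```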
But lucky us, we have the design sessions during the OpenStack Summit as well, and this is one of the topics the Cinder developers are currently trying to address, with various locking mechanisms for cinder-volume to make sure there aren't any race conditions within the service itself. And cinder-backup is still active-passive as well; I'm assuming that once cinder-volume is taken care of, we can apply the same principles to cinder-backup.

So Nova is constructed of the nova-api service, which obviously handles all the API calls. We have nova-scheduler, which does the VM placement: it chooses the host for each VM based on weights and filters. We have nova-conductor, which does all the database access: basically every operation that the nova-compute service performs, nova-conductor registers in the database on its behalf. And nova-compute runs the actual virtual machine instances, using various drivers, most commonly KVM with the libvirt driver.

So usually OpenStack is deployed in a configuration of controller nodes and compute nodes. All of the controller services are basically stateless, and all of them are capable of running in an active-active configuration without any issues. But the compute services themselves are still their own single points of failure. We do have systemd monitoring the nova-compute service, so if something happens to the nova-compute process, systemd tries to recover it. But we don't have any VM high availability, and that's not yet fully supported. We do have some reference architectures trying to solve this, and there are a few blueprints for Nova to enable it and to help Nova recognize much more quickly that a node failed. But this is not something that is fully supported, at least by Red Hat.

The way it should work, as those reference architectures describe, is that we use something called pacemaker_remote. What this means is that the compute node is not a full member of the Pacemaker cluster, since Pacemaker supports only up to 16 nodes with Corosync; rather, the cluster only monitors the node's state through the remote connection. It doesn't run the full cluster stack on the node itself, but it can still use its control mechanisms to make sure the node is rebooted once something weird is going on. And once we can be sure the node itself was rebooted, we can start its instances on another node. Now, one little disclaimer: this still relies on shared storage for the VMs. We cannot use ephemeral storage on local disk, obviously; this will not work with local storage.
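A sketch of how a compute node might be attached as a remote node, assuming the stock ocf:pacemaker:remote agent and a hypothetical host name (the compute host runs the pacemaker_remote daemon instead of the full cluster stack):

```sh
# Register the compute host as a remote node; the cluster monitors it
# over the pacemaker_remote connection without it counting toward the
# 16-node cluster limit:
pcs resource create compute-1 ocf:pacemaker:remote \
    server=compute-1.example.com reconnect_interval=60
```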
So Neutron: one of the most complex services in OpenStack today, I think. It's constructed out of several services. The first is the Neutron server, which handles all the API calls and all the management for Neutron, runs all the commands, and so on; it uses the message bus to communicate with all of its agents. And it is the only Neutron service that actually makes database calls, saving all the network state to the database. There's also a Layer 2 agent, most commonly used with Open vSwitch, but there's a bunch of other underlying supporting technologies as well: Linux bridges and various vendor plugins for Layer 2 agents. The Layer 3 agent is a more interesting service: it does all the routing and all the SNAT configuration for Neutron, basically allowing northbound connections and routing between virtual networks. We also have the DHCP agent, which provides DHCP services for the instances: once we boot an instance, it gets its IP from the DHCP agent, which uses dnsmasq in the default configuration. And we also have the Load Balancing as a Service agent, which uses HAProxy for load balancing.

So if we take a look at the flow of this environment, again in a common deployment model: all the nodes that do any networking activity, which is obviously the compute nodes and the Neutron network nodes, run the Layer 2 agents, which are responsible for the Layer 2 configuration on those nodes, creating all the tunneling needed for communication, and so on. And then we have the Neutron network node running the Layer 3 agent, which does all the routing and creates all the routers, to enable communication between the various tenants and to the outside world, the external network, or the internet, obviously.

So the Neutron server itself is rather easy. It's a stateless service, as much as any other API service, so there's no problem running it in a cloned configuration on all the nodes simultaneously. For DHCP, there were a few changes in the latest cycle, the Kilo cycle, that enabled running the DHCP agent in an active-active configuration. This is a Neutron configuration option that you have to enable, and you also have to make sure that the DHCP agent, the Linux service that runs on the host, is highly available as well, so you create a Pacemaker clone on top of it, in addition to the Neutron configuration that enables the DHCP services to be highly available.

The second part is the Layer 3 agent, which is responsible for creating the virtual routers on the network nodes. This also relies on changes that happened in the Kilo cycle, enabling VRRP between the Layer 3 agents. The Linux service itself runs in a cloned, active-active configuration across the nodes, but for each router, since there's VRRP communication between the Layer 3 agents, we create another passive instance of the router on another node. So in this example, if one Layer 3 agent fails, we already have a passive copy of the router for this environment, and if that node fails, the other router becomes active and starts serving as the router. So as I mentioned, a few changes in the Kilo cycle enabled HA with VRRP for the Layer 3 agent, which was one of the major failure points for Neutron, and also enabled HA for the DHCP agents.
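A sketch of the Neutron side of those settings; these are the Kilo-era option names as I recall them, so treat the exact names and values as assumptions to verify against your release:

```ini
# neutron.conf fragments (hypothetical values)

[DEFAULT]
# Schedule each network's DHCP service onto several agents:
dhcp_agents_per_network = 3

# Create VRRP-based HA routers by default, with one active instance
# and standby instances on other L3 agents:
l3_ha = True
max_l3_agents_per_router = 3
min_l3_agents_per_router = 2
```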
And there are some plans for Liberty as well around DVR, which allows us to create distributed virtual routers on each of the compute nodes. What this would mean is that the routers would be distributed across all the nodes, so we wouldn't have to create the single Neutron network nodes. Another very interesting configuration is running DVR, distributed virtual routing, alongside VRRP: for example, we would have the routing itself done within the compute nodes, but run the northbound traffic, accessing external networks, through VRRP. This would give us a really reliable environment that can segregate the external networks, while east-west traffic is routed without going through the Neutron node. And there's another initiative to make the DHCP services fully distributed as well, but this is not fully targeted yet and is still under research.

So going back to Horizon, one of our simplest services. Horizon uses the APIs of all the services to display its data and to make any changes in the environment: it makes direct API calls to all of the services. Architecturally, it's actually as simple as Keystone: a simple Django application making the API calls, running under Apache httpd. So again, we have here a rather simple active-active Pacemaker configuration with HAProxy in front.

Now, I'm going to address this part a bit cautiously, since each and every one of us has their own environment and is interested in implementing their own best practice for their own use case. There are multiple use cases for OpenStack, obviously; we all love OpenStack because it's such a robust system that allows us to build these configurations. So the most common configuration is multiple controller nodes: say, three controller nodes running all the services, including the database, the message bus, and all the controlling services, with multiple compute nodes accessing the environment and serving as the VM hosts. But we can also segregate some of the services. For example, we can pull the Neutron services out to run on their own dedicated nodes, so all the networking traffic is handled by those nodes and our APIs are segregated from the network nodes. As another example, we could segregate our storage nodes, running Cinder and Glance on their own dedicated hosts that access the storage, without giving the controllers that run the API services access to our storage devices. Another example would be to segregate the message bus and the database onto their own dedicated hosts, which basically makes the environment more scalable, since each of the services can serve more calls.

So, some resources: you've got the QR code to download the deck. There are a few really interesting resources and really great blog posts on the various components; I highly recommend going over them. And there are a bunch of really great talks, both by Red Hat folks and by people actually deploying such environments in the real world, discussing the various trade-offs, which services you could use, Pacemaker versus Keepalived, and many other configurations. So I highly recommend going over the list in the schedule; there are a bunch of really great talks. Thank you, and you can download the deck if you'd like.
And if you have any questions, there's a mic over there, so feel free to go over to the mic and ask. Anyone have a question?

Yeah, what is the status of, I noticed you didn't mention the Tooz library, or any of the ZooKeeper work, to overcome some of the problems that people were having with the MariaDB Galera locking?

Yeah, so the question was how we address the locking issues with MariaDB. So yes, there are some locking issues resulting in conflicts that you would have to resolve manually. At the moment, the best practice is to use a single node for write access, so that it doesn't create any conflicts, and multiple nodes for reads; this basically avoids the conflicts. But again, this is still under investigation by the Galera team itself, so it should be resolved sometime, hopefully.
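A sketch of how that single-writer practice is often wired up when HAProxy fronts the Galera nodes; the addresses and node names are hypothetical, and all traffic lands on one node while the others only take over on failure:

```
listen galera
    bind 192.0.2.10:3306
    option tcpka                        # keep long-lived DB connections alive
    server controller1 192.0.2.11:3306 check
    server controller2 192.0.2.12:3306 backup check   # used only on failover
    server controller3 192.0.2.13:3306 backup check
```

Read traffic could be balanced across all three nodes through a separate listener if needed.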
This is from a scaling perspective: we can start with three controllers initially, but then how do we go about identifying the scaling bottlenecks?

Yeah, so the question was how we identify the bottlenecks. This is a good question, and it's actually a really tough one: you would have to go service by service and identify the bottlenecks. There are various tools for performance measurement to figure out which of the components, which of the services, is your bottleneck, and then you scale that specific service out, node by node, making sure your environment survives the scale that you're serving. But basically, the best practice today is going service by service, identifying the bottlenecks, and then addressing them specifically.

Do you have any use cases for leveraging this HA mechanism for upgrades? For example, because you have three controllers, you could upgrade one of the controllers while another one is active. Do you have experience or plans, or can you comment on that?

Yeah, so upgrades. At the moment, do you mean automatic upgrades or just upgrades in general?

No, just upgrades in general, because if you have three controllers, you can basically just shut down one of the controllers to do the upgrade, right, while the other two are still active and running.

Yeah, so there are various strategies for upgrading. You can do a controller-by-controller upgrade; that's one strategy. Another strategy you can take is running service-by-service upgrades, which is also possible in some cases.

Don't you guys have a clear reference on how to do this?

So yeah, actually we do address all of them in our documentation, both upgrading whole controllers and going service by service. Again, it's up to the use case; there isn't a single best strategy, right? You would have to choose your own.

So my next question is about some of the services in your presentation that were covered by systemd. systemd can make sure that a service is running, right? And at the same time, they were covered by Pacemaker. Are you aware of any problems or conflicts with this? Because I see a potential place where, in edge-case scenarios, you can have problems that are tricky to troubleshoot and resolve.

Exactly, so the question was what happens when a service is managed by both systemd and Pacemaker. So exactly: you don't run the services managed by both. You have to enable them in only a single place, either systemd or Pacemaker. So for example, for VM HA, you would have to make that run by Pacemaker, not systemd, but most commonly it's done via systemd.

Microphone, come on, we want to hear you.

Pacemaker actually coordinates with systemd. Pacemaker can speak to systemd and say: systemd, I want you to start this, but I want to have control over it. So Pacemaker is actually controlling services through systemd, in a way. systemd by itself can automatically restart and monitor things, but Pacemaker can start a service and say: I don't want systemd to monitor this, I want to monitor it and have complete control over the service. So we're utilizing the fact that systemd has these nice unit files and things like that for Pacemaker to manage.

So yeah, thanks. Any more questions? All right guys, thank you so much, and thanks for making this early session.