He's going to turn me up here in a minute, so I don't have control over it. There we go. That's much better. Oh, I don't like that. Can you hear me? I can hear myself in stereo. Good morning, everybody. My name is Shane. Hopefully you're here for control plane architectures. If you're not, the doors are right there; please exit in an orderly fashion. Let's see. Many of you like to take pictures of slides, and I'm not going to discourage that, but if you have a QR reader, my slides are posted on SlideShare already, so you're welcome to just take a picture of the QR code. It'll be on some of the first few slides, so if you miss it on this slide, don't worry, it'll also be at the very end. Then you don't have to take pictures and try to remember the context of an individual slide. Or maybe you just thought my slides were so awesome you want to rip them off. That's OK.

So, very important stuff. The lawyers at ZeroStack would be very upset with me if I didn't put up the very important legal disclaimers and information. Hopefully you read all of that and understand it, because I don't understand lawyers, that's for sure.

About me: my name's Shane, and I'm a cloud infrastructure architect. It's been a long journey for me. Starting in the Marine Corps, I was a mainframe computer operator, tape ape, paper jockey; I moved into mainframe hardware, then worldwide support, and this thing called UNIX operating systems. From there my career trajectory has taken me through systems administration, systems architecture, network engineering, network architecture, security, and most recently cloud infrastructure architecture. I've worked at a number of companies from small to big, so I've seen a tremendous range of scale, from small startups to big enterprise companies. Most recently I was at Symantec, where I was responsible for the founding technical expertise on the cloud platform engineering team that started an internal OpenStack cloud platform solution for Symantec. When I left Symantec to go to ZeroStack, we were already at about 2,000 physical systems within the CPE environment, and they've only continued to grow since then. So I've seen scale from small to big, and hopefully that'll be reflected throughout my conversation here as I help guide some of you on your journey to cloud.

I have some fancy transitions here, which I tried to remove and failed. What we're going to talk about today: our agenda, what we're going to talk about, and what we're not going to talk about. It's kind of important to couch the conversation in terms of some of the things we won't talk about. Not to say that they're not important. I could probably stand here for three weeks and pontificate about architecture and all the very different aspects, pieces, and parts of it, but they didn't give me three weeks. They gave me, what did I get, 40 minutes? OK, 40 minutes to talk about it. We'll talk about the problem statement, needs analysis, a solutions summary, and some of the traditional wrap-up stuff. I've been asked that if you have questions, please use the microphones here, since this is recorded. If you do have questions, please feel free to interrupt me if they're important to the context of the conversation at that moment. I don't mind being interactive with the presentation, but if the questions can be held to the end, let's do that so we can power through all these slides.
Like I said, I tried to jam three weeks of discussion into about 40 minutes. Am I auto-advancing? Did it auto-advance on me? All right, so we'll talk about the control plane and the data plane. It's important to get a level set, since we're talking about control plane architectures, on what the control plane is and what the data plane is. We won't go into too much detail, because I think hopefully most of you have a basic understanding of what that is; I just want to make sure everybody in the room is at the same level. We'll talk a little bit about general HA, how you apply HA design patterns to solutions architecture in general, not necessarily as it relates to OpenStack. And then we'll discuss how it relates specifically to OpenStack in the four primary architecture design patterns that I see.

The first is standalone. And I'm serious about this. A lot of people will scoff when I say standalone for an OpenStack control plane, especially if you want to scale it, but there are some good reasons why you might actually choose standalone over something more complex. We'll talk about active-passive. We'll talk about fully redundant, or active-active. And we'll talk about something that's a little different from what most of you may have seen in the past, and that's a distributed control plane embedded throughout the cluster. And we'll talk about the architecture and the design solutions. One of the things I'm going to try to do is cover the architecture from a high-level, abstract view and not get too deep in the weeds. So some of you who do know these services are probably going to be poking holes and going, well, what about this? What about that? It doesn't apply to this version, or to this part of the Nova API suite, or whatever. You're absolutely right. Like I said, I want to keep this at a high level, so there are some parts and pieces of each of the individual services where the discussion of the architecture may not apply. We're trying to keep it relatively high level, but also architecturally deep enough that it gives you some good ideas and information on the architecture of your control plane.

So what we're not going to talk about: ancillary services. Ancillary services are important because they do a lot of the things that back our cloud infrastructure and make it actually work. We're not going to talk about things like how you design the Active Directory and LDAP that back your Keystone instance, because that's probably a week's discussion right there for the people who love Active Directory and LDAP. I hate it, but it's very important. You could go on and on about those sorts of pieces and parts. Server load balancers: server load balancer architecture on the surface seems very simple. The reality is it can be incredibly complex, all the way from hardware-based server load balancers to appliances to the good old standby that most people in OpenStack are familiar with, which is HAProxy. OK, so I'm lying a little bit. We will talk about load balancers, because they're critical to HA and your control plane, but we're not going to talk about complex load balancer architecture, if that makes sense. The network control plane has its own set of issues and problems.
If you're running anything more complex than just Neutron with one of the basic plugins, if you're running something like OpenContrail from Juniper, the minimum bar of entry, if I remember correctly, is now five controllers just for the network control plane. So we're not going to talk about that. It's very complex and will be specific to each and every deployment and each and every SDN configuration you go with. Physical infrastructure we'll touch on very briefly, only because it's important in a standalone control plane environment. Complex DBs... you get the picture, right? There are a lot of things we could talk about.

So what is the control plane? I served in the Marine Corps, so I'm very familiar with Marine drill instructors, drill sergeants. And the drill sergeant is actually very much like the control plane. They bark orders at you; you as a recruit, you do them. You're the data plane. So that's a real basic analogy from my old background that might make sense to some of you. When we relate that to OpenStack, the control plane issues commands: create a port, create a network, create a subnet, create a router, attach my instance to this. Those are all commands and signals that are initiated by the control plane. The actual doing of it is the data plane; in this case, it's our poor little recruits. If you feel dizzy and you want to pass out, please pass out at the position of attention. I have been told that in Marine Corps boot camp. The data plane we kind of talked about. It's your east-west traffic. It's actually instantiating the instance. It's taking the Glance image and building a VM out of it. It's all of that heavy-lifting work. It's the network conversation, east-west, north-south, in and out of your environment. That's these poor recruits in this example.

So let's level set. Let's talk about the problem. What are we trying to do? Most people's trajectory when it comes to OpenStack is that they want to learn a little bit about OpenStack, they want to see if it's applicable for their environment, and they want to build an OpenStack POC. So they'll go a number of routes. There's simple DevStack in a VM. There's DevStack on its own. There are all kinds of recipes and all kinds of ways you can get a quick and dirty OpenStack solution up and running. And some of you are going to be going, damn, I've mastered this stuff. What are all these people talking about, OpenStack's complex and hard? Come on, I got DevStack up and running in a day, right? Got a VM going, woo-hoo. OK, so the reality is that when you take that initial POC experience and you want to take it to production, the world shifts dramatically, and the complexity of OpenStack starts becoming very overwhelming very quickly. Typically you're going to be faced with: how do I architect this? What do I need to put in place? How many services do I need, et cetera? So basically you're going to need to step back and do what I refer to, borrowing from the test-driven design models, as test-driven architecture. In this case, what's the test for your cloud platform? It's how your users are going to use it, how your users interact with it, whoever your user base is. It might be you as an internal department. It might be departments external to yours that are consuming it.
You might be a service provider that's providing it as a platform service that's either focused on a specific application, or it might be a general-use cloud platform solution of some sort. So you need to understand your users' needs and requirements, what their expectation is of that cloud platform, and how they're going to use it, because all of those inputs are going to drive your decisions on your control plane and, yes, ultimately on your data plane and all the other pieces and parts. We're going to talk about the control plane here today.

One specific note, and you're going to see this thread throughout my slides: if your scale is a dozen hypervisors, a couple dozen hypervisors, maybe just a small platform of 100 to 200 hypervisors, complexity can be a killer. We'll talk about that a little further on. But complexity may be required depending on your requirements. If you have a five-nines requirement, first of all, that's tough, folks. That's really tough, five nines. That's less than a second of downtime per day, aggregated over the year. So you may very well have a high-availability requirement that's going to dictate and drive a complex design. But you need to know what you're getting into so that you don't get in over your head.

So, needs analysis. Pretty basic; we talked about it a little bit previously. Understand your user base; understand your test-driven architecture. How is your platform going to be exercised and used? How much uptime do you need? And this is where you're going to have to be honest with yourself. Gut check: how good are the people you have on staff to do this? How much can you hire to get up to speed? How much can you borrow or rent from one of the services companies orbiting around the OpenStack sphere? Those are all important metrics that might go into your decision process, and into how much you build right now versus how much you might build later. You might need to do an iterative architecture design, where for the next six months to a year you use something very simple that you can get up and running and working in operation, and then in six months to a year you know you're going to have to rearchitect it and you might need to scale to something a little bit bigger. Those are important considerations in how you design today versus for the future.

So, some of the things you might come down to, the black-and-white metrics that a lot of the pencil pushers or management are going to want to know: what are your availability requirements? 98%? 95%? Most people will never shoot for 95%, but realistically it might be right for you in some situations. If you're just a small shop with a dozen hypervisors and you're doing just playground sort of work, then who cares; 95% on your control plane might be sufficient, if that's what you're shooting for. Don't go build yourself a 99.999% control plane for that sort of environment; you'll end up with something far worse, like 70 or 80% uptime. So be realistic about what your requirements are, so that they can guide your architecture and design decisions. I don't expect anyone to digest this slide, but the key takeaways are: 95% in a day is an hour and 12 minutes of downtime; 99.999% is, what did I say, 0.9 seconds, less than a second of downtime in a day. So this is a map you might look at if you're going to be uptime-driven in terms of your control plane architecture and how it relates to your design. This is based on the leap year and whatever. Match your uptime and downtime thresholds; determine how much talent you have.
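As a quick worked version of that slide's math (just an illustration; the availability targets below are examples, not recommendations), the downtime budgets are easy to script:

```python
# Rough downtime budgets for a few availability targets. Illustrative only;
# the targets listed are examples, not recommendations.
DAY_S = 24 * 60 * 60
YEAR_S = 365.25 * DAY_S          # leap-year-averaged, as mentioned on the slide

def downtime(availability_pct, period_s):
    """Seconds of allowed downtime in a period at a given availability %."""
    return period_s * (1 - availability_pct / 100.0)

for target in (95.0, 98.0, 99.0, 99.9, 99.99, 99.999):
    per_day = downtime(target, DAY_S)
    per_year = downtime(target, YEAR_S)
    print(f"{target:7.3f}%  {per_day:10.2f} s/day  {per_year/60:10.1f} min/year")
```

Running it reproduces the numbers quoted above: roughly an hour and twelve minutes a day at 95%, and well under a second a day at five nines.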
We talked about that a little bit. It's critical not to get yourself in over your head. I've done it many times myself, and I've spent many, many late nights up surfing the web, and deep in books before the web was a big resource. I had stacks of books like this learning that UNIX was not eunuchs, E-U-N-U-C-H-S. I was very happy to hear that. There's a long story that goes after that; buy me a beer later and I'll tell you about it. So it's critical that you don't get in over your head and try to deliver something that you're not going to be able to deliver, because ultimately, if your cloud platform is not successful for your user base, or whatever your intended audience is, it's not going to be well received.

Here is my completely, and it may very well be completely, bogus guideline for architecture. The reason it might be completely bogus is that there are so many different inputs that may determine that a percentage level of uptime just isn't the right way to pick your architecture; there are lots of ways to pick your architecture. Maybe you have a requirement for a very highly available control plane, but you actually don't do much on the data plane side. An example of that might be a CI/CD environment where you're churning hundreds of container instances through your control plane; you're spinning up lots and lots of VMs fast, tearing them down, and doing testing. You might be critically tied to that control plane, much more so than to your data plane and to what is actually happening in the data plane. So that might change the metrics in terms of how you determine which control plane solution you want to go with.

All right, so let's talk first... I don't have a clock running here; it's a new way of running things. When am I supposed to be done here, 12:40? Everybody's going 12:30 lunch, 12:30 lunch, right, right? So, time check. Let's talk a little bit about services. The services are the components of our control plane. So it's important to understand the different patterns that we can apply to each of the individual services and how we architect the availability of those services. The reality is that OpenStack is very complex, with lots of moving parts. You're going to choose the A, B, C, and D parts of OpenStack and throw away E, F, and G; you may not use the whole suite of OpenStack services. And how you architect each individual service within a highly available control plane solution is going to differ from service to service. So understanding those is critical to how you're going to do that.

OK, I'm going to get back on my soapbox here on standalone. A standalone design: very simple to build, very fast to build, a single piece of hardware, stand it up. But if you are intending to operate the cloud at any level of utilization and capacity, it's important that you bake a little redundancy into the physical platform itself. And this is where I depart from the things we're not going to talk about, and we'll talk about them a little bit. That's things like dual NICs; you might have a bonded configuration to top-of-rack switches with MLAG between them; dual power supplies. Presumably you might have this in a data center where you have battery-backed and generator-backed power and air conditioning redundancy. The physical environment is very important to a standalone server and its availability, because those physical factors are often far more detrimental to a standalone server than to a cluster of servers.
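As a small aside on that physical-redundancy point, here's a minimal sketch of the kind of sanity check you might run on a standalone controller to confirm the NIC bond really does have two live legs. It assumes Linux bonding with the standard /proc interface, and the bond name is an assumption; adjust it for your host.

```python
# Check how many slaves in a Linux bond are actually up. Assumes the kernel
# bonding driver and its /proc/net/bonding interface; "bond0" is a placeholder.
from pathlib import Path

def bond_slaves(bond="bond0"):
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    slaves, current = {}, None
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current:
            slaves[current] = line.split(":", 1)[1].strip()
            current = None
    return slaves

if __name__ == "__main__":
    slaves = bond_slaves()
    up = [name for name, status in slaves.items() if status == "up"]
    print(f"{len(up)}/{len(slaves)} bond slaves up: {slaves}")
    if len(up) < 2:
        print("WARNING: no NIC redundancy left on this standalone controller")
```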
So this is obviously driving a little bit more towards the pets side of the control plane discussion. This is going to be more of a pet for you that you want to take care of and love and cherish and maybe even name, Betsy or Fred or whatever your interest is. Hardware RAID is going to be critical in this case.

If we have a traditional sort of active-passive service architecture, there are a couple of ways you can go about it. Some services have active-passive baked into them already; MySQL and Rabbit are examples of those. In the case of MySQL, MySQL can actively replicate its data to a standby system that can take over the workload for you. In fact, you can have n standby servers that the data is being replicated to, and you can run a read-plus-write primary with read-only replicas based on those. Do please go read the architecture documentation, say for MySQL in this instance, on how to do that, because you can get yourself in a lot of trouble if you do things wrong there. In the case of a service that does not have its own baked-in active-passive solution, you're going to have to provide that, and you do it externally to the service. Traditionally we do that with a load balancer and a virtual IP address. It's an IP address that represents that service, and if that service fails on your active system, it fails over to your standby system. Underneath the hood, you might be replicating the data; the service doesn't know its data is being replicated. That might actually be MySQL replication under the hood, or it might be DRBD, or GlusterFS, or a Ceph store of some sort. There are a number of ways you can do that and provide active-passive availability for the service from outside of it.

There are also a lot of services that are clustered, that have high availability built into them from the beginning. It's typically based on some sort of quorum solution, some sort of gossip or quorum. Some of you have probably heard of quorum, gossip, and Raft algorithms; those sorts of solutions are responsible for clustered systems where a leader is elected, you have followers that follow the leader, and there's active discussion within the service itself about who should be the leader, or the leader went away, who's the new leader. And then the service itself will replicate the data. In the case here, you see the A, B, C data has been replicated across the three systems. It's important to note that in almost every clustered, active-active solution out there, you need to base your services on odd numbers because of the quorum protocols, so typically three, five, seven. If you lose one system out of three, you still have two that can form quorum. If you have five systems and you lose two, you can still form quorum, elect new leaders, and continue on with the service.

And there, I didn't like this slide; there's a discussion on this slide. There are a couple of ways you can physically architect your services. You can go bare metal. Bare metal is easy: throw a couple of Ansible playbooks at it, throw some SaltStack states at it, throw some Chef recipes at it. It's easy to orchestrate and manage. Not a bad way of doing things, but obviously hypervisors and virtualization are very central to cloud environments, and they give us a lot of flexibility.
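To put a number on that odd-versus-even quorum point, here's a tiny sketch of the arithmetic: a majority of n nodes is n // 2 + 1, so going from three nodes to four (or from five to six) buys you no extra failure tolerance, only another node that can fail.

```python
# Why odd node counts: the majority needed for quorum, and how many node
# failures a cluster of n nodes can survive while still forming a majority.
def majority(n):
    return n // 2 + 1

def failures_tolerated(n):
    return n - majority(n)

for n in range(1, 8):
    print(f"{n} nodes: quorum needs {majority(n)}, survives {failures_tolerated(n)} failure(s)")
```

Three, five, and seven come out ahead of four and six, which is why those are the counts you'll see recommended.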
So you might actually put your control plane on multiple hypervisors, and you'll orchestrate the services as individual VMs or shared VMs on those hypervisors in your separate control plane. And you may just use something like bare KVM in this case. Three KVM hypervisors; there's nothing wrong with that, and there's not much that can go wrong with it. If you remember what I said earlier, simplicity is a very good tenet to live by. The more complex the systems get, the more things can fail, and the more things can go south on you. And then ultimately there are some other solutions where you can architect clusters with an embedded control plane, where your control plane is distributed throughout the actual data plane, alongside the VM workloads that are running in your system. We'll talk a little bit more about that. It's something that we do at ZeroStack, and it's a solution that may or may not be applicable for some of you.

So let's take all of that background information, apply it to the control plane, and talk about how these architectures relate to the control plane. The overview again: we have standalone (yes, I'm still going to talk about it), active-passive, active-active clusters, and distributed systems embedded across the cluster. Remember our drill sergeant? Let's level set a little bit here. The drill sergeant is yelling the commands. And I probably confused things by reversing it, but on that slide you've got the data plane, your actual compute nodes, circled in green there. They're the ones that actually do the work. Your control plane is on the right side of the slide as you face it; actually, it's the right side as I look at it here. This might represent a highly available solution in a rack that maybe supports four or five hundred hypervisors. You might have two or three infrastructure racks where you have your top-of-rack switches; you have your infrastructure servers running DNS, NTP, and configuration management services; you might have a key-value store for your configuration management service; you might have your Git services for the code repositories that back your infrastructure; you might have a builder machine for bare metal imaging. Then you have your actual controllers: one, two, three, four, five, six, seven, whatever it is. You might have separate databases. You might have network controllers and storage that back your control plane solution, independent of your compute and storage systems.

Standalone. We talked about the physical requirements for that; I won't go over it again. One of the big notes about a standalone server for your control plane is that you can only scale up so much. You can only buy a big and beefy enough server before you run out of big and beefy servers, at which point you need to scale out. So you need to understand where your cloud platform is going in terms of scale, and that is one of the big bugaboos with a standalone server. Now, one interesting thing: I was going through some of the presentations, and a gentleman by the name of Edgar Magana, and if I've mangled his name, I apologize, I don't know him, is discussing later today OpenStack HA or not HA. In fact, I believe from his abstract he's recommending a standalone solution. So I'm not the only crazy bird out there talking about it as an appropriate solution. This is from Workday, who's operating a 200-node hypervisor cluster on a standalone control plane, if I understand his abstract correctly.
So if you are interested in that as a solution, you might go talk to him and take a look at his talk later today: level four, Ballroom G, at 5:30.

Standalone, very basic. You're running one server; you've got the services running on it. Another note I'd make: go ahead and bake in something like HAProxy. I talk about HAProxy throughout my slides, or I've drawn it on a lot of my slides. There are a lot of other good solutions, and I don't want to be dissing them; specifically, NGINX has a very good solution for distributing HTTP communications and connections among a cluster. It's not just an HTTP web server. So there are other solutions besides HAProxy, but HAProxy is what you're going to see on most of my slides, and it's what most of the OpenStack world is discussing. Chuck that HAProxy VIP in front of your services now, and then if you need to scale, it's a heck of a lot easier to scale those services if you've got a meta-service sitting in front of them. Even on a standalone server you can do some interesting things, even if it's just doing health checks to determine that a service is down and redirecting to an "oops, we're sorry" page of some sort, so the users have some indication that the service is down and aren't just sitting at a browser that's hanging and spinning while they get madder and madder and madder. Users do that; they get mad very quickly.

And why standalone might be good: before we delve into the world of active-active and cluster technologies, there are lots and lots and lots of examples in the real world where things have gone horribly wrong. Some of the names pulled out here are components that are key and critical to a lot of the HA solutions that you might bake and build into any of your cloud platform environments. So if you pick a solution, you should be very, very careful and very aware of how to architect that individual solution, because of things like split brain, specifically, where you have a cluster that splits and doesn't form quorum appropriately, and both sides stay active doing work, and you can actually destroy all of your data, or large portions of it, if you do not do these things correctly. A gentleman by the name of Kyle Kingsbury runs the Jepsen tests, and the Jepsen articles are a very, very good resource. A lot of people are hot on MongoDB; go read some of his articles on MongoDB, what he's done, and how badly he can destroy MongoDB clusters using real-world workload situations. And I'm not just picking on MongoDB; there's a whole raft of them out there. As a matter of fact, he talks about Raft and the Raft protocol.

So maybe I haven't convinced you that standalone's the way to go. You still think I'm a half-quack, half-baked architect, and you need at least something so you can sleep at night, and you want to talk about active-passive. There are a number of solutions for doing active-passive; we touched on them briefly. And as we see here, my buddy Clint Eastwood is representing STONITH in this case: shoot the other node in the head. It is literally what it sounds like: shoot the other node dead and take over all the services. There are a lot of solutions for doing that. In active-passive, most of the services aren't aware of the fact that they actually have a shadow partner over there, a doppelganger just waiting to take over, represent who they are in this world, and duplicate what they're doing. Most of them are completely unaware of that.
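Picking up the earlier HAProxy note, this is a minimal sketch of the kind of HTTP health check a load balancer would run against each API before keeping it in rotation; a failed check is where you'd flip users to that "oops, we're sorry" page instead of a hanging browser. The hostname is a placeholder and the ports are simply the usual OpenStack defaults; adjust both for your deployment.

```python
# Naive backend health check, the sort of probe HAProxy or NGINX would drive.
# Hostname is a placeholder; ports are typical OpenStack API defaults.
import urllib.error
import urllib.request

CONTROLLER = "controller.example.local"       # placeholder hostname
SERVICES = {"keystone": 5000, "nova": 8774, "glance": 9292, "neutron": 9696}

def responds(host, port, timeout=2):
    try:
        urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True        # it answered, even if "/" isn't a 200
    except OSError:
        return False       # refused or timed out: pull it out of rotation

for name, port in SERVICES.items():
    if responds(CONTROLLER, port):
        print(f"{name:9} UP")
    else:
        print(f"{name:9} DOWN -> send users to the 'we're sorry' page")
```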
And that's where things like STONITH and Pacemaker and Corosync and Keepalived and all of these other technologies used for implementing highly available solutions come into play. They are critical, but there are also a lot of caveats and warnings that go with them, which you can go back to at your leisure in the Jepsen articles. If that doesn't chill your blood, I don't know, maybe you should get out of architecture or platform building, because when I looked at some of the failure scenarios, it's pretty interesting. So there are a lot of external services; we've talked about a few of them, and there are a couple more listed here that you might use in an active-passive solution.

Here's how this is going to look: you have a primary mechanism; server one is active right now. You have HAProxy. In this case we're using the example of Pacemaker and Corosync to coordinate the external VIPs and the internal VIPs for the services, pointing to Horizon. In this case it's using Memcached as an accelerator for the token storage, plus Keystone, Nova, and Neutron. You might have some sort of shared storage underneath that, whether it's DRBD, whether it's SAN-backed storage doing replication under the hood for you, whether it's Ceph, whether it's a file store service of some sort, or whether it's simply an NFS mount. If it's an NFS mount, you need to be very, very careful about your readers and your writers, specifically your writers, so that you don't have multiple writers to the same data. Things go bad in that case. In the case of active-passive, I really recommend: let Rabbit, let MySQL, do their active-passive stuff on their own, in their own way. You can very much try to coordinate them from outside of those services; it's not a good idea. Lots and lots of stories about how things like that have gone bad.

All right, you think you've got the winning hand, you're all in: you're going to go active-active. If you want to drive 99-plus percent availability, you need to go fully redundant. But remember, complex HA and reliability solutions might cost you more than you bargained for, so please be careful with those architectures. Understand each of the services, how they operate, and how you're controlling them from an HA perspective. Active-passive looks a lot like, excuse me, fully redundant looks a lot like active-passive. The difference is that instead of one active and one passive, you might have three active. Essentially the services are managed the same way, but a lot more care and consideration has to be taken in controlling those services, ensuring their availability, and making sure they're all coordinating and working. In this case, you might be chucking something like a Ceph store underneath the hood, where you have a couple of monitors and some OSDs, with SSDs for, say, your database; your high-IOPS requirements might be backed by your high-IOPS pools, and your Glance images might be backed by your HDD pools where you don't need the high IOPS.

So if we take that, and remember, at the beginning of the conversation I said I'm going to abstract this: this is a very basic slide that does not have a huge amount of detail, because if you look at Nova in the center of the screen here, Nova itself has, does anyone know off the top of their head, eight or ten ancillary services that make up Nova? And some of those operate and behave differently than others.
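To make that external active-passive machinery a bit more concrete, here is a minimal sketch of the decision loop that tools like Pacemaker/Corosync or Keepalived implement for you: watch the active node, and if it dies while the standby is healthy, fence the old active and move the virtual IP. The hostnames, the health URL, and the fence and VIP functions are placeholders, not any real tool's API.

```python
# Sketch of an active/passive failover loop. Placeholders throughout; real
# deployments should use Pacemaker/Keepalived rather than hand-rolling this.
import time
import urllib.request

HEALTH_URL = "http://{host}:5000/v3"     # e.g. Keystone, reached per node

def alive(host, timeout=2):
    try:
        urllib.request.urlopen(HEALTH_URL.format(host=host), timeout=timeout)
        return True
    except Exception:
        return False

def fence(host):
    # STONITH: power the old active off (IPMI, PDU, ...) so it cannot keep
    # writing to shared storage after the standby takes over.
    print(f"fencing {host}")

def move_vip(host):
    print(f"assigning the virtual IP to {host} and promoting its replica")

def monitor(active, standby, interval=5):
    while True:
        if not alive(active) and alive(standby):
            fence(active)
            move_vip(standby)
            active, standby = standby, active
        time.sleep(interval)

# monitor("ctrl-01.example.local", "ctrl-02.example.local")  # placeholder names
```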
So understanding the individual components of each of the services you're orchestrating and providing availability for is critical to success, whatever that success is: that test-driven architecture, that solution you're putting in place from the beginning. As you can see, it can get pretty complex very quickly: who's talking to whom, who needs to talk to whom, which virtual IP services you're providing, which services point at those VIPs, where traffic is originating from and where it's going to. Those are very important aspects of this.

One of the last solutions we'll talk about is a distributed control plane. It's a fairly big departure from the traditional model. Your control plane is embedded alongside your cluster, where your VMs are instantiated, where the data and the work are being done. There are a lot of caveats and considerations that go into designing something that works that way. One of the benefits is that if you get to the point where your control plane can be managed in that sort of utopian, self-healing architecture, you have some really good solutions in place for managing the health and reliability of those individual services. You don't need those control plane puppies sitting off on the side; you don't have to deal with them separately; you don't have to scale that infrastructure up. You can borrow compute and storage resources from your cluster as your cluster scales up, and you can distribute your control plane further across your cluster.

Now, for a lot of you who have been in architecture and compute, there are a lot of stories about the snake that eats its own tail when you put your control plane in the same environment as your data plane. If your data plane has a bunch of noisy neighbors, your noisy neighbors are disrupting your control plane. Your control plane can't actually do anything if it's disrupted: it can't tell those noisy neighbors to shut down, and it can't actively migrate those noisy neighbors out of the way. So at ZeroStack we've done an awful lot of work around quantifying noisy neighbors and understanding the noisy neighbor problem. Thursday we have a conversation with our CEO and founder, Ajay Gulati, and Notre Kotarov, one of our interns who's done a lot of the actual testing and stress testing and load work. They're going to be discussing the noisy neighbor problem. So if you're interested in noisy neighbors in your cloud platform environment on its own, as a separate discussion, go see that. And if you're interested in how that correlates with a distributed control plane, there's some good information there that you might be able to glean.

Placing your control plane is very critical, and understanding how busy individual hypervisors are within your cluster is very critical. And you're going to need some sort of distributed state service to orchestrate this. You're going to need to be able to do something like service discovery. You're going to need some sort of distributed key-value store or information store that lets you share information across the cluster. Some of the solutions you might operate with are etcd, Consul, Serf, and Atomix. I don't know a whole lot about Serf; it's one of those HashiCorp solutions that has come out recently. Hashimoto, I forget his first name, is very big on how Serf is so much better than Consul and etcd. I don't know a whole lot about it, but it's worth investigating and taking a look at.
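For a feel of what that distributed key-value store buys you, here is a conceptual sketch of leader election via compare-and-set with expiring keys, the same pattern etcd or Consul provide. The in-memory class below is only a stand-in so the example runs on its own; in a real cluster you would use the actual etcd or Consul client instead, and the key and node names are invented for illustration.

```python
# In-memory stand-in for an etcd/Consul-style store: values expire after a
# TTL, and acquire() is a compare-and-set "claim this key if it is free".
import time

class KVStore:
    def __init__(self):
        self._data = {}                        # key -> (value, expiry time)

    def get(self, key):
        val = self._data.get(key)
        if val and val[1] > time.time():
            return val[0]
        self._data.pop(key, None)              # lazily expire stale keys
        return None

    def acquire(self, key, owner, ttl):
        """Claim the key if it is free (or already ours); refresh the TTL."""
        holder = self.get(key)
        if holder is None or holder == owner:
            self._data[key] = (owner, time.time() + ttl)
            return True
        return False

store = KVStore()
for node in ("hv1", "hv2", "hv3"):
    if store.acquire("leader/neutron-api", node, ttl=10):
        print(f"{node}: won the election, runs the service")
    else:
        print(f"{node}: follows {store.get('leader/neutron-api')}")
```

The holder has to keep refreshing its claim before the TTL lapses; if its hypervisor dies, the key expires and another embedded control plane node wins the next election.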
Please do not use ZooKeeper. Distributed: here's an example of a distributed control plane. Let's say you have a very small cloud that you're starting with; you have four nodes. In this case, and yes, the scale of my control plane, the green blocks, is out of scale with the VMs; don't assume that it requires a great big, extra-extra-extra-large VM compared to your VM workloads. But in this case you're still going to have the same traditional components. You're going to have VIPs that correspond to the services. You might have a VIP per service, as opposed to one VIP with a whole bunch of services under it; I wouldn't recommend the latter. I would suggest highly that you have individual VIPs per service. In this case you can see that I have Glance distributed across hypervisors 1, 2, and 4, and I have Neutron on 1, 3, and 4. You can distribute the workload across the cluster and rebalance the cluster based on your needs or your requirements. But we need to make sure we adhere to that rule of three in this case: active-active quorum and gossip protocols need at least three copies, and you don't want four of these; that's bad, bad.

As you grow, it might look something like this, where you have your control plane distributed throughout the cluster. There are some very interesting things you can do with your clusters at this point. You don't have to worry a whole lot about control plane separation if a lot of good work has been put into managing that control plane in an autonomous fashion. It can be very hard to manage if you don't have good tooling for it. We have a lot of experience with that at ZeroStack, in how we've architected our on-prem cloud appliance, which uses this sort of model. And you'll see that in a large distributed environment like this, the green and yellow blocks are control plane services just waiting to do work. So if a hypervisor goes down, oh, I have my numbers mixed up in there too; hypervisor 1 is replicated a few places, I think. Hypervisor 1 there in the center: if it were to die, then the control plane that's managing your control plane, so to speak, would re-instantiate those services on a standby VM that's waiting, embedded throughout the cluster.

That's my summary. Lunchtime is soon; it's 12:44. I have time for questions. Some of you are probably starving and want to bail, but if you have any questions, please step up to the mic, if you would; I've been instructed to enforce that. I am very much happy to stick around here and answer any questions. Thank you very much, everybody. I appreciate your time. We are hiring, as many people are; check out zerostack.com careers. We also have a 30-day demo of the ZeroStack platform if you're interested. I'll leave the slide here for those of you who might want to grab the QR code if you missed it earlier on. Questions? Yes, sir?

Do you have any specific recommendations when it comes to using the Galera multi-master MySQL or MariaDB extensions?

Caution, yes, is the biggest one. I am a big proponent of both Galera and specifically Percona's extensions to Galera. At Symantec we ran everything with Galera, and it took us a long time to get things tuned at the MySQL layer to be reliable and not eat data with Galera. It's very easy to get yourself into a state where you're causing data loss, in particular around cluster initialization after some sort of outage as well. Nothing specific right now. No, yeah. Yes, sir?
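(One footnote on that Galera answer: before you put a node back behind the VIP after an outage, it's worth checking Galera's wsrep status variables to confirm it is synced and part of the primary component. A minimal sketch, assuming the PyMySQL driver and placeholder host and credentials:)

```python
# Minimal sketch: read Galera's wsrep status variables to confirm a node is
# Synced and in the Primary component before trusting it again. Host and
# credentials are placeholders; requires the PyMySQL package.
import pymysql

conn = pymysql.connect(host="mysql-node-1.example.local",
                       user="monitor", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
    status = dict(cur.fetchall())

ok = (status.get("wsrep_ready") == "ON"
      and status.get("wsrep_cluster_status") == "Primary"
      and status.get("wsrep_local_state_comment") == "Synced")
print(f"cluster size: {status.get('wsrep_cluster_size')}, healthy: {ok}")
```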
I'm not sure if this was pertaining to Ceph or what, but you were saying that you need at least three replicas but not four. Can you explain that?

Yes, quorum protocols. All the algorithms that I'm aware of that coordinate cluster activities, do leader election and followers, and manage the cluster rely on what we loosely call quorum, where we have a majority vote. And to have a majority vote in the face of failure, you don't gain anything with four services: if you lose two of them, you no longer have quorum, and an even split can't form a majority at all. If you have three services and you lose one of them, you still have two; you have a majority and quorum. And so the rule, for almost every service that implements any of these solutions, whether it's Pacemaker, Corosync, etcd, or Raft as an algorithm on its own, is that all of those services require odd numbers to maintain quorum in the face of failures, to be able to continue to re-elect leaders and maintain state and cluster health. Does that make sense? Yeah. And that's true with Ceph, Ceph monitors, as well. Ceph mons: you need to have one or three or five or seven; you can't have two or four. Any other questions? Yes, sir?

Hi, I'm going to ask the un-question. Hopefully I won't be ridden out on a rail. So if I look at storage, there's NetApp, and their goal is to develop an appliance that never, ever, ever fails. So I would say that it has internal high availability.

Yes, I've experienced many NetApps.

You could argue that a lot of this presentation is about how you achieve that for a cluster.

Yes.

The un-question is: what if you just said, you know, my app doesn't really need that? My app doesn't even need SDN. I just need to be able to allocate a VM and give it a route to the internet. That's it. Does it make sense to have a hybrid design where you put the control plane up in Amazon, where implementing scalability and high availability for those services is easy, and have it control these dumb, just-a-bunch-of-resources clusters, especially if you have a lot of clusters? Does it make sense to have that kind of architecture?

So, without getting too deep, I would recommend you go read the article that I referenced from Jepsen; there's a last slide here in my presentation with references, and it points directly to that Jepsen article. It's called "The Network is Reliable"; that's the name of the article. You can just search for that, probably for Jepsen and "The Network is Reliable". It is many, many, many pages of discussion about how networks have failed and caused services like these to split-brain and cause data loss, corruption, all kinds of problems left and right. So I would generally caution against that, simply because you no longer have control over the network once you exit your network. And most people who have operated in AWS environments have learned that you can't rely on AWS reliability; you have to do a lot of interesting things to maintain high availability in an AWS environment. If you don't have the physical resources, the data center and the platforms you need, to run the control plane in-house, you might look to outsource that control plane. But I really wouldn't recommend it if you're trying to run your data plane in-house; I would keep them as close together as possible. There are a number of services that are very sensitive to latency, so the latency, and especially variable latency, between your environment and that environment can be very detrimental to the service. In that case, I'll put my sales hat on.
Not that I'm a sales guy, but ZeroStack's solution was architected specifically to address that problem, where you may not have a bunch of infrastructure that you want to manage and run. You just want to add water and, poof, you have a cloud. And that's what our solution does. You have your on-prem equipment, and we have a very nice AWS-like hosted UI platform that allows you to manage and control that environment, delegate control of it to other domains and other projects and tenants, and do that full multi-tenancy model with our platform, with the data plane on-prem and the control plane embedded in it. So there are some other solutions. A lot of people are familiar with the Nebula solution, which was a big failure for a number of reasons; ultimately they closed up shop. We follow a lot of the same model that they followed, but we've done things a lot differently from an architecture standpoint. I also believe strongly that Nebula was just in the market with a solution at the wrong time; that's not necessarily a judgment on whether their architecture was good or bad, but the market dictated a lot of problems for them as well.

Cool, thanks.

Does that help some? It's sort of a non-answer, but hopefully it gives you some information to take on.

It's just the answer I'd expect someone working at your company to give.

In the last slide, you indicated tooling is required for the distributed clustering. From your experience, what kind of tools may be required to manage a distributed cluster?

So at the basic level, it's the ability to manage the services themselves that you distribute across the cluster. You need to be able to monitor the cluster performance and health and metrics to determine what's busy and what's not busy, and where you're going to place your control plane. You need to be able to distribute the information that the control plane of your solution needs; typically that's a distributed key-value store of some sort that you might use to hold data about your cluster: this service is here, this service's health is this, those sorts of things. Coupled with that, you need a strong ability to either live-migrate services out of the way if you need to, or to kill a service and re-instantiate it somewhere else where it's less noisy. There are a lot of tooling packages out there that can do these sorts of things. One example, the name just about escaped me: StackStorm. StackStorm is a solution that started off of one of the M projects, not Monasca, not Monasca, but the distributed key management project. They took that and abstracted out a solution that allows you to build orchestrated workflows that perform operations within complex systems. So if these things happen here, and these things happen here, and these things happen here, do this. And you can build workflows that operate across clusters in a complex environment. So their solution might be applicable for that. At ZeroStack we don't use that; we do our own thing. We have our own tooling that we've built from scratch, in Go and the like.

Is it an open source tool or strictly an in-house tool you have?

Okay, I'm getting the chop-my-head-off signal here, which means I'm out of time. So I'm going to jump off the stage here, and if there are any questions on the sidebar that you want to shoot at me, I'm happy to take them. Sounds like my operators here are hungry too. Thank you, everybody. I appreciate your time.