All right, good afternoon, everybody. There are a lot of people here. I'm here today to talk a little bit about operational tools for OpenStack. I think over time OpenStack has gotten much easier to install, more of a solved problem, so I wanted to focus on what we do after the fact. I do want to point out that this is one perspective on it. There are many perspectives, and it's good to hear them all, take them in, and figure out which scenario works best for you. So I'm going to give you the Rackspace Private Cloud perspective on operational tools. I'm going to start by talking about our philosophy: how we approach the selection of tools, and whether we build or consume another open source project. Then I'll talk about a few of the tools we use, particularly around HA, because I know that's been a big topic here. And I'll talk a little bit about OpenCenter. For those of you who are unfamiliar with it, OpenCenter is the Rackspace Private Cloud software that we released back in March.

So, as I mentioned, I'm going to talk a little bit about our philosophy in selecting tools. We only consume open source projects. We have always taken the mantra that if it's not open source, it doesn't exist to us, because for our customers that means they're going to be paying for licenses, or we're going to be passing on licensing costs. We don't want to do that. We would much rather contribute back to open source projects, make them better, and give our customers many options around open source tools.

If we need to modify something, we send it back to the project. We never keep that in-house. As often as we can, we don't fork projects, and when we do fork, we make sure that we at least try to push that fork back. In most cases, those changes have been accepted. An example of this is the sosreport tooling that we use for our support team. Basically, that tool goes and gathers all the logs and says, here's everything you need to look at for a support ticket. We added the OpenStack plugin for it so that we could gather all of the Nova logs, the Glance logs, and the Swift logs, put them in a tarball, and send them to our support team. That's one example; there are many.

Everything we do is open source by default. We have not had a good reason not to open source something so far, and I feel we've been able to interact a lot better with the OpenStack community by taking this approach. It also gives other people the opportunity to come look at what we're doing and tell us where we need to adjust or where we're doing well. Here you can see there are two GitHub repositories: rcbops, which has all of our OpenCenter work, and rcbops-cookbooks, which has all of our Chef cookbooks, which I'll talk about in a minute.

And lastly, everything must be automated throughout the deployment and management stacks. This is a very hard problem. We want to get to the point where you literally type in a couple of values and everything happens. We're almost there, but environments change, and we've had a couple of cases where we still need more information than I think we want to collect. So that's the philosophy we apply when we select tools. Now here are the tools we use.
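Just to make that sosreport example concrete before we dive into the tools, here's a minimal sketch of the log-gathering idea. The real sosreport plugin does much more than this, and the paths assume default package locations:

```bash
#!/bin/bash
# Minimal sketch of the log-gathering idea behind the sosreport
# OpenStack plugin; the actual plugin is far more thorough.
# Paths assume default package locations and may differ per deploy.
OUT="/tmp/openstack-logs-$(hostname)-$(date +%Y%m%d%H%M).tar.gz"
tar czf "$OUT" /var/log/nova /var/log/glance /var/log/swift 2>/dev/null
echo "support bundle written to $OUT"
```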
So for OpenStack itself, we build everything against the stable branch of OpenStack. We feel that at this point in time, while continuous integration and continuous deployment are great and offer some real benefits, they're a really hard thing to manage when you don't own the data center or you're not deeply involved with the customer that's trying to use it. I mean, this is your infrastructure; this is what all of your applications are going to run on. So for us, right now, it just doesn't make sense to do CI/CD, which is what we would need to do if we were going to run trunk. Trunk changes very rapidly, and you need a quick feedback system: what's broken? Are those features you need? Is it going to break? Should I upgrade or not? So for now, we're staying on the stable branch. We will continue to re-examine that as the code base matures.

Next, software configuration management. In Nova, there are somewhere around 2,000 options to configure Nova; about 600 of them are useful. You will find very quickly that you need some way to manage this across all your nodes, because your nova.conf, your api-paste.ini, your logging.conf, and your policy.json have to be on every node. If you're not running some sort of configuration management system, you're going to be SSHing a lot and typing the same commands over and over again, and that's just not helpful. So we do recommend using some sort of configuration management. We have chosen Chef, and I'll talk about that in a minute, but there are many options in this space: Puppet has a great set of manifests that they've made open source and available, and SaltStack also has libraries for OpenStack, as does Ansible.

We chose Chef because, early on and still today, there has been a lot of community support around the OpenStack work going on in Chef. We've been able to partner up with Dell, Opscode, DreamHost, and others, and have really good conversations about how we should configure OpenStack and how we should manage that configuration. We also chose it because, at the time, it had better integrated search. I'll say that carefully: we haven't gone back and re-evaluated everything else, though there's nothing stopping us from doing that. That search lets us find nodes as they come online and say, this is what those nodes need to do, without it having to be predefined; I can do a lot of Ruby-based searches and other things I couldn't do elsewhere. I'll show a quick example of that search in a second. And lastly, it was semantically closer to Ruby than the others; really, we compared it against Puppet at the time, and I know that's changed a little bit since.

We do have cookbooks for every one of the core services, and as new core services come along, we add cookbooks very quickly. We keep those maintained through releases; we're going through a cycle right now where we're upgrading everything from Folsom to Grizzly, so that is our maintenance path on those. And again, we make them available, so if you're running Chef and you want to check them out, you're more than welcome to.

All right, high availability. We chose a high availability architecture that doesn't use shared storage, on purpose. Going back to one of my earlier comments: unless you're deeply integrated with the customer, the options around shared storage are basically unlimited.
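Here's that quick search example, using Chef's knife command line. A minimal sketch; the role name is hypothetical, purely for illustration:

```bash
# Query the Chef server for nodes carrying a given role as they come
# online, and report an attribute for each match.
# 'nova-api' is a hypothetical role name used for illustration.
knife search node 'role:nova-api' -a ipaddress
```

The same kind of search can run inside a recipe to, for example, build the backend list for a load balancer template without predefining the node set.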
A customer can choose any shared storage platform they want, and we wanted to give people as much flexibility as possible without tying them to a specific shared storage architecture. So we had to make some trade-offs. We've all heard a lot about Corosync and Pacemaker; we're not using Corosync or Pacemaker or DRBD, and I'll talk about the reasons for those decisions in a minute. So we did take a slightly different approach than what I think we hear most of the time.

When you think about OpenStack, all of the API services are stateless; they don't really make sense in an active-passive model. There are really only a few services that do, and those are the database, the message queue, and Glance. Going back to the last talk, for those of you who were in here, we talked a little bit about Glance images: if I'm going to do HA, I need to make sure that if I lose the primary node, all of my images are in the next location, so that when I flip over, everything is still there.

For MySQL, we chose to do MySQL master-master. I don't know that this will be our long-term solution, but it is for now. We did take a hard look at Galera, and basically what we learned was that the number of nodes required to run a Galera cluster was greater than what we wanted to increase our hardware footprint by. You're looking at at least three nodes in that case, all dedicated to just the database. That wasn't really something we wanted to explore at the time, so we went ahead with master-master.

For our queuing system, we use RabbitMQ. This is fairly standard in OpenStack; I think most people still do. I did hear a little bit about Qpid and ZeroMQ this week already, so it's good that we're going to have options around queuing, but for now we're using RabbitMQ. There's a problem with RabbitMQ, though, in that mirrored queues weren't available as configuration options in OpenStack until Grizzly was released a couple of weeks ago. So we had to make a choice, and the choice we made was to expect that the message queue will fail, and when it does, all of the messages in that queue will be lost. The reason we made that decision is that most of the messages in that queue, short of creating networks with something like Quantum, can be reissued. What we found in working with customers was that if they issued an API request to create a server and that server didn't come up, the behavior was: I'm going to click the button again. So as long as I fail over quickly enough and I'm ready to take that second hit, it's almost not noticeable. There is a little bit of a challenge here: we did have to change the TCP keepalive settings in Ubuntu and CentOS, I believe, because RabbitMQ and Nova were hanging onto their connections for too long. We had to drop the TCP keepalive values way down to be able to do that quick failover, so that when the second push came through, we would be ready.

For Glance images, we also had to build our own tooling here. We took a hard look at glance-replicator, which ships as part of the Glance package. It's a great tool, but it assumes that you're going to have two completely standalone Glance servers, and that was not something we wanted to do. We looked at Glance as an active-passive model, so we're using a bit of an rsync-based configuration to make that go. But we do a lookup against the Glance database first.
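To make the master-master setup concrete, here's a minimal sketch of the kind of MySQL settings involved. These are illustrative values, not our actual configuration; the auto-increment settings are what keep the two masters from generating colliding primary keys:

```bash
# Hypothetical master-master settings for one of the two nodes; the
# peer would use server-id = 2 and auto_increment_offset = 2.
# Replication itself (CHANGE MASTER TO ...) is omitted from this sketch.
cat > /etc/mysql/conf.d/master-master.cnf <<'EOF'
[mysqld]
server-id                = 1
log_bin                  = /var/log/mysql/mysql-bin.log
auto_increment_increment = 2
auto_increment_offset    = 1
EOF
```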
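And as for that TCP keepalive tuning: the exact values we used aren't in this talk, so these numbers are purely illustrative of dropping the keepalives way down so dead connections are noticed quickly:

```bash
# Aggressive keepalives so stale RabbitMQ connections are detected
# fast after a failover. Values are illustrative; tune to match your
# own failover window.
sysctl -w net.ipv4.tcp_keepalive_time=30
sysctl -w net.ipv4.tcp_keepalive_intvl=1
sysctl -w net.ipv4.tcp_keepalive_probes=5
```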
On that image sync: we see what images are there, and we only sync the ones we need to when new ones are created. We don't constantly transfer the images back and forth, because that would be silly. So again, that was one we had to write ourselves. It's open source; I think it still needs to be moved into our rcbops repo, but it will be there soon.

All right. As I mentioned, we don't use Corosync, Pacemaker, or DRBD. We found a great tool called Keepalived that uses VRRP. For those of you who are unfamiliar with the Virtual Router Redundancy Protocol, it was a new one to me about six months ago too, but it's a really good tool that provided a couple of benefits when we first examined it. First of all, the use of VRRP allowed for very quick failovers. Keepalived also has a load balancing service that we thought we could use. The problem with that load balancing service, though, was that it couldn't run in one-armed mode without setting the gateway of all of the compute nodes to go through the Keepalived server, and you can imagine the throughput on that would not be so good for the Keepalived box. So instead we're using HAProxy for our API services, and I'll talk about that in a second. Primarily, we use Keepalived for MySQL and RabbitMQ. This is your traditional IT active-passive failover: if all of a sudden I start losing ping on one node, it flips to the other, and when it does, all the services are already active and running, because we've already run the Chef configuration there. So it's very nice and very seamless.

The other reason we didn't choose Corosync is that, from our perspective, and going back to my philosophy slide, it was not easily automatable. There are a lot of manual steps in there. It also took four or five additional IPs as opposed to one or two. So our decision point there was to go with something we could fully automate and that used fewer resources on the front side.

For the API services, and I touched on this before, we put HAProxy in front of them as the load balancer, and we basically say that those API services can live anywhere in your private cloud. At any point in time, I could take one of my compute nodes and put the Nova API there, then go to another one and put the Glance API there. Those are stateless; because they call back to the database, there's no reason they need to be on a single centralized node by themselves in an active-passive model. It's a slightly different architectural conversation than we're used to in IT, in that we're now saying we're going to solve the HA problem with scale. If I lose one of my APIs, it's okay; my load balancer still has four or five left, and I can throw another one down just in case. So that's where we're at with the API services.

I do want to touch on Swift briefly. The main operational task around Swift is managing the ring. We took an approach with the ring that sets up a Git repository on the initial install of Swift, so every time one of our Swift nodes checks in, it can check out that copy of the ring and apply it across the cluster. That makes it easy to roll back changes. And lastly, as new drives are detected in Chef, they're added to the registry, and then we update a script with the suggested change.
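To make that ring workflow concrete, here's a rough sketch of the kind of steps the automation performs. The repo path, IP, device, and weight are all made up for illustration:

```bash
# Add a newly detected drive to the object ring, rebalance, and commit
# the result so every node can pull the same ring files from Git.
cd /etc/swift
swift-ring-builder object.builder add z1-10.0.0.5:6000/sdb1 100
swift-ring-builder object.builder rebalance
git add object.builder object.ring.gz
git commit -m "add sdb1 on 10.0.0.5 to the object ring"
# A bad change is one 'git revert' away, which is the point.
```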
Now, I'll say we're not at the point where we feel completely comfortable with Swift automation. Swift is a very big thing: damage to the ring causes damage throughout the system, and more than likely you've just lost data. So we've approached Swift very carefully, very much with kid gloves on, in that we don't want to break anything there.

Testing. Testing is a little bit different for system administrators. We've built a lot of Jenkins automation around our scripting, around everything we do, such that we check syntax, we check functionality, we make sure the APIs are working, we make sure HA is working. There's a whole series of things you want to do that you probably didn't have to do previously in a traditional IT deployment; there wasn't a big Jenkins job to go and do that. So we go through and run all of our unit tests, we do full deployments of OpenStack, and then we check each of the APIs and each of the command line tools. This happens every time we commit to any one of our scripts. Any time we make a change to the automation, this whole set runs, which can take a while; we have some fairly long runs that take up to two hours to get through, and we're working on cleaning those up. But we do make it publicly viewable, so you can always see where we're at, what we're testing, and how things are looking.

So I'm going to break this up a little bit, because there are a lot of people here. I'm going to talk about OpenCenter, but are there any questions on the first part?

I'm sorry, I can barely hear you. It's not a full proxy. The question was: are we using Keepalived as a full proxy, so that all traffic has to go through Keepalived? The answer is no, we're not. Keepalived is only there to manage the failover of the service, basically the VIP between the two hosts.

Any other questions? Yeah, in the back. So that's a good point: I actually left the scheduler out, and it is in there. There's a gentleman right here who can answer that question; he spent a lot of time in the scheduler, and I did not. Let me repeat the question real quick. The question is: how do you manage the non-REST-based services, like nova-scheduler, that are out there and that you really have to care about? The answer that was given was: just run multiple nova-schedulers, and they can all accept requests and schedule appropriately. And nova-network we run on every compute host; that's how we spread that HA out there.

What else? Any other questions? Oh, yeah. So let me make sure I understand the question: how do we distribute the images across racks? Do we load balance at the top? The answer is, we actually let OpenStack handle that for us. If I upload an image to Glance and compute nodes need to spin up that instance, the first time they'll download the image, store it, and from there on it acts as a cache. We don't do anything to pre-seed that cache up front; we just allow OpenStack to naturally deploy it where it needs to go. Yeah, you're still going to have a bottleneck at the top for the moment.

So I'm going to talk about OpenCenter, and again, I'll have some more time for questions at the end of this as well. So we put all this together. We said: we're using Chef and Keepalived, we're doing all these things.
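To give you a flavor of those post-deploy API checks, here's a minimal sketch. The real Jenkins jobs are far more thorough, and the credentials file path here is an assumption:

```bash
#!/bin/bash
# Minimal post-deploy smoke test: exercise each core API via its CLI.
# Assumes an openrc file with credentials for the deployed cloud.
set -e
source ~/openrc
keystone tenant-list > /dev/null && echo "keystone OK"
nova list            > /dev/null && echo "nova OK"
glance image-list    > /dev/null && echo "glance OK"
echo "all API smoke tests passed"
```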
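And since Keepalived keeps coming up, here's a minimal sketch of the kind of VRRP VIP definition involved. The interface, router ID, priority, and address are illustrative, not our actual configuration:

```bash
# Hypothetical keepalived.conf fragment for the active node; the
# passive peer would use 'state BACKUP' and a lower priority, and
# the VIP floats between them on failover.
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance mysql_vip {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    virtual_ipaddress {
        192.168.1.100
    }
}
EOF
service keepalived restart
```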
We said we'd make our cookbooks publicly available and kind of see how this goes. Very quickly, we learned that in order to use them, there's a lot of Chef expertise required, and a lot of OpenStack expertise as well, and many places just don't have that yet. I mean, it's coming, particularly the OpenStack expertise, but some places are Puppet shops and don't know anything about Chef. So we wanted to make this as simple as possible.

We wrote an abstraction layer that allows us to do somewhat complex things now, and more complex things as we go forward. It simplifies and speeds up the deployment, lowers the overhead, and minimizes, again, the need for internal OpenStack or Chef knowledge. It has an API and a command line, and it also has a very nice UI, which we spent a lot of time on, that allows us to drag and drop things into where they should be. So really, the idea with OpenCenter is to make this as simple as possible and abstract everything from the software configuration management and OpenStack down.

It specifically targets the deployment of hypervisors; we use KVM in Private Cloud. It also manages OpenStack controllers. In our world, a controller is where all of the OpenStack services run: to get to an HA model, we've come up with a controller model where all of our OpenStack services run on two hosts, and that includes our database and everything else. Over time, that will probably split out as we see more scaled-out deployments. We're not trying to cover any of the areas that OpenStack covers itself. If you look at things like multi-tenancy, managing images and templates, and deploying virtual machines, we very much lean on OpenStack to solve that, and for the most part it has. There are also already dashboards for that: in Horizon, there's both a user (or project) dashboard and an admin dashboard. So that was already a solved problem.

These are the components of OpenCenter, and I'm not going to talk through all of them, but there are a couple of interesting concepts here. The concept of a solver is basically to take a series of instructions or requests and determine the path to get to the state you want to be in. So if I say I want to become an HA controller, and I click "enable HA" on the container and drag a node in, how do I make that secondary node a controller? The solver will go through and determine the set of Chef roles and recipes that need to be applied, plus any extraneous work that needs to happen around Keepalived, if there is any, and then present that path back. Now, if it doesn't have all of the information it needs, it will prompt you. It will say, in the case of HA: I don't have the VIPs, can you please give me the three VIPs I need to do HA? If it does have all the information, it presents the solve plan back, executes it, and puts the nodes in the state they need to be in.

The other thing we found is that, while Chef has a really good state management system, we needed a state for the entirety of the cluster, not on a per-node basis and not for just some parts of it. We needed to be able to say: this is an OpenStack cluster and I need to know the state of the entire thing, or these are multiple OpenStack clusters and I need to know the state of each one individually. So that feeds into the solver.
So the solver will check with the state machine and say: what's the state of this node, and what do I need to do? That's how it comes up with the solve plan. Now, what's not depicted here is a couple of fun terms for you. We have what's called the adventurator; it's a choose-your-own-adventure system. Basically, these adventures are Python or Bash scripts that execute some operational task on a node. So for instance, if I want to update OpenCenter, there's an adventure for that: I would click on the adventure, and it would update OpenCenter, download new cookbooks, and do whatever else I needed it to do. So that's the OpenCenter system, kind of at a glance. There is an agent that runs on every node, and that's how the communication happens between OpenCenter and the nodes themselves.

Here are a couple of screenshots of the UI I mentioned before. On the left, you can see all of my available nodes and service nodes. I have an OpenCenter server, and I'm assuming that other one is a Chef server. Those go into the service container, which is a specially controlled container that you can't just drop anything into. A node is actually moved in there when you run certain adventures, like "install Chef server," or when the OpenCenter server is installed, it automatically moves itself in. Now, you can see over on the left-hand side I've dropped down a cog that says "create a cluster," and over here it pops up. Again, this is the solve plan in action: OpenCenter doesn't have any of this information; it doesn't know what the public network CIDR is or what bridge interface to use, so it requests that from the user.

Now, we turned HA into a two-click thing, and I wish it were that simple underneath. Basically, on the container you can see there's a little cog; you would drop that down and click "enable HA" on the container, and at that point I would be able to take any of my available nodes and drop them into the infrastructure bucket. In this example there are two nodes, which is what's currently allowed by OpenCenter. The plan, though, is to extend that so you could have more than two nodes and spread your HA out if you so desired.

So again, with the adventure system we're automating operational tasks. These come, right now, from a combination of our support team working with customers and giving us feedback, and the experience we've had in running OpenStack for customers. We've taken those and turned them into Python and Bash scripts that we can execute on any host while keeping the whole thing in sync. And lastly, it's extensible. It does have its own language, which is great, and it looks very similar to JSON, which helps. You can see at the bottom I can put in some criteria: in this case, if I do have a Chef server and my ID equals one, then I can create a cluster. If I don't have a Chef server, I can't create another cluster, and this adventure is going to fail.

So that's OpenCenter at a glance. We have a few minutes left, so any questions?

We do; we were very heavily focused on correcting some issues we were having getting compute deployed, so we will circle back and do Swift integration as well. No, those are bare metal nodes, and that's actually a really good question.
So the reason why we went the OpenCenter route: I don't know if any of you are familiar with the Alamo ISO that we released last year from Rackspace, but with that ISO, you had to put it into a server and boot it, then go to the next one and boot that one, and it was really painful. So the original idea around OpenCenter was: let's just make this a curl-to-bash script, or a package that you can install as part of your install. If you already have a PXE system in your network, you can very simply put that curl-to-bash line at the very end of your preseed, and then on every node you'll get the agent. That's why we went that route, but those are bare metal servers in this case.

Other questions? Yes, right. So the question is how we break up our infrastructure for private cloud, as opposed to public cloud. They're actually two entirely different offerings at the moment, so we don't use any of the public cloud's infrastructure. Every customer comes to us and buys X number of servers, and we put private cloud on that. We are, for the moment. The secondary question there is: you're basically isolating every OpenStack deployment, so they're not federated together or federated to our public cloud? The answer is, that's correct.

So that's a great question that I don't have a good answer for you on, unfortunately. The question was: with the cell concept, are there future plans for us to plug into the public cloud? That one I really can't address. Other questions?

All right, so the question is: for our controller-side services like Glance, do we run those in virtual machines or deploy them directly onto the metal? We can go either way. By default, we would install them outside of a VM, directly onto the system. Now, our support team has done some deployments where they put the Chef server in a VM and OpenCenter in a VM, and run all of the Nova controller services on the box itself. That kind of breaks up the pieces. The follow-up question is whether there's any preference between VMs and bare metal, whether performance is better going one way or the other. Not on a small scale. The one thing I'm always very cautious about, particularly when talking about putting all of those services in a VM, is that I'm reducing the number of resources I have available. If I have a large private cloud and I'm getting hit a lot with API requests or moving images around, that VM is going to get very tired, so I would rather have that running on bare metal in that situation. Now, if it's a small private cloud, there's nothing stopping you from putting it in a VM. It's just one of those things where you have to consider the size, how much traffic there is, and how much talking back and forth they're going to be doing.

Other questions? Yes. So Quantum exists in our cookbooks; it is not deployed by OpenCenter today. Yeah, I think we were just referencing every available API service when we put that document together. Yeah, sorry.

Okay, so: not for metering or metrics gathering in the Ceilometer fashion. We use collectd inside of a private cloud to gather all of the host information, so collectd and Graphite today are deployed with OpenCenter, and that's all set up. We are taking a hard look at Ceilometer right now and seeing how we plug that in, but we don't have a good answer for that today either. Yes.
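Going back to that preseed idea for a moment, here's a hedged sketch of the shape of it. The URL and flags are hypothetical, not the actual OpenCenter endpoint:

```bash
# Hypothetical one-liner appended to the end of a PXE preseed so every
# provisioned node picks up the agent automatically. The URL and the
# --server flag are illustrative only.
curl -s https://example.com/opencenter/install-agent.sh | bash -s -- \
    --server https://opencenter.example.com:8443
```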
So the question is: do we have any capacity management tools, so you can see how long your capacity will last? OpenCenter will give you a view of that, but it's not a full view, because it doesn't show you all the VMs that are running or what the resource limitations are. We're going to be presenting more information through that UI over time, including things coming from collectd. So we have some of that, but we don't have really good capacity management right now; hopefully soon. Other questions? All right, well, thank you guys very much. Thank you.