Good afternoon. I'm going to talk today about extending OpenStack-Ansible with automated operational management. I'm William Irons, an advisory software engineer at IBM. I've been working in cloud for about the past eight and a half years, but I've been working with OpenStack for only a year, probably less than that, and this is actually my first conference.

Today we're going to talk a little about the background and what we're doing, and why we need operational management. We're going to talk about OpenStack-Ansible and how it can be extended. If you haven't used OpenStack-Ansible, it's a great tool for deploying your cloud. I like to think of it as the Rocket Mortgage commercial: push button, get cloud. It makes it really easy to create a cloud instance. Then we're going to show the operational management solution we created, using Nagios for monitoring your cluster and the Elastic (ELK) stack for log analysis and trends, and at the end we'll show a little demo.

So, some background on what we're doing. If you haven't used OpenStack-Ansible before, you should really give it a try. Its main goal is to provide a good, consistent install of OpenStack. There are a lot of products out there for this; for me, OpenStack-Ansible was the easiest to consume. If you haven't tried it before, you can create your own all-in-one (AIO) deployment in an Ubuntu 16.04 VM: go to the quickstart web page, run a few commands, and you'll have a cloud in your VM. Within a few hours you can have a cloud up and running and get the experience of what it's like. And if you haven't used Ansible before, it's pretty easy to pick up.

What we liked about OSA was that day-one deployment: you kick it off, and your day-one deployment is done. But after that, what do you do? From day two forward, how do you monitor it to make sure it's up and running? How do you know the health is staying consistent? When are you going to run out of resources? That's what we looked at: how can we monitor it, and what value can we add for our customers so they can monitor the solution after it's been deployed? A lot of our solution was borrowed from Rackspace. They gave us a head start: they had already extended OpenStack-Ansible to install the Elastic Stack on an OpenStack cluster, so we took that solution and expanded on it.

For background, I'm part of a larger team focused on building an OpenStack cloud toolkit for OpenPOWER, and that team has a bunch of components. I'm on the operations management team, but we also have a team that does hardware setup, meaning bare-metal provisioning. We use Ceph for block storage, Swift for object storage, of course Nova and Neutron for the private cloud, and Trove for database-as-a-service. And while I'm focused on making sure OpenStack runs on POWER, the demo today works on both x86 and POWER little-endian, all on Ubuntu 16.04, using the Newton branch of OSA. Everything I demo today I've verified works with Ocata, although Ocata isn't officially in our toolkit yet; there are a few bugs with the new version of Ansible for Ocata. But the concepts for how you extend OpenStack-Ansible still apply. And right now OpenStack-Ansible Newton only supports Ubuntu 16.04, so that's what we support too.

So, what is OpenStack-Ansible and how can it be extended? First, Ansible: has anybody not used Ansible or not heard of Ansible?
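For reference, the Newton-era AIO quickstart really did boil down to a handful of commands. This is a sketch from memory, so the script names may differ from what the current quickstart page shows:

```sh
# Rough sketch of the OpenStack-Ansible AIO quickstart (Newton era);
# script names may vary by release -- check the quickstart docs.
git clone https://git.openstack.org/openstack/openstack-ansible /opt/openstack-ansible
cd /opt/openstack-ansible
git checkout stable/newton
scripts/bootstrap-ansible.sh   # install Ansible and the OSA roles
scripts/bootstrap-aio.sh       # prepare this one VM as an all-in-one host
scripts/run-playbooks.sh       # deploy the cloud
```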
Ansible is like Chef or Puppet: another open-source automation platform. The big difference between Ansible and the others is that it's agentless. It doesn't require anything to be installed on the endpoints; it uses SSH to configure everything. And OpenStack-Ansible provides Ansible playbooks for the deployment and configuration of OpenStack.

The big thing OpenStack-Ansible does, with its default settings (you don't have to do it this way), is create LXC containers for every service. So on your controller node you have a container for Glance, and for Cinder, and for Horizon. On the compute nodes, I'm not sure, but I think the Nova services run on bare metal and Neutron runs in a container. Having containers makes it easier to separate your logic. The other main thing it does is set up HAProxy, which forwards requests from the front end, from the floating IP, and acts as a load balancer between multiple controllers and the back end, the back end being the LXC containers on a private network. So HAProxy has a dual purpose: it does the high-availability load balancing, and it's also the piece that takes a request from the front end to the back-end LXC container.

So why would you extend OpenStack-Ansible? For the reason we're all here: we want to automate something that's manual today, and do it in a consistent manner, so there's less user intervention and less chance of failure. OpenStack-Ansible is really easy to install, and you want your additions installed the same way. Anything you're adding, any value you're adding to your cloud, should be installed in the same manner, so that if you install ten clouds, they're all installed and configured the same way.

To extend OpenStack-Ansible, there are four main things you need to do, and they're actually really simple, as I'll show you in a minute. First, create LXC containers for the additional services; in our example we install the Elastic Stack, with Elasticsearch, Kibana, and Logstash in three separate containers, and Nagios in a fourth container. Second, write Ansible playbooks for installing your custom services and configurations: playbooks to install Logstash or Nagios or whatever you want to install for your customers. Third, create variable files for user-defined variables: things like the passwords for the systems you're installing, or configuration settings for whatever you want to configure. And fourth, add the HAProxy configuration for accessing the services: you need to tell the front end how to reach your back end. Most of this is just a few lines of configuration.

If you want to create LXC containers for additional services, all you need is the file I show here. Under /etc/openstack_deploy/env.d, you create a YAML file that describes the container you want to create. It looks kind of confusing; it took me a long time to get my head around exactly what this small little file is doing. The first section, the component skeleton, says you have a component. In this case it's Elasticsearch, a single Elasticsearch container: a component elasticsearch that belongs to elasticsearch_all.
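Here's a minimal sketch of such an env.d file, modeled on the Rackspace Elasticsearch example; the group and property names are illustrative, and the walkthrough that follows explains each piece:

```yaml
# /etc/openstack_deploy/env.d/elasticsearch.yml -- a minimal sketch;
# group and property names are illustrative
component_skel:
  elasticsearch:
    belongs_to:
      - elasticsearch_all        # the inventory group our playbooks will target

container_skel:
  elasticsearch_container:
    belongs_to:
      - log_containers           # co-located wherever log containers live
    contains:
      - elasticsearch            # ties back to the component skeleton above
    properties:
      service_name: elasticsearch
```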
The "belongs to" is mainly for Ansible dynamic inventory, so that when you write a role in the next step you can say: I want to install Elasticsearch on all the hosts that have the elasticsearch_all tag. The container skeleton says I have an elasticsearch_container that belongs to log_containers (I'll come back to that in a second) and that it contains elasticsearch. "Contains elasticsearch" refers back up to the component skeleton; that's the association between the container skeleton and the component skeleton. The "belongs to log_containers" tells OpenStack-Ansible to install this on any host identified as a log container. In the default OpenStack-Ansible setup, log containers are associated with the controller nodes; log_containers was originally defined for rsyslog, so wherever the rsyslog container is installed, the Elasticsearch container will be installed too. The properties are arbitrary name-value pairs. I'm not really sure what they're used for, but I've just been copying the other templates and putting in the name of the service that runs inside the container.

I didn't show the physical skeleton here. If you wanted to install these on a completely different box from everything else, you could say "belongs to elastic_containers," then add a physical skeleton saying your elastic containers live on elastic hosts, and then in an additional YAML file say which hosts those are. But once you create a file like this, when you run the setup-hosts playbook (either from the start, before OSA has even run, or running it again afterward), OSA will look at that directory, see the file, and say: I need to create a container called elasticsearch on all my controllers. So you don't need to write your own code to do it. You have the option to write your own code, obviously; we actually did that at first, because we wanted our playbooks to run without OpenStack. But if you do that, you end up with a lot of hassles, like having to make sure you don't conflict with the IP addresses OSA uses. So I'd say it's much easier to create the property file and have OpenStack-Ansible create your containers than to do it yourself.

The Ansible playbooks themselves are pretty easy if you know how to automate what you want to install. The playbooks are what install your services: you've created a container, and now you have to install things on it. It could be as simple as doing an apt-get install, writing out a configuration file, and restarting the service. It could be as complicated as downloading a program, say a Go program, downloading Go, compiling it, and creating the service. It can be as simple or as complex as you want; anything you can automate from the command line, you can automate with Ansible. Once you learn Ansible, it's pretty easy. It's just YAML, and it's a really descriptive language. Where you place your playbooks doesn't really matter, because you'll be calling them from whatever directory you run them from. There is some stuff that comes with OpenStack-Ansible that you can leverage as necessary.
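To give a feel for the simple case just described (apt-get install, write out a config file, restart the service), a bare-bones playbook might look like this. The package, template, and file names are illustrative, not our exact playbook:

```yaml
# install-elasticsearch.yml -- a bare-bones sketch; package, template,
# and group names are illustrative
- name: Install and configure Elasticsearch
  hosts: elasticsearch_all        # the group from the env.d skeleton
  tasks:
    - name: Install the package
      apt:
        name: elasticsearch
        state: present

    - name: Write the configuration file
      template:
        src: elasticsearch.yml.j2
        dest: /etc/elasticsearch/elasticsearch.yml
      notify: restart elasticsearch

  handlers:
    - name: restart elasticsearch
      service:
        name: elasticsearch
        state: restarted
```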
If you want to use an existing Ansible role, the OpenStack-Ansible roles are defined under openstack-ansible/playbooks/roles, or under /etc/ansible/roles: the roles for installing Horizon, Glance, and all the component services. The library points to the different plugins that OpenStack-Ansible includes; I haven't really leveraged those. But the inventory is the important one. When you run Ansible without specifying an inventory file, if you have the Ansible config file in place, it automatically picks up that inventory, which is the OpenStack-Ansible dynamic inventory. It lists all the containers, including the Elasticsearch container we created in the previous step. So if you have a playbook that says "I want to install this on all Elasticsearch hosts," the inventory tells Ansible what containers those are, and Ansible goes and does it.

Once you've written the playbook, you run it using either the openstack-ansible command or the ansible-playbook command. Now you may ask, what's the difference between those two commands? That's the next page. As I mentioned, you have these user-defined variables for wherever you want your user to customize your playbook. The difference between the two commands is that the openstack-ansible command reads in the variable files matching the name pattern /etc/openstack_deploy/user_*.yml and then calls ansible-playbook. So it doesn't do anything magic; it just reads in the variable files you create and passes them to ansible-playbook. OpenStack-Ansible doesn't recommend adding to the existing user_variables and user_secrets files. You can change things, because changing the values is how you change the default OpenStack-Ansible install, but they don't recommend adding new entries to those files, because when they do an upgrade they may wipe out the file or make their own updates to it. So it's safer to create brand-new files. And if you don't want those variables included, you can call ansible-playbook directly and pass the -e option to include your own variable files.

With that, once you've written your playbook and run it, you have your container with your service installed on it. The last thing you need to do is define the HAProxy configuration for accessing the back end. You have the front end here and the containers running on the private network on the hosts, and you need to associate the two. All you need to do is create a variable that describes the additional services you're adding to OpenStack-Ansible. OpenStack-Ansible already has an internal HAProxy services variable that defines all the proxies it configures, and they added an additional variable, haproxy_extra_services, so you can define extra proxies for it to create. These are the minimal settings, and they're pretty simple: the service name for your HAProxy entry; the back-end nodes it should map to, which is where we use that extra variable that gets expanded into all the containers, the container on each controller; the port you want to proxy (here Elasticsearch listens on port 9200, and we proxy it from the host to 9200 on the back end); a separate back-end port you can specify if the two differ; and then the balance type, which inside HAProxy is actually called "mode," so it's not the best name for the variable, I think.
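As a concrete sketch, mirroring the Elasticsearch example from the talk (the file name here is my own placeholder), the variable file might look like this:

```yaml
# /etc/openstack_deploy/user_elk_haproxy.yml -- a sketch; file name is
# a placeholder, values mirror the Elasticsearch example
haproxy_extra_services:
  - service:
      haproxy_service_name: elasticsearch
      haproxy_backend_nodes: "{{ groups['elasticsearch_all'] | default([]) }}"
      haproxy_port: 9200            # front-end port on the floating IP
      haproxy_balance_type: tcp     # becomes HAProxy's "mode" directive
```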
That balance type is whether the port is talking TCP or HTTP; this is where you specify it. There are a lot of other variables available. The best way to see them is to look at the OpenStack-Ansible code, at the template for configuring HAProxy; I don't think all the variables are documented. I took it as a to-do to work with the team to get them documented. Once you create the variable file and run the haproxy-install.yml playbook, it will automatically configure HAProxy, and then you'll have everything set up: your container, everything installed in it, and HAProxy forwarding requests to your back-end container. So it's pretty easy to extend OpenStack-Ansible. It's mostly configuration, not really much code; the only code is whatever you need to install your particular service.

So, what we did with our operations manager. We did three main things with it. We added a Horizon user interface extension, a dashboard to keep track of inventory; I'll actually go straight to that in the demo. System admins need to know where the hardware is located: what rack, what machine type and model, what firmware it's running, what operating system it's running. They need to be able to find the rack if there's a problem, and not all of that information is readily available in the OpenStack user interface. So we wanted to add a physical view showing the physical structure of your rack. This interface is also a launching point for future enhancements: we want to give the user the ability to add or remove nodes from a cluster and to perform firmware updates, meaning take a node down, do updates, bring it back up, all that maintenance of the cluster over time.

We added Nagios to monitor the cluster. There are a bunch of open-source tools you could pick; Nagios has been around forever, and Zabbix is a popular one. We chose Nagios because we had a lot of experience with it, it's been around a long time, it's open source, and you can do all the configuration via config files. So if you're writing an Ansible playbook and want to write out what you want to monitor, it's very easy; you don't need to go into a web UI. Our goal was to monitor the OpenStack cluster with Nagios without the user having to do any Nagios configuration: everything is automatically set up, so they just run the scripts and Nagios is monitoring the cluster, without ever touching Nagios config files. Our solution isn't pluggable yet in the sense of letting you drop in other tools. If I wanted to add Zabbix, I'm not quite sure how; it's something we need to research, and it's a popular one I would definitely like to add.

This chart shows how Nagios works. It's a polling mechanism: Nagios runs on the controllers and calls a command line, check_nrpe, which talks to an agent running on all the endpoints, the NRPE (Nagios Remote Plugin Executor) agent. Depending on what the endpoint is, it can check whether a process is running, check the load on the server, check the Ceph monitors. If you can write a command line that returns zero for OK, one for warning, and two for critical, you can monitor it. That's something we liked about Nagios: you can write any kind of plug-in you want to extend it.
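To illustrate that exit-code convention, here's roughly what the endpoint side looks like: an NRPE config entry mapping a command name to a plugin, and a hypothetical custom plugin script. The command name and the script are illustrative, not our exact checks.

```sh
# /etc/nagios/nrpe.cfg on an endpoint -- illustrative command name;
# check_procs goes critical if fewer than 1 nova-compute process is running
command[check_nova_compute]=/usr/lib/nagios/plugins/check_procs -c 1: -C nova-compute
```

And a custom plugin is just any script honoring the exit codes:

```sh
#!/bin/bash
# Hypothetical custom plugin: the whole contract is the exit code.
# 0 = OK, 1 = WARNING, 2 = CRITICAL
if systemctl is-active --quiet nova-compute; then
    echo "OK: nova-compute is running"
    exit 0
else
    echo "CRITICAL: nova-compute is not running"
    exit 2
fi
```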
For log analysis, as I mentioned before, Rackspace with their rpc-openstack project on GitHub had already extended OpenStack-Ansible to install the Elastic Stack, so we piggybacked on that. It's a very popular open-source log analysis stack, and it has high availability and load balancing built into its design, which is important, of course: we have three controllers, and we want it to stay up if one controller goes down. Nagios doesn't have high availability built into it, which is one reason I'm not particularly happy with it, but we've done some hacks to try to make it highly available.

What the Elastic Stack really helps us with is visualizing your data over time. Depending on what kind of logs you have (Glance, Keystone, Trove), a lot of the APIs log response times, so you can see what the response times of your APIs are and work out trends. Ceph logs reads, writes, operations per second, how much space is used. Whatever information is in the logs, you can visualize. The last point here is that we can't visualize anything that isn't in the logs. It's hard for me, since I don't know every part of OpenStack, to say what's most important to monitor in the logs, so we tried to pick and choose. There's a lightning talk tomorrow about someone else's experience using Elasticsearch, and I'm curious what they found in the logs to monitor, because I'm always interested in improving what we monitor in the OpenStack logs. Obviously, the more you log, the more you can visualize.

This graph shows that while Nagios is a poll mechanism, checking health every 10 minutes or so, log analysis with ELK is more of a push mechanism. We have Metricbeat and Filebeat installed on the endpoints. Metricbeat gathers system data like what you'd get from top (CPU usage, memory usage, process usage, a ton of data) and sends it to Elasticsearch. Filebeat takes the OpenStack logs line by line as they're written, sends them to Logstash to be parsed into individual fields, and Logstash sends them on to Elasticsearch to store. Elasticsearch is the core of the Elastic Stack: it stores all the data and provides APIs for querying it. Kibana is the front end that visualizes the data. In this picture, Logstash and Kibana are stateless; Elasticsearch is the one that stores all the data. And with ELK 5.x, a lot of the Logstash function has moved into Elasticsearch, so we're actually looking at dropping Logstash in a future version if we can.

This is an example of what a log line looks like, what you have to parse, and how it's sometimes not entirely easy to parse. The top four lines are what we get from Filebeat: a log line from host integration-rack-three-controller-one, the raw message string, the source (which is the name of the log file), and the tags. With Filebeat you can say this log file has these tags, so we can query on them: give me the Ceph logs, give me the Ceph monitor logs. That's how the visualizations are built. The part highlighted in green we break down into individual fields using grok regular expressions.
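As a rough sketch of that parsing: the tag, grok pattern, and field names here are illustrative, modeled on a Ceph-style line such as "... 900 GB / 1000 GB avail," not our exact filter:

```
# Logstash filter sketch for a Ceph-style line like "... 900 GB / 1000 GB avail";
# tag, pattern, and field names are illustrative
filter {
  if "ceph-mon" in [tags] {
    grok {
      match => { "message" => "%{NUMBER:avail_space:float} %{WORD:avail_units} / %{NUMBER:total_space:float} %{WORD:total_units} avail" }
    }
    # normalize units so graphs stay comparable (the GB/TB problem discussed next);
    # event.set/event.get is the Logstash 5.x ruby-filter API
    if [total_units] == "GB" {
      ruby   { code => "event.set('total_space', event.get('total_space') * 1024.0)" }
      mutate { replace => { "total_units" => "MB" } }
    }
  }
}
```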
So we can say: the keyword "available," a number, a percent sign; that number is the available percentage. Then a total number, a space, a word, a comma; that number is the total space, and the next token is the total units. One of the first things we ran into with Ceph is that it logs in different sizes. If you start with a small cluster, the space used may be in bytes, kilobytes, or megabytes, and then it grows to gigabytes, terabytes, petabytes. And if you just take those numbers and graph them without looking at the units, your graphs are completely useless. So we have to do a conversion that says: the unit is gigabytes, translate it to megabytes, so we get a consistent graph. Logstash gives us the ability to do that: if we capture the units, we can do the conversion.

So I'm actually going to switch now, because I want to demo this; maybe that will help visualize everything. Unfortunately, I didn't plan on this smaller resolution, so I'll have to make do. This is our Horizon dashboard. I've VPNed in to IBM, to our Austin lab, where we have a test environment. We added this inventory dashboard that shows our rack. Let's see if I dare shrink this, or is that too small? Not much I can do about that. We have three compute nodes, three controller nodes, and three storage nodes, all running Ubuntu, all POWER nodes. As I mentioned, this is our launching point. We don't have it functional yet, but if you wanted to add a new system, you could add it here, and if you wanted to remove a system, you could do it through here. Removing a system would involve getting the workloads off, or at least preventing new workloads from landing on it, and of course the actions differ depending on whether you're removing a compute node or a storage node. Adding a system would mean provisioning it to be included in the cluster.

The field called EIA location you can edit. We don't know it by default when we install, but it's the location of the machine in the rack, which matters for a system admin who has to go replace a machine or work on one. We also have a rack detail section up here that you can edit: what you want to call the rack, what data center it's in, what room it's in, what row in the room, and any notes about the rack, again so the admin can find it. We have plans to add additional racks; that's why it's a tabbed interface, but we don't have an "add rack" button here yet.

For launching Nagios or ELK, you can go directly to the URL, but we also have launch points here. If we click on monitoring, we can launch over to Nagios. And we were up to date until today, I guess. This is Nagios Core, basically out of the box, unmodified, but it's already been configured for the user. Here are the hosts we're monitoring: three compute nodes, three controller nodes, three storage nodes. There's also localhost, which I'm actually going to get rid of; it's Nagios monitoring itself in its container, which it creates by default, and it's not very useful. And then for services, the thing I wanted to show is that you monitor different things depending on what the object is. So for the compute nodes, we monitor the OpenStack compute services.
And if we drill down, you can see we're actually monitoring three different things under the OpenStack compute check: that Nova is running, that libvirtd is running, and that Neutron is running. On the controllers, we monitor each container separately, but we don't surface every single check on a controller as its own check. We lump them together; for Glance, for example, we put all the things you can check about Glance into one check, so we don't overload the user with hundreds of checks. But this shows all the different things: if there were a problem in Keystone, it would show up red here. At the top you can see everything is OK. If I had time, I would go kill Keystone and we'd watch it turn red. So at one quick glance you can see where your problem is. It has all the features of Nagios.

Because we're running out of time, I'll quickly switch to ELK. From the same view, you can launch to ELK. This is our default dashboard for OpenStack. This is an integration test rack that hasn't seen a lot of usage, but it shows the total number of requests, a lot of which come from the OpenStack services talking to each other. OpenStack is constantly making REST requests to itself; I'm not sure exactly why, I don't know every detail of OpenStack. But you can see that in the last 12 hours it handled 4,000 requests, and 2,000 were errors. All those errors are from Keystone, and I want to say that's just business as usual; that's how Keystone works. It's something Keystone probably shouldn't be flagging as an error, because it isn't really one; it just causes noise.

One of the dashboards I like to show is the Ceph dashboard. By default it shows the last four hours, and you can see bytes written over time and operations per second; we're getting this from the logs. With ELK you can easily change the duration. If I select the last 30 days (this cluster wasn't even created 30 days ago), I can see the trend over time from the logs. We created the cluster about 20 days ago and did a lot of testing with Ceph then, and we haven't really touched it since, so the usage (bytes read, bytes written, operations per second) hasn't been much. You can see the available space slowly going down and the used space slowly going up. The available percentage is still well above 80, so we're not worried about running out of space, but if you were, you would definitely see it here. The fact that you have all this data and can look at it over any time period you want is very useful; if you want to drill down to a 15-minute or a five-minute period, you can do that. We had a little spike here in the number of bytes written, though it's not really busy. It's really cool that you can create a bunch of dashboards and have them all automatically change time frame.

I want to show two other dashboards, or actually one. One of the things we like to do is monitor request rate and response time. For the most part the request rate is low, but if there were a problem, you'd see it here. If somebody complained about slow response (is it Nova, is it Neutron, is it Glance, what's the problem?), you'd see it here too. You can come to this view and say: the response time from this component is much higher than normal.
And again, you have that time control, whether you want 30 days or 15 minutes, to drill down on your exact problem. The last thing I'll show is Metricbeat. Metricbeat comes with a bunch of pre-canned, out-of-the-box dashboards that show the data. So here are our nine systems and how much memory they've used. The controllers are all using a bunch of memory, but the compute and storage nodes aren't really doing anything; this rack hasn't been heavily used. I can look at the processes taking the most CPU. If we hover over this, we have one controller taking 33%, and basically the controllers account for most of the total, plus the number of processes running. You can also see this for an individual controller: for integration controller number three, this is the CPU usage over the period, and the memory usage, broken down by process. "beam," which I think is, I was going to say rsyslog, but I can't remember. No, it's RabbitMQ. RabbitMQ was our biggest CPU user; I don't think it's configured correctly. Java is Elasticsearch, and you can look at that, and at the memory usage. And you can click on any one thing: by clicking on that one controller, my graphs automatically change to just that controller. So all the processes on this controller, most of them sleeping, a few running, and the CPU usage and memory usage for that one controller. It gives us a lot of flexibility.

I didn't leave much time for questions, so let me wrap this up quickly. There are future work items we're looking at. Moving to Ocata, of course, is something we're going to do very soon. Elastic Stack 5.3 is now in our master branch, so that's something I'd like to demo that I couldn't today. While I'm here, I'm trying to investigate Monasca and how we could leverage it for monitoring; I really don't know much about it yet. OpenStack-Ansible itself is looking at a monitoring framework for Pike, so I'm trying to learn more about that. And we want to support CentOS and Red Hat, because we have Red Hat shop customers that want to use our tooling, but they want Red Hat, not Ubuntu. There are links here to the code I demonstrated, the Ops Manager, and to our toolkit that I mentioned before, in our reference designs.

And questions. If you have a question, please speak into the mic so it can be recorded.

Question: Have you tried to use OpenStack-Ansible without LXC containers?

I have not, but it should work fine; I just have no experience with that. If you have any more questions you want to follow up on, my email is at the bottom; that's the best way to contact me. I'm wdirons on IRC, but I don't log in very often. Any other questions? OK, well, thank you for your time.