Hello everyone. I'm Balaji, this is Shiva, and this is our colleague; we all work for HP as scalability and performance engineers, mainly on OpenStack and cloud systems. We are here to share our experience of how we quickly triage common OpenStack issues with simple automation scripts written in Ansible.

In our day-to-day work we bring up multiple scale environments and scale up all the major components: Nova, Neutron, Cinder, and so on. In doing this we face a lot of problems in setting up the environment; it can be a configuration issue, a setup issue, or a hardware issue. Issues can happen anywhere in our scale testing, and troubleshooting them takes a good amount of time for every engineer or operator involved. The analysis time varies with the engineer's skill set: a beginner or intermediate OpenStack user will take longer to analyze an issue. OpenStack has grown huge, with a lot of services and components, and as it has grown, troubleshooting has become correspondingly complex, because there are many interdependent services. Take a Nova boot failure, for example: I need to check multiple services, my Nova services, Neutron, the database services, and so on. When we debug further, we start with the logs, looking for traces with the request IDs and the UUIDs, and from there we move on to the other compute services, the database services, the messaging queue; it goes on. If you attended our earlier session on Neutron troubleshooting, you will have seen how complex it is. Troubleshooting is never easy.
There are no predefined steps for any given issue, no "just follow this and you're fixed"; that's not the case with any sort of problem. We need to analyze each defect individually, and we should have checkpoints for everything. As I said earlier, for a Nova boot failure we need to go step by step to find what exactly caused the problem and fix it, and for each issue there is a different set of checkpoints. Depending on the complexity of the issue, troubleshooting takes time, and for a beginner or an intermediate user, analyzing each component definitely takes a while. For an operator or an admin on a production environment, these issues have to be addressed quickly: taking the Nova boot example again, if boots suddenly stop working we can no longer spawn or manage VMs, so it must be fixed as soon as possible. We were spending a lot of effort and time on this kind of troubleshooting, so we had a thought: why don't we automate it? The steps we follow in troubleshooting are well defined; for a Neutron problem there is a known set of checkpoints to work through, and every engineer, based on their expertise, has their own defined steps.
So we collected all of these troubleshooting checkpoints for each component: Nova, Neutron, Cinder, and so on. For example, if a VM is not pinging, what are all the checkpoints I run through? I need to check the VM, the services, and many other things until it's fixed, and that eats up my time. The question floating around us was: how can we automate this? Each of us had an individual style of troubleshooting, so we collected everyone's checkpoints for each specific issue and added them to a playbook. Whenever we execute that playbook, it runs all the checkpoints in parallel. Say my ping is not working: it immediately checks the services, the DHCP agents, and everything else, all at once. If you do this manually, you have to go through everything one by one, first the services, then the logs, then each individual component, and that eats up time; no admin or engineer wants to spend their time that way. So this is the better solution: we add all the checkpoints for a specific issue to a playbook, it runs in parallel across every component related to that issue, say a no-ping or no-IP failure, it collects all the information from the dependent logs and services, and it puts out to you where exactly
is the issue, right in front of you, within no time. This reduces the effort and time for any engineer or operator who uses it.

You may ask why we chose Ansible over other programming languages. Ansible is modular: we can create our own custom modules and reuse the same module across multiple playbooks, and it is very simple to create a playbook with minimal programming knowledge, so even an administrator or a test engineer can build their own playbook from their troubleshooting flow and find exactly where the problem is. For a distributed system like OpenStack, Ansible is a good fit for finding issues across multiple nodes; an issue can occur at any point in time on any node. Running a playbook is simple too: if you know which playbook to run, you just trigger it to find the root cause of the problem in the environment.

Using Ansible's support for custom modules, we combined all the troubleshooting steps we do and created our own custom module for each step. These custom modules are kept separately per service, so for example the Nova checks are grouped together in the Ansible directory. We can then group these custom modules into plays: a play might fetch some details from OpenStack, check a component's status, or check a service. All these plays can be combined into a playbook, and we created playbooks based on the common issues we encounter in OpenStack. For example, if there is a VM IP failure, we take the VM-IP-failure playbook and trigger it; the playbook will run through all the steps
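As an illustration of the layout described above (the exact file and directory names here are assumptions for this sketch, not taken from the talk), the custom modules can live in a `library/` directory next to the playbooks, which is where Ansible looks up adjacent custom modules by default:

```
openstack-triage/
├── hosts                      # inventory: controller, network, compute groups
├── vm_ip_failure.yml          # playbook for the "VM has no IP" issue
├── vm_error_state.yml         # playbook for the VM error-state issue
└── library/                   # custom modules, grouped per service
    ├── get_vm_details.py      # wraps "nova show", returns JSON
    ├── get_network_details.py
    ├── check_port_binding.py
    └── check_dhcp_namespace.py
```

With this layout, any playbook in the top-level directory can call the modules by name without extra configuration.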
based on the roles of the nodes: it may run on the controller node or the compute node to find where the issue is, and in turn it uses our custom modules as well as the built-in modules available in Ansible.

These are some of the custom modules we have created. One is get-VM-details: whatever troubleshooting we start, we first need the instance details, so we run this module to fetch everything about the instance. It basically wraps the `nova show` command, gathers the information, and stores it in JSON; that JSON can then be used by other modules to drill down further. Similarly there is a module to get the network details, which returns the network ID, state, and so on; its output can feed other modules too. There is another module to check the port binding status, and another to check whether the DHCP namespace is available. Shiva will now explain how the modules are created.

Hello everyone. I thought I'd explain one of the scenarios we've taken, which is the most common in our OpenStack deployments. We usually encounter VM IP failures: the VM comes up, but internally it doesn't get an IP from DHCP. So I'll explain how we actually drill down this flow and how we automate it. First we do a set of tasks on the controller node. As my colleague said, we get the VM details; first we check the VM status, since obviously the VM should be active, and then we fetch the full details, which include the hypervisor information, the network information, the network ID, and the IP address assigned to the VM. We get all that information and store it in JSON. As you all know, if you have worked
on Ansible and written a custom module, every module should return JSON, so we export the result as JSON, which is then consumed by other modules. Next we check the network details: the segmentation ID, the kind of provider network (is it a VLAN, a VXLAN, or a GRE network?), and the security group rules assigned to the VM. Then we check the port binding; the port binding status is one of the first checks we do. I'll just skim through this, because I think most of you attended the last session with the detailed information, so I won't go deep into it. We check whether the port binding status is active, and when everything is fine here, we move on to the network node.

On the network node we check the Neutron services: whether the DHCP agent, the metadata agent, and the others are all active. Then we check the namespace. As I said, we already have some details from the controller node, so from the network ID we can work out which DHCP namespace is assigned to that network; once we have the namespace name we check whether the namespace has actually been created, and then we can do some operations inside it. We also check the Open vSwitch ports. Most of the issues we encounter in Neutron are configuration issues: one of the bridges may not be configured correctly, one of the ports may be missing, or the integration bridge or tunnel bridge may not be there. We validate all of this and make sure there is no Open vSwitch configuration issue, and then we will
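The DHCP-namespace checkpoint described above can be sketched as a small pure function (a simplified stand-in, not our actual module: Neutron names its DHCP namespaces `qdhcp-<network-id>`, and here the namespace list is passed in as an argument rather than read from `ip netns list` on the node):

```python
def dhcp_namespace_present(network_id, namespaces):
    """Return True if the DHCP namespace for the given Neutron
    network exists in the list of namespaces on this node.

    Neutron creates one DHCP namespace per network, named
    'qdhcp-<network-id>'. On a real network node the `namespaces`
    list would come from running `ip netns list`.
    """
    expected = "qdhcp-%s" % network_id
    return expected in namespaces


# Example: the namespace for the first network exists, another does not.
namespaces = ["qdhcp-11111111-aaaa", "qrouter-22222222-bbbb"]
print(dhcp_namespace_present("11111111-aaaa", namespaces))  # True
print(dhcp_namespace_present("33333333-cccc", namespaces))  # False
```

In the real module the result would be reported back as JSON so the playbook can fail fast when the namespace is missing.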
similarly check the VXLAN tunnel configuration: whether the tunnel is established between the compute node and the network node, and vice versa. On the compute nodes we check the Neutron agent services, the same kinds of agents, and the iptables rules; in the iptables rules we check the DHCP traffic, whether ports 67 and 68 are open, and there are a bunch of other rules we need to validate. Then we check the Open vSwitch ports, that the VXLAN tunnel is established, and that we are able to ping the VXLAN endpoint IP.

Since we get the network details at the start, we know what kind of network it is. If it is a flat network we skip some of these custom modules; if it is a VLAN, we don't have to check the VXLAN flows, and so on. We have made it modular, so we can add some checks and drop others. These are only some of the steps; we haven't put everything on one slide because it doesn't fit on one page. These are the checkpoints we identified: the information was provided by engineers who worked on these issues, and we collected it. Once we had it, we realized we follow the same procedure, the same checkpoints, every time we encounter this issue. So we can take these modular components, automate them, and keep them as simple Python scripts that can be used in Ansible. We made each of them one custom module. Once we arrived at all these custom modules, we put them together into plays, so one play will actually run
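The branching just described, skipping the tunnel checkpoints for flat and VLAN networks, can be sketched like this (a minimal illustration; the checkpoint names and the `select_checkpoints` helper are made up for this example, not taken from our actual playbooks):

```python
# Checkpoints that apply to every provider network type.
COMMON_CHECKS = ["vm_status", "port_binding", "security_groups",
                 "dhcp_namespace", "ovs_bridges"]

# Extra checkpoints that only make sense for tunnelled networks.
TUNNEL_CHECKS = ["vxlan_tunnel_established", "vxlan_endpoint_ping"]


def select_checkpoints(network_type):
    """Pick the checkpoints to run for a given provider network
    type ('flat', 'vlan', 'vxlan', or 'gre')."""
    checks = list(COMMON_CHECKS)
    if network_type in ("vxlan", "gre"):
        checks += TUNNEL_CHECKS
    return checks


print(select_checkpoints("flat"))   # no tunnel checks included
print(select_checkpoints("vxlan"))  # tunnel checks appended
```

Keeping the selection in one place like this is what lets the same set of modules serve flat, VLAN, and VXLAN deployments.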
on the controller node and another play on the network node; we have categorized them according to the type of node, so each play runs all the custom modules related to that node type. This is what we followed, but it is open to interpretation: if you have separate nodes for the Nova components or a separate node for the Neutron components, you can design it that way. We followed this layout based on the HP Helion OpenStack we used. For the network node we grouped the relevant modules into one play, and similarly for the compute node. Once we have all these plays, we hand them to the test engineers, or whoever is using OpenStack and wants to triage. They can call these custom modules, stitch them together into one piece, add and plug in their own steps, and form their own troubleshooting procedure. After this, no programming is required: once the custom modules are written you don't need to know a programming language, you just say "get the VM details for this ID", "check the DHCP namespace for this network", and so on. We put that in a procedure format and use it whenever we want. That constitutes one playbook. We have named the playbooks after the issues: if it is a VM IP failure, there is a VM-IP-failure playbook; if it's a VM error state, there is a playbook for that; and so on.

Now I'll skim through how a module is actually written. It's very simple and just requires basic Python knowledge. We write one function per module; the rule of thumb is to write only one
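As a sketch of what such a playbook could look like (the module names, host group names, and parameters here are illustrative assumptions, not our exact files; `vm_id` and `network_id` would be passed in as extra variables with `-e`):

```yaml
# vm_ip_failure.yml -- illustrative sketch of one issue playbook
- name: Controller-side checks
  hosts: controller
  tasks:
    - name: Get instance details (wraps "nova show")
      get_vm_details: vm_id={{ vm_id }}

    - name: Check the port binding status
      check_port_binding: vm_id={{ vm_id }}

- name: Network-node checks
  hosts: network_nodes
  tasks:
    - name: Check the DHCP namespace exists for the network
      check_dhcp_namespace: network_id={{ network_id }}

- name: Compute-node checks
  hosts: compute_nodes
  tasks:
    - name: Validate Open vSwitch bridges and VXLAN tunnels
      check_ovs_config: tunnel_type=vxlan
```

Each play maps to one node role, exactly as described above, and Ansible runs each play's tasks across all the hosts in that group in parallel.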
task, one troubleshooting checkpoint, per module. That keeps things modular, so a module can be reused in multiple playbooks: for a Nova boot failure you also need the instance details, so the same get-VM-details module serves several playbooks. I would always suggest writing one checkpoint per module. We used the basic functions Ansible already provides, like fail_json and exit_json, which handle things pretty well: you say this check failed and this is the message to throw, report whether anything changed, and exit properly. It comes to roughly 10 to 11 lines of code per module.

Once the modules are written, you can go to the playbook. This is how the playbook looks: the parts in green are the modules we have already written. It is as simple as saying get-VM-details and passing the VM ID, then get-network-details with the network name, and so on. You can also specify the hosts, where each play has to run: localhost, the controller nodes, or wherever. Each such component constitutes one play; we have three plays here, one running on localhost, one on the network nodes, and one on the compute nodes. That is how we write the procedure. On the right you can see the hosts file, where we specify the network nodes and compute nodes; you can have any number of each. What we faced in scale testing is that to triage any issue we have to go to multiple nodes; if you have configured high availability for the
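To show the shape of such a module without requiring Ansible to be installed, here is a stdlib-only sketch of the exit_json/fail_json pattern that the real AnsibleModule class provides (the real functions also terminate the process; the checkpoint itself, a port-binding test on an already-fetched port dict, is a simplified stand-in for our module):

```python
import json


def exit_json(**kwargs):
    """Mimic AnsibleModule.exit_json: every module reports its
    result as JSON (the real one also calls sys.exit(0))."""
    kwargs.setdefault("changed", False)
    print(json.dumps(kwargs))
    return kwargs


def fail_json(**kwargs):
    """Mimic AnsibleModule.fail_json (the real one exits non-zero)."""
    kwargs["failed"] = True
    print(json.dumps(kwargs))
    return kwargs


def check_port_binding(port):
    """One checkpoint per module: is the Neutron port bound and ACTIVE?

    `port` is a dict of the kind the get-network-details step would
    have stored earlier; in a real module it comes from the API.
    """
    if port.get("binding_vif_type") == "binding_failed":
        return fail_json(msg="port %s failed to bind" % port["id"])
    if port.get("status") != "ACTIVE":
        return fail_json(msg="port %s is %s, expected ACTIVE"
                         % (port["id"], port.get("status")))
    return exit_json(msg="port binding looks fine", port_id=port["id"])


# A healthy port passes the checkpoint and reports success as JSON.
check_port_binding({"id": "p1", "status": "ACTIVE",
                    "binding_vif_type": "ovs"})
```

The `msg` string set in fail_json is exactly what surfaces in the playbook output, which is why a clear message per checkpoint matters.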
network, you will have to go to multiple network nodes, because the namespace exists on several of them, so you specify all the network nodes in the inventory. Since we are using Ansible, it is very easy to run all the checkpoints in parallel on all the nodes. You add the compute nodes and the other details, run the playbook, and it tells you what the actual error is.

I'll show you some screenshots of what we get. This is one play where a check failed because of a configuration issue: no tunnel bridge is configured, so that checkpoint fails, and it shows the message we put in the custom module. Similarly, if you have issues in multiple places, you can find them all in one single run. And here is a sample screenshot where everything passed: you get all greens, which means the cause could be an issue we have not yet traced. That's the other benefit: you don't have to repeat the known triage steps again and again. Whatever is known is already covered; you run the playbook, and it tells you these checkpoints are fine, so whatever remains is something new, possibly a code issue or a bug. That's how you figure it out.

That's all I have right now. These are our references: we took the example flow from this blog and then automated it, so you can have a look. There is also a Git repo where I have put the sample files I wrote for the ping scenario; have a look at the repo, and contact us if there is any issue, we can help you with how to automate this procedure. Any questions?