How's it going, everybody? I'm Wade Lewis. And I'm James Denton. We're here for "No valid host was found: troubleshooting tracebacks and other common failure scenarios." As Wade introduced himself already, I'm James Denton, principal architect for the Rackspace Private Cloud team. And I'm a Rackspace Private Cloud architect as well. We work in the support team for the product.

What we're here to talk about is troubleshooting some common OpenStack issues, including tracebacks, some common Nova issues, and common Neutron issues. After the presentation we will upload these slides to SlideShare at the link you see there. We've got a lot of slides and a lot of ground to cover, so we apologize if we talk a little quickly, but we want to make sure we get through everything, and if we have some time we'll do some Q&A.

Great. So I think we all know that OpenStack is a complex system. There are a lot of moving parts with limited visibility into problems via the API. The old adage of turning it off and turning it back on again doesn't really work here anymore, so you'll find yourself in the weeds troubleshooting things on the infrastructure nodes when you have problems.

So what is a traceback? If you're new to OpenStack and you're coming from, say, a vSphere/ESX environment, you may not be too familiar with them, but if you run OpenStack you've probably seen them before. Basically a traceback, or stack trace, is a log of an exception that was captured during the execution of a program. When it's caught, you can go through and see the functions within the application that triggered that exception, and that will help you discern what may be going on with the error you're receiving. Deciphering one, if you're new to this, can be a lot like reading the Matrix; we've all seen that before, and in this case the traceback looks very similar. And if you don't rotate your logs, you might want to start now, because they will fill up pretty quickly.

Some tips on reading a traceback: you get this traceback and you're like, where do I start? The easiest way is to start at the bottom. You can see here, just in this example, that MySQL wasn't available. So that's a great place to start: if you're not familiar, go to the bottom, take a look, and see what you're working with. (Yep, slides will be available again.)

Okay, so let's move into Nova. "No valid host was found": what the heck does that mean? A lot of times that error is very ambiguous, so we need to get to the bottom of it. Likely reasons for this condition occurring are that there really aren't any hosts available for whatever reason, networking issues on a compute node, or a lack of resources: disk, RAM, CPU, etc. The good news is that these days the reporting is a lot better. I haven't actually deployed Liberty yet, but with Kilo, Juno, and Icehouse it has improved, so if you're on an older version of OpenStack, upgrading may help that situation a lot; this particular error may be more detailed.

All right, so you go to spin up an instance and it ends up in an ERROR state. The VM has clearly failed to launch, but a few things we notice here, other than the error state, are that you were given a network address. You have an IP here, and that tells us at least that the Neutron API is functioning properly, enough to get you an IP address, that is. To get a little more information, you hop into `nova show`, and you see there's a stack trace there. But within that output we also see that we have a libvirt instance ID, which is good, and we know that we have a compute node, as outlined in the red boxes. That being the case, if you know the instance has been scheduled to a compute node, the easiest thing to do is to hop over to the compute node.
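The "start at the bottom" tip from a minute ago is easy to demonstrate with a generic Python sketch. Nothing here is OpenStack-specific; the MySQL message is just a stand-in for the failure shown on the slide:

```python
import traceback

def last_line(tb_text: str) -> str:
    """The final non-blank line of a traceback usually names the root cause."""
    return [l for l in tb_text.strip().splitlines() if l.strip()][-1]

def connect_to_db():
    # Stand-in for the "MySQL wasn't available" failure from the slide
    raise ConnectionError("(2003) Can't connect to MySQL server")

try:
    connect_to_db()
except ConnectionError:
    tb_text = traceback.format_exc()

print(last_line(tb_text))
# ConnectionError: (2003) Can't connect to MySQL server
```

The lines above the last one tell you the call path that got you there, which is the next thing to read once the bottom line has pointed you at the failing subsystem.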
Once you're on the compute node, start taking a look at some of your nova-compute logs and your Neutron agent logs. In this case, when we created this particular error, it logged another stack trace in the nova-compute log, and this time we're getting a Nova exception: "Unexpected vif_type=binding_failed." If you're a new operator, you're like, I don't know what this means. I've been doing Linux for quite a while, 15 or 16 years now, and the first time I saw it I wasn't sure what it was either. Basically, when Nova creates a virtual machine, it has to plug each of the virtual network interfaces into a virtual bridge, and a virtual network interface is known as a VIF. That's a prerequisite you may need to know; you can Google it, and usually that info is out there. Nova uses drivers specified in nova.conf to connect that interface to the virtual bridge, and when Nova is unable to interface with the network agent and properly plug that port in, the vif_type is set to binding_failed. That's where you get that error.

So now that we know it's a network issue, the next place to look is your network logs. In this case we're using the Linux bridge agent, so we'll take a look there, and what do we find? Yet another stack trace. But this time, at the bottom of the stack trace, you see "Error 19: No such device." That's a little better; as Linux or Unix administrators, that's not uncommon to see across different applications and scenarios. The problem is we don't know what device it is. It could be any type of device: block devices, network devices, you name it. So what is the device?

All right, now we're going to issue a bit of a warning: James and I are operators, and we like to dabble in Python, but we're not Neutron developers or Python experts, so go easy on us. We go back to the traceback we mentioned a minute ago that shows the "No such device" error, and if you look at it, you see a function called get_interface_mac from the utils.py file. get_interface_mac seems pretty self-explanatory: it's trying to get the MAC address of an interface, but we don't know which interface. So we dive in a little, and you can see the code here is pretty basic. You don't need to know a lot, but we took a look at it and said: there's no error trapping here. Let's go ahead and throw that in and see what we get back. So we made a quick change to the file, adding a little exception handling (it was literally three lines), and restarted the Linux bridge agent to see what it would log after that. By adding some exception handling, we can now see that it's complaining about interface "eht2". We both looked at it and said, okay, that's dead simple: someone fat-fingered the interface name "eth2". The Linux bridge agent doesn't know what that device is, it can't do anything with it, and you get that error. So we go into the Linux bridge configuration file, in this case the ML2 configuration file, and we can see it there in the underlined section: "eht2", which is obviously incorrect. We make that change, restart the agent, and in this case everything worked properly.

But what do you do when you do a `nova show` after launching an instance and there isn't a compute node listed? There isn't a libvirt instance ID, although there is a stack trace again, so that's good. In this particular case (it's kind of hard to see from there) it's a "No valid host" error: you get a 500 back from the API, you don't know what caused it, and now you don't have a compute node to actually go look at. So you're thinking, now what? Well, if the scheduler can't actually schedule the instance, then you should probably check the scheduler logs and the conductor logs and see what you've got. In this case it's a pretty easy fix.
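We don't have the exact diff on the slide, but the spirit of that three-line change can be sketched like this. Everything here is illustrative: the real get_interface_mac in Neutron's utils.py queries the kernel for the device, which we fake with a dictionary so the sketch is runnable on its own.

```python
import errno
import logging

logging.basicConfig(level=logging.ERROR)
LOG = logging.getLogger(__name__)

# Fake stand-in for the kernel's view of network devices (illustrative only)
KNOWN_DEVICES = {"eth0": "fa:16:3e:aa:bb:cc"}

def get_interface_mac(device):
    try:
        return KNOWN_DEVICES[device]
    except KeyError:
        # The added error trapping: log WHICH device failed before re-raising,
        # instead of letting a bare "Error 19: No such device" bubble up
        LOG.error("Unable to get MAC address for device: %s", device)
        raise OSError(errno.ENODEV, "No such device", device)

try:
    get_interface_mac("eht2")   # the fat-fingered interface from the slide
except OSError as e:
    print(e)    # [Errno 19] No such device: 'eht2'
```

The point of the exercise isn't the code itself but the technique: when a traceback doesn't tell you which object it choked on, a temporary log line at the failing call site will.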
Take a look at your services, and make sure that they're actually working. In the top pass there, with the red box, we see that in this case we have load-balanced API nodes, and when the service checked in, everything passed: it's labeled as up, it's working. But when we do another service list, specifically a `nova service-list`, to get an idea of what the service status looks like, we can see it's down, and we also notice a big difference in time. Right away, that's a problem.

So what could cause that? When agents check in, the database is updated with the time of that check-in, and other services, such as the scheduler, use that time to determine if the service is available. The scheduler determines the availability of a host by comparing the difference between its local time and the last-seen time of the compute node. By default, if that difference exceeds 60 seconds, the service is marked down. In short, we've got "NTP, NTP, NTP" listed there: the time was off, the scheduler determined the host was down, and it didn't schedule an instance to the host. In this case the nova-compute service appeared down when there wasn't anything wrong beyond that particular scenario. So when in doubt, make sure the time is accurate across all your nodes, with NTP or any other method.

So let's move on to Neutron, and I'll let James explain some of the scenarios we've got there. Great, thanks Wade. So Neutron itself is composed of many different services and agents that are responsible for constructing and maintaining the virtual network, and you can see here that failures can occur at any of those points: the DHCP agent, L2 agent, or L3 agent, just to name a few.

So let's start with the DHCP agent. When instances are created, Nova will create a port through the Neutron API, and an IP is assigned to that port. It's a statically assigned IP based on the subnet that you've specified. The DHCP agent itself is responsible for creating a network namespace and a dnsmasq process inside that namespace, which is responsible for providing DHCP services to the network. When the agent fails, it can result in failures to get an initial lease or failures to renew a lease.

Now, when a port is created, the DHCP agent updates dnsmasq with those network attributes, and it stores them in a host file at /var/lib/neutron/dhcp/<network UUID>/host. That's a file that's passed to dnsmasq so that dnsmasq knows which IPs are eligible for leasing. Here we can see a MAC address that corresponds to a Neutron port, the hostname of that particular instance, and the IP address. When an instance sends out a lease request, dnsmasq will hand out the lease and then update its active-lease database in memory. By default, dnsmasq will log the DHCP cycle messages in /var/log/syslog or /var/log/messages, depending on your distribution. Here we can see the DHCPDISCOVER from the client; the DHCPOFFER, which is the server presenting an IP address for use; the client turning around and sending a DHCPREQUEST; and the server acknowledging that request.

If you have issues obtaining an IP, start with packet captures on a couple of different devices. You want to start on the tap interface of the instance on the compute node, to verify that the instance is actually sending DHCPDISCOVER messages. Then you can work your way down to the bridge interface on that respective node, and then the physical interface. You'll turn around and run the same type of captures on the network node with the DHCP agent, to verify that the messages are actually making it through the physical and virtual network stack. When you run a tcpdump, you want to make sure you're listening on UDP ports 67 and 68. You should see the full DHCP cycle of the four messages, unless you're running an overlay network type, in which case those packets may be encapsulated in a VXLAN or GRE header.
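As an aside, the host-file entries are just comma-separated values, so you can sanity-check them with a couple of lines of Python. The sample line below is made up but follows the MAC, hostname, IP layout shown on the slide; the exact fields can vary by release and by the dnsmasq options in use:

```python
def parse_host_entry(line):
    """Split a dnsmasq host-file entry into its MAC, hostname, and IP."""
    fields = line.strip().split(",")
    return {"mac": fields[0], "hostname": fields[1], "ip": fields[2]}

# Illustrative entry in the style of /var/lib/neutron/dhcp/<network UUID>/host
entry = "fa:16:3e:ab:cd:ef,host-10-0-0-5.openstacklocal,10.0.0.5"
print(parse_host_entry(entry))
# {'mac': 'fa:16:3e:ab:cd:ef', 'hostname': 'host-10-0-0-5.openstacklocal', 'ip': '10.0.0.5'}
```

If an instance isn't getting a lease, checking that its port's MAC actually appears in this file is a quick first test before you break out the packet captures.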
So here we have a working example. In the first screenshot on top, I'm performing a tcpdump on the tap interface of an instance, and I can see the full DHCP cycle there. In a non-working example, running the same packet capture, I'm only seeing my instance sending out DHCPDISCOVER messages that go unanswered. In this case I had actually moved an interface out of the bridge to force a failure. That's not a likely scenario in production, but it does show that if there is an agent issue and it wasn't able to connect an interface, this is the type of output you might see when you perform a packet capture.

Great. So now that we're all good with DHCP, let's talk about a live bug that could be affecting a lot of you without you realizing it. Let's say you spun up an instance with no error, but you realize the instance is not available. Chances are DHCP is not working, and you start troubleshooting. So we ran a packet capture on the interface, and we see that the DHCPDISCOVER, or the DHCP renewal, is making it to the agent, but the agent is sending back a DHCPNAK packet. What this likely means is that the DHCP agent was restarted and the active-lease database was deleted. Whenever an instance receives a DHCPNAK, it drops the IP off its interface and restarts the entire DHCP lifecycle, which will likely result in a brief, momentary downtime while that happens.

This issue was addressed in a patch. However, that patch causes some issues of its own when you're running multiple DHCP agents. The patch enables a flag for dnsmasq called dhcp-authoritative, which expects one DHCP agent on the network. When you have more than one, what happens is that when an instance goes to renew its lease, since the DHCPREQUEST packet is broadcast, all of the DHCP agents are going to see that packet. The one that originally provided the lease will go ahead and renew it, and the others will reject it. So what ends up happening? Same thing: the instance drops its IP and starts the whole process over again. Right now there's a patch in Liberty, which should be backported at some point, where Neutron will pre-populate the lease database when the DHCP agent is started. So just like the host file, where the agent pre-populates the entries so dnsmasq knows what IPs and MACs are available for leases, it does the same thing with the lease database, so that you never have to worry about leases getting dropped out of memory.

All right, so let's talk about some L2 agent issues. The L2 agent is responsible for programming the virtual switching infrastructure and also for applying security group rules to Neutron ports. A failure of the L2 agent can result in a lack of instance connectivity, security group issues, and an immediate error state during a nova boot. So when you're troubleshooting L2 connectivity issues: here we have an example of an OVS environment with one compute node and one network node, and there are some interfaces with stars on them. Those are where you're going to want to run your packet captures to make sure the traffic is actually making it through. You can see we traverse tap interfaces, qbr bridges, some veth pairs, the integration bridge, and so forth, then the physical switching infrastructure, and back up through the network node.

One thing you want to be aware of in an OVS environment is what's called the integration bridge. All of your instance tap interfaces and some of your network agents plug into this single bridge. When you have different networks, each network gets its own local VLAN ID, and that VLAN ID is specific to that node. You'll see in the example that a particular port here has a VLAN tag of 2. That VLAN tag of 2 corresponds to some real segmentation ID for a network that was created by a user, and there are flow rules on the bridge that translate that local VLAN to the physical VLAN or the overlay segmentation ID.
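To make that local-VLAN translation concrete, here's a small Python sketch that pulls the rewrite out of a dump-flows-style entry. The flow text is illustrative only; real `ovs-ofctl dump-flows` output carries more fields and its exact format varies by OVS version:

```python
import re

def vlan_rewrite(flow_entry):
    """Return (local_vlan, provider_vlan) from a flow that rewrites VLAN tags."""
    local = re.search(r"dl_vlan=(\d+)", flow_entry)
    provider = re.search(r"mod_vlan_vid:(\d+)", flow_entry)
    if local and provider:
        return int(local.group(1)), int(provider.group(1))
    return None

# Illustrative flow: local VLAN 2 rewritten to provider VLAN 30 on egress
flow = "priority=4,in_port=1,dl_vlan=2 actions=mod_vlan_vid:30,NORMAL"
print(vlan_rewrite(flow))  # (2, 30)
```

The key idea is that the local tag (2 here) is meaningless off the node; only the rewritten provider VLAN or segmentation ID exists on the wire.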
If you ever see a tag missing, you'll want to restart the OVS agent on that node, because every port in the integration bridge should have a VLAN tag. And if you ever see a tag of 4095: 4095 is what they call a dead VLAN, and that's sort of an error condition for that particular port. In this case, I had an instance on a compute node in local VLAN 2, the one I just showed you on the last screen. I did a `neutron port-delete` and deleted that port out of the database, and immediately, on the compute node, the port went into VLAN 4095. You can kind of consider that a security mechanism: the instance was still alive on the compute node, still plugged into its bridges, but because I deleted the port, Neutron automatically moved it out of the local VLAN into a dead VLAN, so there's no jeopardy of any sort of security issue.

Some useful commands to run when you're troubleshooting or just administering OVS: `ovs-vsctl show` is going to give you a high-level view of the virtual bridges on that node, and it will also show you the local VLANs particular to that node. `ovs-ofctl dump-flows <bridge>` is going to show you the flow rules on the bridge that you've specified. The flow rules are responsible for manipulating traffic and help determine how that traffic should be forwarded across the network. You'll often see a flow rule that translates that local VLAN ID to the real VLAN at the data link layer, or forwards traffic on to, say, the tunnel bridge, where there are rules that translate it to the segmentation ID of the overlay network. And lastly we have `ovs-ofctl show <bridge>`, which is going to give you a port-level view of the bridge you've specified. It shows you the port IDs known by Open vSwitch for every port that's plugged in, and you'll also see those port IDs referenced in the flow rules themselves.

Great. So now we have a Linux bridge environment: one network node, one compute node. The interfaces with the stars on them are where you can run packet captures if you're experiencing connectivity issues; here we have our little dot moving through there. In a Linux bridge environment, you're going to have a bridge for every network. So rather than OVS, which has a single bridge for that particular host, Linux bridge is going to have a bridge for every network that you create. In this example we have network A on top; that's a VXLAN network. The bridge name starts with "brq", and the ID that you see after that actually corresponds to the Neutron network ID. Inside the bridge we have a tap interface, which could correspond to a DHCP agent, a router, or an instance. Then we have a VXLAN interface, vxlan-48, where 48 corresponds to the segmentation ID of the VXLAN network. Network B is the same idea, a completely different network, with eth2.33 (that's eth2 with VLAN 33) and two tap interfaces that correspond to instances.

Some useful commands when you're working with Linux bridge: `brctl show` is going to show you a high-level view of the virtual bridges on that node. `bridge fdb show` displays the bridge forwarding database, so it's useful for knowing how a MAC address is reached; you can consider it akin to the CAM table on a physical switch. And `ip neighbor show` will show you the ARP cache on that node.

Great. So now here we are again: vif_type=binding_failed. We saw this error in the Nova example that Wade talked about. You usually see this when you're booting an instance or attaching an interface. It's typically the result of a Neutron misconfiguration or an agent issue, and it's not just limited to instance ports, as we'll show on the next slide. So I apologize, it's a little bit hard to see, but here we have a DHCP port and an L3 router port that are both in a binding-failed status.
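That naming scheme makes it easy to map devices back to Neutron objects. Here's a sketch of the convention as we've seen it; the 11-character truncation matches the agent behavior in the releases we've run, but treat the exact details as an assumption rather than a guarantee:

```python
def bridge_name(network_id):
    # "brq" + first 11 chars of the network UUID, which fits the kernel's
    # 15-character interface-name limit
    return "brq" + network_id[:11]

def vxlan_interface(segmentation_id):
    # e.g. vxlan-48 for segmentation ID 48
    return "vxlan-%d" % segmentation_id

def vlan_interface(physical_interface, vlan_id):
    # e.g. eth2.33 for VLAN 33 on eth2
    return "%s.%d" % (physical_interface, vlan_id)

print(bridge_name("a1b2c3d4-e5f6-7890-abcd-ef1234567890"))  # brqa1b2c3d4-e5
print(vxlan_interface(48))         # vxlan-48
print(vlan_interface("eth2", 33))  # eth2.33
```

So when you're staring at `brctl show` output, the stub after "brq" is the start of a network UUID you can match against `neutron net-list`.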
What happened here is that a tenant created a network, enabled DHCP on a subnet, and attached that network to a router. What they realized was that their instance was not able to get an IP, and when they assigned an IP manually inside the console, they were not able to hit the gateway. Taking a look at the L2 agent log on the network node hosting those two devices, we can see an error in the log that says "VXLAN is enabled, a valid local IP must be provided." In the ML2 configuration file, when you're configuring interface mappings on the host, you'll also configure VXLAN information like the VTEP address for that node. When the Linux bridge agent starts, it takes that IP and tries to configure point-to-point tunnels between the hosts, and when the IP is either wrong or not applied to an interface, you're going to see this error. What happens is that Neutron itself is not able to interface with the agent to determine how to plug in the DHCP or L3 ports, and you get a binding-failed error. Correcting that problem and restarting the agent should result in successful port bindings.

Now, one of the problems we have here is: what do we do about the existing ports? Our tenant is still having issues, so what do you have to do to resolve this? (It's probably a bug and we should report it; we're not there yet.) To fix a router port, what you need to do is unschedule the tenant router from the L3 agent, then reschedule it to the L3 agent, and that in turn is going to create a new port that should bind correctly. The DHCP port is a little bit different. When you unschedule the network from the DHCP agent, Neutron actually puts the port into a reserved status. The reason it does this is that if unscheduling were to delete the port, like it used to, and let's say you had spun up a thousand instances and exhausted your subnet pool, then if you were to schedule that network back to the agent, there would no longer be any IPs to assign to that DHCP server, and you'd have an issue where you can't hand out addresses. So in this case you delete the port, reschedule the tenant network to the DHCP agent, and a new port gets created. I should add that while the port goes into reserved status, rescheduling the network to the agent does not update the binding status: it remains in a failed state, which is why we need to delete the port here.

So when you're troubleshooting the L2 agent, a couple of things you want to do. Make sure that the respective L2 agent on the host is configured properly and is actually running, and not in a constant restart loop: upstart will continue to restart the service after a failure, and you'll notice the PIDs changing, so that's something to watch for. You want to make sure that Open vSwitch is running; a lot of the time OVS is not running on a host, you try to schedule an instance to it, and you get an immediate failure, so that's something to look at. And the agent logs are stored in /var/log/neutron on that host; here we can see neutron-linuxbridge-agent.log.

All right, last but not least, the L3 agent. The L3 agent is responsible for creating network namespaces for each router that a tenant creates. That router, in turn, provides routing services between tenant networks, and it also provides NAT services to instances. When it fails, you'll find that you're not able to route traffic, and floating IPs themselves may be inoperable. So inside the router namespace, iptables is running, and for every floating IP that gets created, there are corresponding iptables rules for that floating address. Sorry, it's hard to see, but here we have a couple of arrows pointing to iptables rules specific to that floating address: two DNAT rules and an SNAT rule. All other traffic that isn't handled by a floating IP is automatically source-NATted by the router.
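To illustrate the shape of those rules, here's a hedged sketch that renders DNAT/SNAT rules like the ones the L3 agent installs for one floating IP. The chain names and exact rule layout are approximations, not authoritative Neutron output; they're here to show the two-DNAT-plus-one-SNAT pattern from the slide:

```python
def floating_ip_rules(floating_ip, fixed_ip):
    """Approximate iptables NAT rules for one floating IP (illustrative only)."""
    return [
        # Inbound: traffic to the floating IP is DNATted to the fixed IP
        "-A neutron-l3-agent-PREROUTING -d %s -j DNAT --to-destination %s"
        % (floating_ip, fixed_ip),
        "-A neutron-l3-agent-OUTPUT -d %s -j DNAT --to-destination %s"
        % (floating_ip, fixed_ip),
        # Outbound: traffic from the fixed IP is SNATted to the floating IP
        "-A neutron-l3-agent-float-snat -s %s -j SNAT --to-source %s"
        % (fixed_ip, floating_ip),
    ]

for rule in floating_ip_rules("203.0.113.10", "10.0.0.5"):
    print(rule)
```

When a floating IP misbehaves, you're grepping the namespace's `iptables-save` output for exactly this pair of translations.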
If you have issues with floating IPs not operating properly, you can actually go into the namespace and make sure those rules are there, and if you want, you can add some by hand just to verify that the action will work properly. But as soon as you restart the L3 agent, all of those hand-added rules will get wiped.

So when troubleshooting the L3 agent, you want to make sure that it's actually running and that it's configured properly. Perform packet captures on the tap interfaces inside the router namespace to make sure the traffic you're trying to reach is actually making it through the virtual switching infrastructure and into the namespace. If your floating IPs aren't working properly, make sure that the iptables rules that should exist for that floating IP are actually there. And you can also check the log in /var/log/neutron to see if there are any stack traces being reported.

A little more Neutron here. A couple of things you want to be aware of when you're running overlay networks like VXLAN or GRE. If you're running the default Ethernet MTU, typically 1500, on your physical switch ports, which usually carries over to the virtual switch ports as well, and you use an overlay like VXLAN, there are headers added to each packet that can cause you to exceed the MTU and that traffic to be silently dropped. Normally this will manifest itself as issues connecting to instances via SSH: when you enable verbose logging, you'll see the handshake start, but in the middle it just hangs. What you can do is update your subnet to pass DHCP option 26 (interface MTU), try dropping the MTU to about 1450, and then reboot the instance. More than likely that's going to fix it. An alternative to lowering the MTU on your instances may be enabling jumbo frames on the VTEP interfaces of each host and on your switch ports.

And don't forget security groups, too. I can tell you that we've troubleshot a lot of issues that turn around and end up being the lack of a security group rule, or a misconfigured one. So if you find that the plumbing looks good and you're at your wit's end, try creating a security group rule and applying it to the port, one that enables connectivity from, say, the router namespace or the DHCP agent namespace, and test connectivity with ICMP or something else that's not impactful.

Some other things you may only see at scale are race conditions caused by a lot of services doing things in parallel. We used to see this a lot in Havana and Icehouse, but you don't really see it as much in some of the newer releases. There's a case where, if you spin up hundreds of instances, Neutron has to go through and set up dnsmasq host files and do all of these things while Nova is simultaneously spinning up the instances. The instances are already sending DHCP requests, Neutron hasn't caught up, and you may end up in a condition where your instances have timed out or given up, and you have to reboot them.

Another problem: some default system parameters are too low. You may see this with default ARP table sizes, where the threshold starts at around 512 ARP entries on a host. If you have a lot of instances on a host, you'll find that you have some random or sporadic connectivity issues. You'll also see this when you're running the L2 population driver with Linux bridge. What the L2 population driver does is pre-populate the forwarding table on a host with information about how to reach all the MAC addresses, so that you don't see a lot of broadcast packets on the overlay. Most notably on a network node, say with the L3 agent or DHCP agent: a router that's connected to a couple of networks with a lot of instances has to know how to reach all of them, so you will quickly exceed that table. That's a sysctl parameter that you can change on a host. Also, if you don't have any disk space available on a node, and a service is dependent on writing to a file before actually proceeding, you may prematurely kill the service by running out of disk space.
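The 1450 figure from the MTU discussion lines up with a quick back-of-the-envelope calculation of VXLAN encapsulation overhead (assuming an IPv4 underlay with no VLAN tag on the outer frame; GRE overhead differs):

```python
# Bytes added by VXLAN encapsulation on an IPv4 underlay
OUTER_IP = 20        # outer IPv4 header
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the VNI
INNER_ETHERNET = 14  # the instance's own Ethernet frame header

overhead = OUTER_IP + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
print(overhead)         # 50
print(1500 - overhead)  # 1450: a safe instance MTU on a 1500-byte underlay
```

Dropping the instance MTU to 1450 keeps the encapsulated packet at or under the underlay's 1500 bytes; raising the underlay MTU (jumbo frames) attacks the same arithmetic from the other side.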
So make sure you keep an eye on that. And last but not least, syslog is your friend. There are a lot of messages that aren't logged in a Neutron log file but are logged in syslog. So if you're not finding anything, check your syslog and think outside the box: sometimes an error is somewhat nondescript, but it could be completely related.

Thanks, James. Lastly, we're going to close with some takeaways. Neutron failures can mostly be traced back to configuration file issues, so have a good look there; start with a really basic config if you can, and that should get you back to a running point. A lot of times the problem exists between the keyboard and chair. To help you troubleshoot, getting familiar with KVM and its lower-level workings would be nice: the libvirt definition files and that type of thing. Open vSwitch, too; getting familiar with OVS in itself would be very beneficial, especially when you see things like broadcast storms, so that you know where to look. Linux bridging and iptables are most certainly important areas to review if you have issues or you're running the Linux bridge agent.

Also, familiarize yourself with a working environment if you can. If you have a dev environment that's working, using it as a reference will pay huge dividends. Turn on debug mode; we haven't mentioned that. If you do, just keep an eye on your logs, because they're going to fill up quickly. Start services by hand; a lot of times you'll see errors that way that you don't see in the log. Reach out to the community, and gather as much information as possible before submitting a bug, so that it doesn't get rejected.

And that's it: don't be afraid to break stuff. James and I have a ton of experience doing that, so if you have any questions about breaking Neutron or Nova, we'll be available to talk. Thank you very much for your time.

I just wanted to add to that a little bit. When we say don't be afraid to break things: obviously, not production. But I'm a real big advocate of creating lab environments. VirtualBox is a great place to start if you don't have the resources. Get in there; docs.openstack.org is great, and they have a lot of different scenarios that will help you figure things out. When you're breaking stuff, spin up an instance, let it get connected to a bridge, make sure it works, then start pulling interfaces out of bridges and just see what the behavior is. Try to create failure scenarios, so that you can sort of reverse-engineer things when you do have failures in production; you'll know where to look.

And then lastly, we're giving away some books here during the morning and afternoon breaks. In the morning they're giving away the OpenStack Cloud Computing Cookbook, the latest third edition, and in the afternoon they're giving away the first edition. And coming soon, at the next summit, we'll have the second edition of the same book. In there you'll get a good foundation on what Neutron should look like, along with a reference architecture and some troubleshooting. So stop by and grab a copy. And if you want the slides: again, sometime this afternoon we'll try to throw them up on SlideShare. But it looks like we have some time, so we'll be at the back of the room to see what kind of questions you've got, and we'll do our best. Thank you.