reviewer for Neutron, and today I'm going to talk about "I can't ping my VM". It's a talk about common problems in Neutron, and I will provide ways to solve them.

Usually, when somebody comes to me with a networking problem, this is the kind of expression that person has. If you don't know much about OpenStack or Neutron, it's just very hard to understand what's going on. People try to fix their problem with random solutions, like rebooting the VM or restarting Neutron or any other kind of weird stuff, and they try to find a solution online, but it's not always there. In the end they just get very upset and want to break everything. So what I tell you now is: just breathe and relax, because after this talk you will probably be able to solve the most common issues, and you'll be happy.

We can classify the most common problems in two categories. I have to say that most of them are actually due to misconfiguration, so you put something wrong in the config files. This happens a lot if you install OpenStack yourself, but it might also happen if you're using a tool and you're not configuring the tool correctly. Or it may be a misconfiguration of the underlying network, because even if Neutron is virtual, the packets actually flow in the physical network. So if the physical network has problems, of course Neutron won't work. A common problem might be a firewall that is filtering some packets, or a switch that is configured not to let some VLAN IDs go through.
So these are common issues. Or you might actually encounter a real bug in the code. In this case you might find a solution online, because I don't think you're the only one hitting it. And if you are the very first person to find this bug, then you should file a bug report. The Neutron team will look into it and will probably come back to you with a solution quickly.

So let's start with the first issue that you might find: I can't ping or I can't SSH into my VM using its private IP. The way I structured this talk is that I have a few problems, and for each of them, in the first part I will give you the background that you need to understand what's going on, and in the second part I will actually provide a solution, or guide you towards finding one.

This problem, pinging the private IP of a VM, is not very common, I have to say. The most common is probably pinging the floating IP, but I think this one is easier, so it's better if we tackle it first. Then it will be easier for you to understand the floating IP case.

These are the first checks that you can do, and they are no-brainers. Even if you don't understand the internals, you will be able to perform them. The first one: is the VM up and running? This might seem very trivial, but actually it's not so trivial. Believe me, many times somebody has come to me saying "oh, Neutron is not working", and then the VM is not even up. Of course, if the VM is not up, you won't be able to reach it. A good way to check that is to do a nova list and check the status of the VM. If you see that the VM is in an error state, you should understand why. To do that, you can just grep for TRACE in the Neutron and Nova log folders, and you will probably find a stack trace that tells you what's going on. Sometimes it might not even be Neutron related.
Maybe you ran out of disk space, and that's why the VM is not up. So I suggest you check that first.

Another kind of trivial check: remember that the default security groups don't allow ICMP. You have to configure that. If you don't, then of course you won't be able to ping your VM, because the traffic will be blocked.

The last check is about the underlying network, so the physical network. Check that the nodes of your cloud are able to ping each other, because if you can't reach a node, of course you have a problem there, and this won't work.

Okay, so now let's dig further and understand a bit better how all these things work. First of all, we need to answer this question: how does a VM get an IP in Neutron? To do that, we have to introduce a Neutron agent, the DHCP agent. As the name says, it's the agent in charge of DHCP. For those of you who are not very familiar with networking, DHCP is a protocol that provides IPs to machines. The DHCP agent communicates with the Neutron server using RPC. It ensures network isolation using namespaces, so every network has its own DHCP namespace. Inside this namespace there's a process called dnsmasq, which is the one actually serving DHCP. The DHCP agent configures dnsmasq using a lease file. So now let's see in more detail how this IP allocation works.
Let's start from Nova compute, which receives a request to create a VM. Nova compute will ask the Neutron server to allocate the network, you see point one. We'll see in the next slides in more detail how this all works, but for now, for DHCP, let's assume everything went fine on the Neutron side and the port was created successfully. At this point the Neutron server will send a notification to the DHCP agent, see point two, the port create end notification. So the DHCP agent knows that there's a new port, a new IP to serve, and it will update the lease file for dnsmasq using a method called reload_allocations, point three. It will also force dnsmasq to load this new configuration file, so that the new IP will be served and the VM will get its IP.

Now I really want you to understand the journey of the packet, because I think that in networking problems it's really important to know where the packet is supposed to be, so that you can investigate and really see where the packet gets dropped or lost, and understand a bit more what the problem is. We have two default implementations in Neutron: Open vSwitch and Linux bridge.
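To make points two and three of the allocation walkthrough concrete, here is a toy model in Python. It is only a sketch: the class names, the in-memory dictionaries standing in for the lease file, and the `reloads` counter are assumptions made for illustration, not Neutron's actual code. The `qdhcp-` prefix is the usual name of a network's DHCP namespace.

```python
class ToyDnsmasq:
    """Stands in for the dnsmasq process of one network (illustrative only)."""

    def __init__(self):
        self.served = {}   # MAC address -> IP currently served
        self.reloads = 0   # how many times we were told to reload

    def reload(self, entries):
        # The real process re-reads its files when asked to reload;
        # here we simply copy the agent's entries.
        self.served = dict(entries)
        self.reloads += 1


class ToyDhcpAgent:
    """Hypothetical, simplified DHCP agent: one dnsmasq per network."""

    def __init__(self):
        self.entries = {}  # network_id -> {MAC: IP}, our stand-in "file"
        self.dnsmasq = {}  # network_id -> ToyDnsmasq

    def namespace(self, network_id):
        # Every network gets its own DHCP namespace.
        return "qdhcp-" + network_id

    def port_create_end(self, network_id, mac, ip):
        # Point two: the server notifies the agent about a new port.
        self.entries.setdefault(network_id, {})[mac] = ip
        self.reload_allocations(network_id)

    def reload_allocations(self, network_id):
        # Point three: refresh the file and make dnsmasq load it.
        proc = self.dnsmasq.setdefault(network_id, ToyDnsmasq())
        proc.reload(self.entries[network_id])

    def offer_for(self, network_id, mac):
        # What dnsmasq would offer to a VM asking from this MAC, if anything.
        proc = self.dnsmasq.get(network_id)
        return proc.served.get(mac) if proc else None
```

In this model a VM only gets an offer after the notification has been processed, which mirrors the later debugging check that the VM's IP must be present in the file dnsmasq reads.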
I will explain both, so let's start with Open vSwitch. You see here we have a compute host, where the VM is running, and the network host, where the DHCP agent is running, and you see the DHCP namespace there.

Let's start from point one, which is the VM requesting an IP. This request will go through the firewall bridge. The firewall bridge is a Linux bridge, and it's there to be able to apply security groups. Security groups in Neutron are firewall rules, and they are implemented using iptables. Unfortunately you cannot apply iptables to an interface that is connected to an Open vSwitch port. That's why we had to put this Linux bridge in the middle, and that's why it's called the firewall bridge.

So iptables will let the packet go through, and it will reach the integration bridge at point two. The integration bridge is the bridge in charge of tagging and untagging the traffic that is coming from the VM and going to the VM, using the VLAN ID associated with the network. Every network has a VLAN ID, and this VLAN ID is used internally in the compute host to isolate the traffic.
That's why we call it the local VLAN ID. In this configuration I'm using tunneling, so of course you see the tunnel bridge, and the packet will then go to the tunnel bridge, point three. The tunnel bridge, as the name says, is the bridge in charge of the tunneling, so it has the flows that translate the VLAN ID assigned to the network into the segmentation ID. For example, if you're using GRE tunnels, the GRE tunnel ID will be the segmentation ID assigned to the network. At this point the packet will be encapsulated and will go outside of the host, so through the wire, and will reach the network host. The tunnel bridge on the network host will decapsulate the packet, as you see at point four, and the packet will go to the integration bridge. Finally, at point five, it will reach the DHCP namespace. Inside the DHCP namespace there's dnsmasq running, so it will get the request of the VM and reply with a DHCP offer, offering the IP that is assigned to the VM.

Let's now see the Linux bridge implementation. You see that it's slightly simpler. Let's start again from the compute host. You see the VM sending the request at point one. In the Linux bridge implementation we have one Linux bridge for every network, so you see the net1 bridge, the bridge corresponding to net1, and the VM is on net1 in this graph. I'm assuming we are using VLANs, and the VLAN assigned to net1 is VLAN 100. So you see that the interface plugged into the net1 bridge is eth0.100, the one corresponding to VLAN 100. At point two the packet will go through this interface and then outside of the host, tagged with VLAN 100. It will reach the network host, where it will be untagged.
At point three it will get to the net1 bridge on the network host, and then at point four it will finally reach the DHCP namespace, where again there's dnsmasq serving the IP for the VM.

Now, if you want to dig further, if you want to find the problem, these are the kinds of questions you should think of. First check: did the VM receive an IP? You can check that from the console. You can just do ip addr to see if the VM has got an IP. If not, of course you have a problem there, and of course you can't ping the VM.

If the VM didn't receive an IP, let's try to understand why. The first thing to check is: is the DHCP agent up and running? Of course, if the DHCP agent is down, it won't work. Another thing you can check is whether dnsmasq is running inside the DHCP namespace. Then you can check if the lease file is correctly filled. For example, if the IP for the VM is not there, then of course you have a problem. Something you can check regarding the underlying network is whether your physical switch is maybe not allowing some VLAN IDs. Check that, because of course it would cause problems.

And then the last resort, and this is true for all networking problems: just tcpdump all the way, from point one to two to three, to see where the packet gets lost and try to understand what the problem is there.
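The checks for this first issue can be collected into one small helper. This is a sketch under assumptions, not an official tool: the function name is made up, `neutron agent-list` refers to the legacy neutron command-line client (not mentioned in the talk), and `ip netns pids` is used here as one way to see which processes run inside the DHCP namespace. The helper only builds the command strings; run them by hand on the right host.

```python
import shlex


def dhcp_debug_commands(network_id, interface="any"):
    """Return shell commands for the 'VM got no IP' checks, in order.

    network_id: UUID of the Neutron network (the DHCP namespace is
    assumed to be named qdhcp-<network_id>).
    interface: interface to capture on; 'any' captures on all of them.
    """
    namespace = "qdhcp-" + network_id
    commands = [
        # Is the VM up, or in ERROR state?
        ["nova", "list"],
        # Is the DHCP agent alive? (legacy neutron CLI, an assumption)
        ["neutron", "agent-list"],
        # Which processes live in the DHCP namespace? dnsmasq should be there.
        ["ip", "netns", "pids", namespace],
        # Last resort: watch DHCP traffic (UDP ports 67 and 68) hop by hop.
        ["tcpdump", "-n", "-i", interface, "port", "67", "or", "port", "68"],
    ]
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in dhcp_debug_commands("NETWORK_ID", interface="eth0.100"):
        print(line)
```

Inside the VM itself, the matching check is simply `ip addr` from the console, as described above.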
The next issue that you might find is that the VM can't reach the external world. For example, you're trying to reach openstack.org from the VM, and nothing.

To understand this problem we have to introduce a new agent, the L3 agent. It's the agent that in Neutron is in charge of providing L3 connectivity and NATting. It runs on the network node, same as the DHCP agent, and it also uses namespaces to ensure network isolation. Actually, the router in Neutron is implemented using namespaces. And it's the agent that provides access to the external network.

Now again, let's see in both cases the journey of the packet when the VM is trying to reach the outside world. Same as before, you see the VM at point one sending this packet that is supposed to reach the outside world. It will go through the firewall bridge, which lets the packet through, then to the integration bridge and then to the tunnel bridge. The packet is encapsulated and goes through the tunnel. It will reach the network host, and at point four it will be decapsulated. It will go through the integration bridge and then reach the router namespace, because the router is the default gateway for the VM. At this point what happens is that there are iptables rules that perform SNATting, so the private IP of the VM is translated into the public IP of the router. After that the packet will go to the external bridge, which is the default gateway from the router namespace, and from the external bridge it will go to the outside world.

It's the same story for Linux bridge, just with a different architecture. We have the VM sending the packet. It goes through the net1 bridge, point one. Then at point two it will go through eth0.100. It will go outside tagged with VLAN 100.
It will be received on the network host. At point three it will reach the net1 bridge on the network host. At point four it will reach the router namespace, because the router is the default gateway. It will be SNATted, then at point five it will go through the external bridge, and at point six it will reach the outside world.

In this case, what are the things that you need to check? The first one is very trivial: to be able to reach the external world, the network needs to be connected to a router that has access to the external world. You do that in Neutron when you set the router's gateway, so the router will be connected to the external network, and then the VM will be able to reach the external world. Of course, if you are on a network that has no router, or whose router is not connected to the external network, this won't work.

Another check that you can do is: can the VM ping the router? This is to narrow down a little bit the part of the journey that you have to investigate to find the problem. For example, if the VM cannot ping the router, you know that it will never reach the external world, so you can focus your attention on the part of the journey from the VM to the router and see what the problem is there.

Another thing is: is the external bridge configured correctly? Remember that you have to be able to reach the external network from the external bridge. So you need to have an interface plugged into the external bridge from which you're able to reach the external world. If that's not the case, of course you have no connectivity. You can check that using ovs-vsctl show, which will dump the configuration of the bridges in the OVS case, or brctl if you're using Linux bridge.

Then another check is: can you reach the external world from the router namespace?
This is again to narrow down the part of the journey that you analyze, because of course if the router is not able to reach the external network, the VM won't be able to either. And again, if you're using VLANs, make sure that those VLAN IDs are allowed in the underlying network.

Now, I think this is probably the most common one: I can't ping or SSH into my VM's floating IP. Again, let's see the journey of the packet. We have the Open vSwitch implementation. As before, the VM is sending the packet at point one. It goes through the firewall bridge, which lets the packet through. Then at point two it will reach the integration bridge, where it will be tagged. Then at point three it reaches the tunnel bridge, which encapsulates the packet and sends it through the tunnel. It will be received on the network host from the tunnel and decapsulated at point four. It will reach the integration bridge, and then at point five it will reach the router namespace. At this point, since the VM has a floating IP assigned, there are iptables rules that SNAT the packet, so the source IP, which is the private IP of the VM, will be translated into the floating IP assigned to the VM. Then at point six the packet will go to the external bridge, which is the default gateway, and at point seven it will reach the external world.

I have the same on Linux bridge, but this time let's see it the other way around, which is probably more fun for you. Let's start from point six, so from the external world. In this case I am pinging the VM, so the destination IP is the floating IP of the VM. At point six the packet will go through the external bridge. From the external bridge it will go to the router namespace, point five, where a floating IP is assigned.
There's an iptables rule that translates the destination IP of the packet, so the floating IP will be translated into the private IP of the VM. From the router namespace it will go to the net1 bridge, then again through the interface eth0.100, where it will be tagged. It will reach the compute host, be untagged at point two, and reach the net1 bridge. If the security groups let the packet through, then at point one it will finally reach the VM.

So what do you have to check for this problem? Again, remember: did you configure the security groups properly? You have to allow ping and SSH. Another thing you can check is: is pinging the private IP of the VM working? That's the first case in this presentation. Of course, if you can't ping the private IP, you have a problem there and you won't be able to ping the floating IP, so it's better to investigate the private IP case first. Then: can the VM ping the router? Because if the VM cannot reach the router, of course it won't be able to reach the external world, and the floating IP won't work. Another thing you can check is: can you ping the VM from the router namespace?
You can also check if you can ping the floating IP from the router namespace. It's a silly check, because the floating IP actually lives in the router namespace, but if this is not working then you know that there's something really messed up in the configuration of the router namespace. Then again, another thing that you can check is the configuration of the bridges, with ovs-vsctl show, to see if you can spot some problem there. And again, the last resort is to tcpdump all the way, to see where the packet gets lost or filtered or whatever.

Another common issue: the VM can't reach the metadata server. The metadata server is the service that serves the metadata for the VM. For example, if you have an SSH key, the metadata server will serve that to the VM. In Neutron there's the metadata agent, which is the agent in charge of proxying the requests from the VM to the metadata server, so to Nova. There are two ways you can configure it: routed networks, which is when you have a network that is connected to a router, and we'll see more in the next slides; and non-routed networks, when you have a network that is not connected to a router, so it's isolated. Let's see both cases.
This is the case of routed networks. In this case the metadata proxy, which is a process spawned to proxy the requests to the metadata agent, is spawned by the L3 agent and lives in the router namespace.

Let's now see the journey of the packet when one of the VMs is trying to get its metadata. I'll explain only the Linux bridge implementation. You see on the compute host, at point one, the VM is sending this packet. It will go to the net1 bridge, then through eth0.100, point two, where it will be tagged using the VLAN ID assigned to the network. It will reach the network host at point three, then the net1 bridge on the network host, and at point four it will reach the router namespace, because again the router is the default gateway for the VM. In the router namespace there's an iptables rule installed that redirects all the traffic meant for the metadata server to the metadata proxy that was spawned by the L3 agent. The metadata proxy is a process that is there listening for requests. It will get the request of the VM and add some information in the HTTP header: the IP of the VM and the router ID. Then it will forward the packet to the metadata agent, point five. The metadata agent, using the IP of the VM and the router ID, will be able to request the instance ID of the VM from the Neutron server. This is needed to forward the request to Nova, at point six, so that Nova will be able to serve the metadata for the VM.

The other configuration that you can have is isolated networks. This is when you have a network that is not connected to a router, but you still want the metadata to be served to the VM. You need to enable a flag in the DHCP agent configuration file. If you set the isolated metadata option to true, this is how it will work.
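Going back to the routed case for a moment, the header handling done by the metadata proxy and the metadata agent can be sketched in a few lines of Python. Treat this as a toy model: the function names and the `ports` list (a stand-in for asking the Neutron server) are invented for illustration, and the header names `X-Forwarded-For`, `X-Neutron-Router-ID` and `X-Instance-ID` are given from memory of Neutron's behaviour, so verify them against your release.

```python
def proxy_forward(request_headers, vm_ip, router_id):
    """Metadata proxy step: add the VM's IP and the router ID.

    Returns a new header dict; the original request is left untouched.
    """
    headers = dict(request_headers)
    headers["X-Forwarded-For"] = vm_ip          # assumed header name
    headers["X-Neutron-Router-ID"] = router_id  # assumed header name
    return headers


def agent_forward(headers, ports):
    """Metadata agent step: resolve the instance ID, then forward to Nova.

    ports is a hypothetical list of dicts with the keys 'router_id',
    'ip' and 'device_id'; in a real deployment the agent asks the
    Neutron server instead. Returns the headers to send to Nova, or
    None if no port matches (the VM would get an error).
    """
    for port in ports:
        same_router = port["router_id"] == headers["X-Neutron-Router-ID"]
        same_ip = port["ip"] == headers["X-Forwarded-For"]
        if same_router and same_ip:
            out = dict(headers)
            out["X-Instance-ID"] = port["device_id"]  # assumed header name
            return out
    return None
```

The router ID is needed because private IPs can overlap between tenants: the IP alone does not identify the VM, but the pair of router and IP does.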
I explained before that the DHCP agent is the agent that serves DHCP. The DHCP protocol is of course used to assign an IP to a machine, but you can also specify other options. For example, you can inject a route using option 121, and that's what is used in this case. When the VM boots and requests an IP, it will also get a route injected. This route sets, as the next hop to reach the metadata server, the IP of the DHCP port, which is the IP of an interface that resides in the DHCP namespace. So the VM will know that to reach the metadata server it needs to go through the DHCP namespace as the next hop.

Let's go through the journey. At point one the VM is requesting the metadata. The request will go through the net1 bridge, then at point two through the eth0.100 interface. It will be tagged and will reach the network host. It will be untagged at point three, reach the net1 bridge, and then at point four reach the DHCP namespace. In the DHCP namespace we have again the metadata proxy, which this time was spawned by the DHCP agent. The metadata proxy listens for requests, so it will take the request of the VM and again add some information in the HTTP header, specifically the IP of the VM and the network ID of the network. It will forward the request to the metadata agent, point five. The metadata agent will get the instance ID of the VM, put it in the HTTP header, and forward the request to Nova.

In this case, the things that you should check are: is the metadata agent up? Of course, if it's not up, it won't work. Is the metadata proxy up? Then you can look at the logs of the Neutron metadata agent and of Nova metadata to see if you find some trace that might help you understand the problem.
Then you can check whether the metadata server is reachable from the router namespace or from the DHCP namespace, depending on whether you're using a routed network or an isolated network. And there is one specific check for isolated networks: make sure that the image you are using for your VM supports option 121, because if it doesn't, of course it won't receive the injected route and it won't work. Some time ago the CirrOS image, for example, did not support option 121. And then yes, the last resort: just tcpdump all the way to see where the packet gets lost.

So we have the last issue: VIF plugging timeouts. I don't know if any of you has had this kind of problem. To understand why we're getting a timeout in this plugging, we need to introduce another agent, the L2 agent. This is the agent that runs on the hypervisor, so on the compute nodes. It's the one that configures the local switches, for example br-int and br-tun, or in the Linux bridge implementation the bridges corresponding to the networks, and so on. It communicates with the Neutron server over RPC, and its main task is basically to wire new devices, where by devices I mean tap interfaces that are created by Nova and are connected to VMs. It's also the agent in charge of applying security group rules, and as I was saying before, they are implemented using iptables and ipset.

Now let's see in more detail how the VIF plugging works. We start from Nova compute, which receives a request to create a VM. Nova compute will ask Neutron to allocate the network at point one, and at the same time it will use a VIF driver to plug the interface into the local switch, point two, so that the tap is plugged into br-int, for example, in the OVS implementation. The L2 agent constantly monitors for updates to the interfaces.
So it will notice that a new interface was added, at point three, and of course it will try to wire it. To wire it, it needs to get some information from the Neutron server, so at point four it will ask the Neutron server for the device details. The Neutron server will then know the port ID and the host ID of the host where the L2 agent is running, so it's able to bind the port, at point five, that is, to write the association between port ID and host ID in the database. Then, when the L2 agent is done wiring the device, at point six it will notify that the device is up.

When Nova at point one makes this request to Neutron, it actually starts a timeout. The default value is five minutes. So if Nova doesn't hear back from Neutron within five minutes, it will throw this VIF plugging timeout error.

If you get this error, the things that you should check are, of course: grep for errors in the Neutron server and L2 agent logs. The Nova logs could also be very useful. Something to notice is that if your system is very slow, or if you're performing some kind of stress test, it might happen that you just need to tune the configuration values. So it's not really a bug, it's just that you need to adjust your values to match the speed of your system. You can, for example, increase vif_plugging_timeout in the Nova configuration and give more time to plug the interface, or you can increase rpc_thread_pool_size and rpc_conn_pool_size to make the processing faster.

Now I will quickly go through useful tools that you can use to debug Neutron or networking issues. First of all, to get general info about the system, you can use ip addr to display all the addresses that are on the machine, route -n to display all the routes, iptables -L to display the iptables rules, arp to see the ARP table of the machine, and then tcpdump.
This is a very useful tool. It basically displays all the packets that are going through a machine or through an interface, and there are lots of flags that you can use. For example, you can specify an interface with -i, so you'll get only the traffic that is flowing through that interface. You can filter by protocol. You can capture on a virtual bridge, for example, with -n -i. You can run tcpdump inside a namespace. You can use logical operators, like -n arp or icmp if you want to get both ARP and ICMP packets. And you can also use -i any if you want to get the traffic of all the interfaces on the machine.

To work with namespaces we use ip netns. You can list all the namespaces on the machine with ip netns list. You can execute a command inside a namespace with ip netns exec, and there are several examples of the commands you can execute in a namespace.

Then, specific to Open vSwitch: you can use ovs-vsctl show to show the configuration of the bridges on the machine. You can use ovs-dpctl show if you want to know more about flows, hits and misses, and the ports on the bridges. You can use ovs-dpctl dump-flows to get a dump of all the flows installed on a machine, or you can get the flows corresponding to one bridge with ovs-ofctl dump-flows and the name of the bridge, or the flows specific to one table with dump-flows, the name of the bridge and the number of the table.

For Linux bridge you have brctl show, which will show the configuration of the Linux bridges on the machine, and you can also get the configuration of just one bridge by specifying the name of the bridge.

I put useful links at the end, if you want to know more about this topic.
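The tcpdump and ip netns options listed above combine naturally, so here is a small sketch that builds such a command line. The helper name and its parameters are made up for illustration; it only assembles the string and does not run anything.

```python
import shlex


def tcpdump_command(interface="any", expression="", namespace=None):
    """Build a tcpdump command line as a single shell-safe string.

    interface:  value for -i ('any' means all interfaces)
    expression: optional capture filter, e.g. 'arp or icmp'
    namespace:  optional network namespace to run tcpdump in
    """
    # -n disables name resolution, which keeps the output readable.
    cmd = ["tcpdump", "-n", "-i", interface]
    if expression:
        cmd += shlex.split(expression)
    if namespace:
        # Prefix with ip netns exec to capture inside the namespace.
        cmd = ["ip", "netns", "exec", namespace] + cmd
    return shlex.join(cmd)


if __name__ == "__main__":
    # ARP and ICMP on the integration bridge of an OVS host.
    print(tcpdump_command("br-int", "arp or icmp"))
    # ICMP on every interface inside a router namespace (placeholder ID).
    print(tcpdump_command(expression="icmp", namespace="qrouter-ROUTER_ID"))
```

Running the same filter at each hop of the packet's journey, one interface or namespace at a time, is the "tcpdump all the way" approach described in the talk.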
These are links that I found well done and useful. And there's a feedback button at this summit, so that you can give feedback on the sessions that you attended. I suggest you use this button, so that for the next summit we can select better talks. So thanks everybody, and if you have questions, we have maybe a couple of minutes. Can you go to the microphone, please?

Okay, you talked about the VIF plugging timeout, right? Yeah. So I'm just wondering: if something goes bad in Neutron, and Neutron could not respond to Nova, what happens to the resources which are already allocated by Neutron? Will Neutron do a cleanup of those resources, or how is that handled?

Sorry, I can't hear you very well.

Is it better now? Okay, so what I mean to say is: in the error scenario, what does Neutron do with the resources already allocated? Let's say I create a port, but it could not be handed to Nova on time, and Nova hits the timeout. What will Neutron do with the created resource?

Okay, so what happens now is that, let's say Neutron is too slow, so you get the timeout. Nova will clean up the resources, so basically it will tell Neutron to destroy the port. That's what happens now. And you can actually tell that it's your system, and that you should adjust your configuration values, precisely because that happens. You see that on the Neutron side everything is fine: the port was created, the interface was plugged, it just was not done on time. Maybe it was done, I don't know, 10 seconds later, and then you get the order from Nova to clean up, and you see that Neutron cleans up.

Okay, so are we seeing some kind of synchronization issue with this? Are any other problems being noticed?

No, I don't think it's a synchronization issue.
It's just that, in this specific problem, of course you have to take into consideration the performance of your system. If your system is too slow, you just have to adjust the configuration values. I agree that it could be done automatically, but it's not there at the moment, and it requires quite some work.

Okay, thanks.

Great talk, Rossella. I was wondering about option 121. You said CirrOS does not support it. What's the requirement for a VM image to support option 121?

Okay, I would say it's supported by 99% of the images, so it's very standard. It would be really weird if it's not supported, but it's a possibility.

Thanks.

Okay, any other questions, or we can close here. Thanks.