Hi. Good morning, everyone. Thanks for coming in. My name is Sadeek. I'm working as a Cloud Success Architect. This role involves helping customers design, build, and deploy OpenStack clouds successfully. Today I'm going to explore the anatomy of Neutron from a troubleshooting point of view. A brief agenda of what I'm going to talk about today: I'm going to take real-life examples to show you how to explore the anatomy of Neutron and how we can troubleshoot a Neutron problem. The first problem is related to Neutron security groups. The second problem is a failure to get a DHCP IP address from the Neutron DHCP server. The third problem is random failures while connecting to Neutron floating IPs. And the fourth problem is that communication through provider networks is very slow, so slow that it effectively does not work at all and cannot be used. And finally, I will cover some of the lessons learned during the troubleshooting. Before I start, I just want to set your expectations. We are going to explore a limited part of the anatomy of Neutron; there is not enough time to explore the entire thing, so I will cover only the parts that are related to the problems I'm going to speak about. These examples are real-life examples, and the problems and solutions that I'm going to explain are specific to the version where we hit them. If you go home today and try to reproduce them on the latest Neutron version, you may not be able to. So your focus should not be on the specific problem and solution, but more on understanding the anatomy of Neutron and the troubleshooting steps that we followed, so that you can solve a similar problem in the future. Let's go to our first problem: Neutron security group rules are not effective. The problem is that I created a Neutron security group, added some rules to it, and created an instance using that security group. When I try to reach the instance, the rules that I specified for SSH and ping work, no problem. But the problem is that all other network communication to the instance also works, which means the security group rules are not effective. Obviously, the first step is to check whether it's a problem in the way I created the security group. I tried deleting the security group, creating a new one, and attaching it to the instance, and finally I tried attaching the default security group, where everything should be blocked. No luck. Whatever I do, everything to the instance is allowed: I can ping and SSH, and if I bring up an HTTP server, I can access port 80 of the instance. So here the real work starts: trying to understand why everything is allowed even though I have defined the security group properly. The first thing is to understand where exactly the security group rules are applied. If you explore the anatomy of a compute node, specific to that instance, you first need to go to Nova and get the name of the instance. Then you need to find out on which compute node the instance is running. Then you get the port details for that instance and the port name, and from the port name you get the tap interface associated with that port. The traffic from the instance arrives directly on that tap interface.
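As a rough sketch of how you might collect those details, assuming a hypothetical instance called test-vm with a fixed IP of 10.0.0.5, and 914b1f33 standing in for the first characters of the port ID (adjust all names and IDs to your environment):

    # Find the compute node and the libvirt name of the instance
    nova show test-vm | grep -E 'hypervisor_hostname|instance_name'
    # Find the Neutron port that carries the instance's fixed IP
    neutron port-list | grep 10.0.0.5
    # The tap device name is "tap" plus the first characters of the port ID.
    # On that compute node, locate the tap and the Linux bridge it is plugged into:
    ip link show | grep tap914b1f33
    brctl show | grep qbr914b1f33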
From the tap interface, the traffic goes through a Linux bridge, and then it goes through the OVS bridge on the compute node. So the first thing is to get the tap interface, because that is where the security group rules are expected to be applied. Then you need to understand how the security group rules are applied. I have another slide that explains the chains a packet goes through when it reaches the tap interface and enters the iptables chains. First, every packet to an instance enters the FORWARD chain. From there, the packet moves into the neutron-openvswi-FORWARD chain. Then the packets that are meant for the instance go to the neutron-openvswi-sg-chain. This may have changed a bit in newer Neutron releases, but this was the flow when we hit the problem. From there, the packets are identified using the --physdev match rules: the incoming packets are moved into one chain and the outgoing packets into another chain. The chain that the incoming packets reach is where the security group rules are inserted. So you will see security group rules for ping and SSH, allowing ICMP and SSH access, inserted into the neutron-openvswi-oXXX chain for the port. The target for those rules is RETURN. RETURN means: if the packet matches the rule, send it back to the previous chain. From there it goes through the other rules in that chain, if there are any; if there are none and the default policy is ACCEPT, the packet is accepted. If the packet does not match any rule, it goes to the fallback chain, neutron-openvswi-sg-fallback, where the default rule is DROP, and the packet gets dropped. The same is true for the outgoing packets: if a packet matches, it hits the RETURN rule and is accepted; otherwise it is dropped. So, as I explained, the rule should be there in the neutron-openvswi-oXXX chain. And yes, the rules were there: the rules to allow ping and SSH, and to deny all other packets, were in that chain. (Can we take the questions at the end? The "o" chain is the incoming one here; I had the same confusion myself, and I confirmed it.) So the rules were there, and the traffic they allow was allowed, but everything else was also allowed. We tried our best to understand this: the rules are there, everything looks right, and still there is no explanation. The next thing we tried was iptables logging. You can add a LOG rule into any chain, and then you see in /var/log/messages which packets are going through that chain; depending on the rule you create, it shows you the details of those packets. We did this, and we did not see any logs in /var/log/messages pointing to the packets we sent to the instance. This clearly says the packets are bypassing iptables; they never go through any of the iptables chains. That led us to investigate further and look for global parameters in the Linux kernel that might make packets crossing a Linux bridge bypass the iptables chains. We started hunting for that parameter, and we found that there is one, net.bridge.bridge-nf-call-iptables: if you set it to 0, all bridged packets bypass the iptables chains.
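To make this concrete, here is a minimal sketch of the checks we ran on the compute node. The 914b1f33 port ID prefix is a made-up example, and the LOG rule is only a temporary debugging aid:

    # Look at the per-port chains that should hold the security group rules
    iptables -S | grep 914b1f33
    iptables -L neutron-openvswi-o914b1f33 -n -v
    # Temporarily log packets traversing the chain to /var/log/messages
    iptables -I neutron-openvswi-o914b1f33 1 -j LOG --log-prefix "SG-DEBUG: "
    # Check whether bridged traffic is handed to iptables at all
    sysctl net.bridge.bridge-nf-call-iptables
    # A value of 0 means bridged packets bypass iptables; re-enable with:
    sysctl -w net.bridge.bridge-nf-call-iptables=1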
That was exactly the problem with this customer's deployment: when the compute node was provisioned, the deployment tool ran a kickstart post script that set this value to 0. You will not be able to reproduce this on the latest versions, because in Mitaka or Newton, Neutron now dynamically sets this value to 1 when the first instance is started, overriding whatever you might have set during deployment. So that was our first problem. I hope you will be able to troubleshoot security group issues in the future. Hopefully. Let's go to the second problem, which is about DHCP: newly created instances cannot get DHCP IP addresses. The problem is that there are a lot of instances created previously, and if I renew their DHCP lease or reboot them, they successfully get the DHCP lease. But if I launch a new instance, it does not get a DHCP IP address. The first thing we did was explore how DHCP is configured in this environment. Here, DHCP is configured with dhcp_agents_per_network set to 3. That means for each Neutron network you create, there will be three DHCP server instances running in active-active mode. Assuming there are three different network nodes running DHCP agents, Neutron will create a DHCP server for that network on each DHCP agent. The beauty of DHCP is that high availability is built into the protocol itself, so you don't need fancy tools like Pacemaker or keepalived to manage the high availability of the DHCP servers. Here is how it works: suppose an instance is trying to get a DHCP IP address. It first sends a DHCP Discover packet, which is a broadcast, so all of the DHCP servers receive that Discover, and all of them respond with a DHCP Offer. The client then chooses one of the DHCP servers to get the IP from, sets that server identifier in the packet, and replies with a DHCP Request. All of the DHCP servers receive the DHCP Request, but only the server whose identifier is set responds with a DHCP ACK. The others either do not respond or respond with a DHCP NAK. So, finally, the instance gets its DHCP IP address from only one of the DHCP servers. This is not something OpenStack-specific or Neutron-specific; this is how DHCP works outside of OpenStack or Neutron, in a physical deployment, as well. So first we tried to understand this. If you run tcpdump against each DHCP server while things are working, you will see this flow: the server receives the DHCP Discover, sends an Offer, receives the Request, and then only the server chosen through the server identifier responds with an ACK. Once we understood this, the next step was to trace the Layer 2 flow between the instance and the DHCP server. The DHCP servers are running on the network nodes and the instance is running on a compute node. The first thing we did was run tcpdump on the eth0 of the compute node to see what shows up when the instance requests a DHCP IP address: we can only see the DHCP Discover going out, so we don't need to do anything on the compute side, because everything is all right there. Then we ran tcpdump on the network node, and we only see the DHCP Discover arriving, nothing else, so the tunnel, or the VLAN communication, is fine. Then, finally, we ran tcpdump on the tap interface of the DHCP server.
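A rough sketch of the kind of captures we ran at each hop. Interface names, namespace names, and the filters are examples; on the physical interface the packets may be VLAN-tagged or tunnel-encapsulated, so you may need to loosen the filter there:

    # On the compute node, close to the instance
    tcpdump -ni tapXXXXXXXX 'port 67 or port 68'
    # On the compute node's physical interface (4789 is the VXLAN UDP port)
    tcpdump -ni eth0 -e 'port 67 or port 68 or udp port 4789'
    # On each network node, inside the DHCP namespace for the network
    ip netns list | grep qdhcp
    ip netns exec qdhcp-<network-id> tcpdump -ni tapYYYYYYYY 'port 67 or port 68'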
We only see the DHCP Discover, nothing else. So that means the Layer 2 flow is good, and the packet from the instance reaches the DHCP server on all the network nodes, but the DHCP server does not respond with any Offer for the newly created instances. This helped us focus our troubleshooting on the DHCP server side and try to understand how the DHCP server works for Neutron. This is how it works: for each network, on each DHCP agent, Neutron spawns a dnsmasq process with a host file. This host file has the IP address and MAC address mapping for each instance that is created on that network, and the DHCP server is configured to respond only for those instances; any other request coming into the DHCP server will be dropped. We explored this file and saw that it has entries only for the previously created, older instances. The newly created instances do not have their IP address and MAC address mapping in this file. Good. This led us to investigate the next stage: who is responsible for updating this file? It is the DHCP agent, through the message queue communication, that is responsible for updating this file. So we dug into the DHCP agent logs, and we saw a lot of error entries pointing to the message queue connection, on all of the DHCP agents. When we investigated further, this meant there had been a disconnection between the DHCP agent and the message queue, and the specific version of the DHCP agent was missing the code to reconnect in this specific scenario. That means it got completely decoupled, broken off from the message queue, and it never updated the dnsmasq host file for any newly created instances. The immediate solution was simply to restart the DHCP agent. We could have done that at the very first stage, but if we had, we would have missed the entire troubleshooting and the backported patch. As a permanent solution, our developers backported the patch from the upstream version that automatically reconnects after a disconnection. Good. So that was the problem with the DHCP agent. Let's get into the next problem we had: connections to floating IPs randomly fail. If something is working, I'm happy. If something is not working, it's still OK, because I can go ahead and troubleshoot. But think about a scenario where it works for some time, then it doesn't, then after some time it starts working again, then it stops. A random failure like that is really painful to troubleshoot; you have to get to the bottom of it. So when our customer reported that they had random failures while connecting to some of the floating IPs, we first tried to understand how the Layer 3 agent is configured in this environment. In this environment, the Layer 3 agent is configured to spawn three Neutron router instances for each router. That means l3_ha is set to true and max_l3_agents_per_router is set to 3, so if there are three network nodes, we are going to have three L3 routing instances for a specific router. Once a router is created, it will have a default gateway IP from the private network and a gateway IP from the public network, and these IPs are going to be active on only one of the routing instances; the other instances run in passive mode. If the primary instance goes down, keepalived is responsible for moving the gateway IPs to another instance and making the routing and floating IPs work again.
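A minimal sketch of how you can check this HA layout, assuming the classic neutron CLI of that era and placeholder IDs:

    # neutron.conf on the controllers (values used in this environment)
    #   l3_ha = True
    #   max_l3_agents_per_router = 3
    # See which L3 agents host the router and which one is active
    neutron l3-agent-list-hosting-router <router-id>
    # On a network node, check whether this router namespace actually holds the gateway and floating IPs
    ip netns exec qrouter-<router-id> ip addr show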
So we first tried to understand this. VXLAN tunneling is used for the network node to compute node communication, and the floating IPs here live on a VLAN provider external network. Then let's understand and explore the anatomy of how the Layer 3 agent works. What you see on the left side is the communication on the compute node; what you see on the right side is the anatomy of the Layer 3 agent for a router on the network nodes. As I explained, there are three network nodes. Each of them has a qr-XXX interface, which holds the default gateway IP for the private network, and a qg-XXX interface, which has a base IP from the external network plus all the floating IPs. These IPs are active on only one of the nodes; if that node goes down, the IPs fail over to the qr-XXX and qg-XXX of another node. The HA port is responsible for the heartbeat communication between all the nodes for keepalived: it carries the VRRP packets and makes sure the active node is up. If the heartbeat goes missing, the IPs fail over to one of the other nodes. In this case, we verified that one of the nodes holds the IP addresses: it has all the floating IPs and the private network gateway IP. Let's go to the next slide and take only the anatomy of the active node. The active node has br-int, and the external network communication goes from br-int to br-ex over a patch pair, and from br-ex through eth0 to the external network. This is how it was configured, using a provider external network. The first thing we tried was to ping the default gateway IP from the instance. It works; there is no packet drop, it is 100% successful. That means the communication from eth0 of the instance all the way to qr-XXX works without any issues. Then we pinged the base IP of qg-XXX: no problem, also 100% successful. So there is nothing wrong in the anatomy from the instance up to this point. The next thing we did, inside the namespace, was to ping the default gateway configured on the external network. When we tried that, using ip netns exec qrouter-XXX and pinging the external gateway IP, we were able to reproduce the random failure: 10 pings work, then 5 pings fail, then 25 pings work, then 50 pings fail, something like that, purely random. So we got an area to focus on; this is where we need to dig to troubleshoot further. This is what I just explained. The important thing we did here is this: br-ex is an OVS bridge, and if we run ovs-appctl fdb/show br-ex, we see how the switch has learned the instance MAC address and to which port it will send the packets. We kept a watch on this, and suddenly, when the pings from the instance stop working, the port to MAC mapping flaps to port 1. Again, at this point we only know what happens; we still need to find out why the flapping happens. First we need to understand what port 2 and port 1 are. For that, you run ovs-ofctl show br-ex, which tells you that port 1 is eth0 and port 2 is phy-br-ex. For traffic toward the instance to work, the MAC address should be mapped to port 2, because that is the route to reach the instance. If it is mapped to eth0, the packet goes out of the node and never reaches the instance.
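A small sketch of how to watch this in practice (bridge, namespace, and address names are examples from this setup):

    # Watch the learned MAC table of the external bridge while reproducing the failure
    watch -n1 'ovs-appctl fdb/show br-ex'
    # Map the OVS port numbers shown there to interface names
    ovs-ofctl show br-ex
    # Generate the traffic from inside the router namespace
    ip netns exec qrouter-<router-id> ping <external-gateway-ip>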
Then we ran a tcpdump on the eth0 interface, and interestingly, for everything we send out, we see a duplicate request, be it a ping or anything else. Suppose an ARP request goes out from the instance: we immediately see two ARP requests and one reply, then two ARP requests and one reply, and this goes on. That clearly says there is a loop on the switch, or a loop in the external network: whatever you send out through eth0 comes back into eth0 itself. Just because of that, the OVS bridge learns that this MAC address is outside the bridge, and according to that learning, it changes the MAC address to port mapping. Then, when things work again and traffic from the instance goes straight out through phy-br-ex, the OVS bridge learns that the MAC is on the inside again and flips the port mapping back. Then the looped packet comes in again and it flips the port and MAC mapping once more. So this is the root cause. In this specific case, we had to get this resolved on the switch side, on the hardware side. I showed this as eth0, but in the actual case it was bond0 with LACP bonding, and there was a mistake in the LACP configuration done on the switch, so the packets were looping back through a slave interface. An important thing you can do here is enable debug logging within OVS, specifically setting the dpif log modules to debug with ovs-appctl vlog/set. Then you see exactly what happens inside OVS: it clearly says it learned that the MAC address is on the physical port, and when the flapping happens, it clearly says it learned that the MAC address is on phy-br-ex. This is really important. The toughest job here is to prove to the network admin that there is a loop. They always think their system is working perfectly without any issues. You have to gather a lot of evidence, tcpdump results, send it over, and convince them that there is a loop if you want them to look at the switch side. That was the toughest job in this case. And it also involved LACP, because the LACP was not configured properly, and there can be a lot of problems when you configure bonding. So let's go to the next problem: communication to a provider network is very slow. This is, again, one of the painful things to troubleshoot. The important thing is first to understand how the provider network works, so let's quickly jump to the next slide. In short, a provider network actually bypasses the Neutron networking to some extent: you create an instance directly on the external network, the compute node running the instance must be wired into the external network directly, and the instance directly contacts the gateway on the external network. This is what the anatomy of a provider network looks like on a compute node. The packet comes from the instance all the way to br-int, then from br-int there is a patch pair connection between int-br-ex and phy-br-ex into br-ex, then it goes from br-ex out through the physical interface, bond0.301 or whatever you have configured (in this case it is bond0.301), then to bond0, and then through the slave interfaces to the external network. So let's look at how this was done here. In this specific case, br-ex is attached to bond0.301, then the traffic goes through bond0, then through the slave interfaces.
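As a rough sketch of how to inspect this wiring on the compute node (bridge and bond names follow this example setup):

    # Show the bridges, the int-br-ex/phy-br-ex patch pair, and which interface br-ex uses
    ovs-vsctl show
    # Check the VLAN sub-interface and the bond underneath it
    ip -d link show bond0.301
    cat /proc/net/bonding/bond0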
Then the VLAN provider external network was created using this command: network type VLAN, with a segmentation ID of 171. You can already smell what the problem is here; it was a user error, basically. We first did a tcpdump, and we saw that the packets were getting completely fragmented in all the captures. Then we lowered the MTU on the instance to 1450, and everything worked perfectly without any issues. But is that the solution? Obviously not. This is not a VXLAN network. VXLAN has an overhead of some bytes, so there you need to lower the MTU, and that is a non-issue we can accept as a solution. But this is a VLAN network, and if it is a VLAN network and you suggest lowering the MTU, you are missing the real problem. So we dug deeper into why it does not work with the 1500 MTU. We did a tcpdump, and when we analyzed it, we saw that the packets are getting tagged two times: the tag 171 is added, and then it is tagged again with another VLAN tag. So the packet has two VLAN tags inside it, which means it always exceeds the 1500 MTU and causes the fragmentation that cripples the network communication and makes it very slow. Then we tried to understand why it is getting tagged two times. This is where you need to understand how the provider network flow works. If you create a VLAN provider network with a provider VLAN ID of 171, and you have br-ex, bond0.301, and bond0 configured to send the packets out to the external network, Neutron adds an OVS flow in br-ex for this network saying: when a packet comes from br-int into br-ex, it carries an internal VLAN tag; strip the internal VLAN tag and add 171 to the packet. So you've got one VLAN tag added. Then the packet is sent through bond0.301, so another VLAN tag, 301, gets added to it. By the time it reaches bond0, it has two VLAN tags. So obviously this is the wrong configuration. What actually happens, then, is this: the packet arrives with the internal VLAN tag, and when it comes to phy-br-ex, the flow strips the internal VLAN and adds the 301 VLAN (sorry, in the earlier command I used 171 as the VLAN ID; it should have been 301). Then it goes through bond0.301, then bond0, and obviously a double VLAN tag has been added by the time the packet reaches there. This causes all the performance problems. To make it work, br-ex should have been configured with bond0 directly: you don't need to create bond0.301 and attach bond0.301 to the bridge. If you do attach bond0.301, then you need to recreate the network as a flat provider network instead of a VLAN provider network. If you want that, configure it that way, but that is not a scalable model. So this was a simple mistake, a misunderstanding, by our customer, and we rectified it. But to get to the bottom of it, we had to spend a lot of time and effort. You may say that we should have explored br-ex first and seen how it was configured, but that is not practical, because the anatomy of Neutron is too complex: you don't know where to start, you may start from the wrong side, and by the time you work your way to the right side you have already spent a lot of effort.
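A hedged sketch of the misconfiguration and the fix, with made-up network and physnet names; the exact neutron CLI options may differ slightly by release:

    # Broken: br-ex sits on the VLAN sub-interface AND the network carries a VLAN segmentation ID,
    # so every packet ends up with two 301 tags.
    #   ovs-vsctl add-port br-ex bond0.301
    #   neutron net-create ext-net --router:external \
    #       --provider:network_type vlan \
    #       --provider:physical_network physnet1 \
    #       --provider:segmentation_id 301
    # Fix: attach br-ex to the raw bond and let the Neutron flow add the single VLAN tag.
    ovs-vsctl del-port br-ex bond0.301
    ovs-vsctl add-port br-ex bond0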
So finally, some of the lessons learned throughout these troubleshooting exercises. Collecting the prerequisites to start investigating a Neutron problem is time consuming and confusing. You have to get the instance name, identify on which compute node the instance is running, and then get the name by which the instance shows up in the virsh list, which is a different name than the instance name. Then you need the port name and port number, and the internal VLAN tag used for the network on that compute node, because the internal VLAN tag is different for the same network on different compute nodes; that internal VLAN tag is local to the compute node. So you collect all of this information, write it down, and only then can you get on with the actual troubleshooting. Collecting this information alone is very time consuming. Then there are too many hops at which to run tcpdump while troubleshooting. You don't know where the packet is going to get lost, so if you start from the instance, there are a lot of hops where you may have to run tcpdump to understand where the packet is failing. That means, if you work with a customer, you get a bunch of tcpdumps, 10 or 20 from different interfaces, and then you look at where the packet shows up and where it does not. That is a big effort. Also, understanding the OVS topology is time consuming. There is a tool to mitigate this; I forgot the name of that tool and didn't get time to search for it. I thought I would update the slide later, but I missed that. When you run that tool, it gives you a graphical overview of the OVS topology on the compute node and network node: this is how the packet is expected to go through br-int, from br-int to phy-br-ex, and all the way to the external network, and so on. Next, do not assume Neutron is always wrong. Neutron is only one part of this, and it depends a lot on external components like dnsmasq and keepalived, and a lot of other things like the kernel configuration, the networking configuration, and Open vSwitch. So your problem may not always be in Neutron; it may be deep in the kernel, in Open vSwitch, or in something external to your OpenStack environment. Another problem is hunting for expertise, and each of these components is challenging. If you suspect something is wrong with dnsmasq, you need to find an expert who knows how dnsmasq works. If you suspect something is wrong with keepalived, you need to hunt for a person who knows how keepalived works under the hood. It is really difficult to build all of that expertise in the one person who is trying to troubleshoot, so you need to reach out to different people to understand how things work. And finally, when you troubleshoot Neutron, you will have to tread a lot of wrong paths before you get onto the right track. This is expected: you will first focus on the wrong area, where nothing is wrong, then gradually isolate things, and you will have gone down a lot of wrong paths before you reach the right one. And this is all I have for this session. There are some interesting sessions coming up from Red Hat, and you are recommended to go and attend them. One important one is that OpenStack troubleshooting is so simple, even your kids can do it.
So if you are afraid that Neutron is too complex after seeing these examples, you are recommended to go there and watch that; they will make it easy for you. And finally, questions. I don't know how much time we have left, but if there are some quick questions, I'll be happy to go through them. Yeah. Want to use the microphone?

[Audience] So thanks for the talk, it was really interesting. I'd be really interested to know about this tool that you mentioned, so if there's a way we could perhaps get a link to it, that would be really valuable.

[Speaker] The tool, you mean?

[Audience] The tool to draw the topology, that would be very interesting.

[Speaker] Sorry, I think, actually, I'm using Google Slides. I used Google Slides itself to draw that topology.

[Audience] I see.

[Speaker] I didn't use any other tools.

[Audience] Oh, I see. OK, so I did the same thing, with Google Draw, I'd say. This is exactly what I thought; I thought if there was an automated tool I could use that.

[Speaker] There is no automated tool. I drew this manually using Google Draw.

[Audience] OK, understood. Thank you.

[Speaker] Thank you. Yeah, sure.

[Audience] Yeah, just a quick one. Thanks for the interesting session. I'm curious about when you do the troubleshooting: besides OVS, do you try any other virtual switch, like Linux Bridge? And how would you say, is there any difference between Linux Bridge and OVS in a real-world environment?

[Speaker] I remember using Linux Bridge in Grizzly and Havana. After that, I have not had many problems reported, and I have not seen many customers reporting problems with Linux Bridge. And our focus is completely on OVS and the native ML2 driver for this talk. When it comes to the other plugins, the third-party plugins, there are a lot of plugins that I don't have any exposure to troubleshooting, so we usually collaborate with Nuage or whoever is providing that plugin to do the troubleshooting. Specifically for Linux Bridge, I think for the last two years even I haven't worked with it.

[Audience] Yeah, OK. Thank you.

[Speaker] Thank you very much. So if there are no more questions, thanks for coming in. That's all I have.