Hello everyone. I'm going to talk a little bit about troubleshooting Neutron, a real quick talk. Some quick background first: we're a public cloud provider, so we have all kinds of customers using the service. We run eight sites in six countries, and everything is built on standard OpenStack components as far as has been possible for us. What this talk is based on is our experience with Neutron, ML2 and Open vSwitch. We started out with Icehouse once upon a time, and nowadays we're on Mitaka, Newton or Ocata, depending on where in the upgrade cycle the particular installation is.

First, a few words about where to look when you're investigating an issue that has to do with Neutron or the networking of the installation. The number one thing is, of course, log files. You have all the logs from Neutron and its agents, you have log files from the operating system, and you might have network equipment that generates logs as well.

Configuration files are always interesting to look at too. By that I mean both the static configuration files, neutron.conf and the agent configurations and so on, and the dynamic configurations that, for example, VPN-as-a-Service or Load-Balancer-as-a-Service generate, to check that they actually match what you expect given how you've configured the services.

Source code diffs, reviews and bug reports are also valuable, especially if you find a bug after doing an upgrade; it's really useful to look at the diffs and see what changed between the releases. It works the other way around as well: if you're running an older version, you can look at the place where you think the problem is located and see what code has changed there since, to figure out whether it's already fixed upstream, even if you can't find a specific bug report for it.

The ip netns exec command is really useful. You can execute commands inside the network namespaces and see how things are set up inside a router namespace, for example. In the simplest case that just shows you the IP addresses and interfaces, but you can also run tcpdump and similar tools inside the namespace. And tcpdump is always your friend when you're doing network analysis of any kind.

Also get to know the Open vSwitch commands: ovs-vsctl show, for example, to see the ports and bridges that have been set up in Open vSwitch, or ovs-ofctl dump-flows, where you see the actual flows that have been programmed. All of the OVS commands are worth checking out and knowing what they do.
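To make that concrete, here's a minimal sketch of the namespace commands; the router UUID and the qg- port name below are placeholders, not values from a real deployment:

```
# List the network namespaces Neutron has created on this node
ip netns list

# Show interfaces and addresses inside a router namespace
ip netns exec qrouter-aaaabbbb-cccc-dddd-eeee-ffff00001111 ip addr

# Run tcpdump inside the same namespace, here on the external gateway port
ip netns exec qrouter-aaaabbbb-cccc-dddd-eeee-ffff00001111 tcpdump -ni qg-12345678-9a
```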
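And the OVS side of it, assuming the default ML2/OVS bridge names br-int and br-tun:

```
# Ports and bridges as Open vSwitch sees them
ovs-vsctl show

# The OpenFlow rules programmed on the integration and tunnel bridges
ovs-ofctl dump-flows br-int
ovs-ofctl dump-flows br-tun
```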
One challenge we had was making VPN-as-a-Service work. We ran into a number of different problems there that we had to troubleshoot and fix. These came over time, not all at once, and most of them are fixed in one way or another nowadays.

One of them was that we were running the StrongSwan implementation on CentOS, and the template file for StrongSwan was missing a lot of parameters. It still worked if you created a VPN tunnel between two similar OpenStack installations, since the configurations would be the same, and equally incomplete, on both sides. But if you set up a VPN connection to something else, a VPN appliance for example, it would of course fail. It turned out the missing parameters had been added upstream a few days earlier, so we could just cherry-pick those changes and everything worked fine with those templates.

The next one was a simple NoFilterMatched error: some rootwrap filters were missing for commands the VPN agent needs to execute, so we had to make sure none of them were missing.

The next one is quite interesting. When you did a ping from a router namespace over a VPN connection, you would get a "no buffer space available" error. That error is generally due to the xfrm garbage-collector threshold being set too low. The first finding was that setting that sysctl on the actual host, the network node, does not help, because the value is not inherited inside the namespace. So the first fix was to make the namespace creation code in the VPN agent set that value on creation. The second problem was an actual kernel bug: even with the threshold set to as large a number as you needed, some counters leaked between network namespaces. This was only visible if you had many namespaces, as you do in an OpenStack installation or similar setups. We found the bug report for it, and it was fixed in later versions of the kernel. And then there were some simple fixes for connection statuses that were reported incorrectly; those were just cherry-picks as well.

Another problem that was interesting, and not a bug as such, was that customers intermittently lost connectivity to external networks. The reports said something like: they were losing connectivity for a few minutes now and then. We couldn't find anything in any relevant logs. An important detail is that we were running L3 HA, which will matter in a moment. Another observation was that a ping from the virtual machine, or from the router namespace, fixed the issue for a while, until it came back again.

The troubleshooting here was pretty interesting. We started out with some tcpdumps and saw that traffic was not ending up at the master VRRP instance, the master HA router. So we started looking at the switches and saw that the MAC address table kept changing: in the working case it pointed the router's MAC address at the correct network node, and in the failing case at a different port channel. The root cause was that when IPv6 forwarding is enabled in a router namespace, it subscribes to certain multicast listener groups and will respond to queries sent to them. If the standby router responds, or sends even a single packet out, the switch learns that the MAC address is now located behind the other port channel. We worked around it at the time; it was properly fixed later on, and we found bug reports covering it.

Another thing that's a bit interesting when you have a public cloud is that you get all kinds of customers, and not all of those customers you actually want. Some of them might be looking to do denial-of-service attacks or things like that, so we have to sort out the bad apples: the customers who want to do bad things to other customers or to others on the network. Quality of service, NetFlow and traffic protection all help there, but we've also implemented some scripts that talk to libvirt on the compute nodes, collect the bytes per second and packets per second for all the virtual instances, and analyze and alert on misbehavior. We also push this data to Graphite, so we can easily generate a graph like this one with the top five packet producers.
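The scripts themselves are nothing fancy. Here's a rough sketch of the idea rather than our production code; the awk field positions assume the standard virsh output layout:

```
# Sample per-instance interface counters on a compute node via libvirt.
# Run this twice with a known interval; the counter deltas give you
# bytes/s and packets/s per instance, to alert on or push to Graphite.
for dom in $(virsh list --name); do
  for iface in $(virsh domiflist "$dom" | awk 'NR>2 && $1 != "" {print $1}'); do
    echo "== $dom / $iface =="
    virsh domifstat "$dom" "$iface"
  done
done
```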
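Going back to the MAC flapping issue, two things are worth checking on the network node hosting the standby HA router; the names below are placeholders again:

```
# Is IPv6 forwarding enabled inside the standby router's namespace?
ip netns exec qrouter-aaaabbbb-cccc-dddd-eeee-ffff00001111 \
    sysctl net.ipv6.conf.all.forwarding

# Watch for multicast/MLD replies leaving the external gateway port --
# a single packet is enough for the switch to relearn the MAC
ip netns exec qrouter-aaaabbbb-cccc-dddd-eeee-ffff00001111 \
    tcpdump -ni qg-12345678-9a ip6
```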
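And for the "no buffer space available" problem from the VPN-as-a-Service section, this is roughly what the fix looks like done by hand, on the kernels we were running at the time; the threshold value is just an example:

```
# Setting the xfrm GC thresholds on the host itself does not help --
# they are per network namespace, so set them inside the router's namespace
ip netns exec qrouter-aaaabbbb-cccc-dddd-eeee-ffff00001111 \
    sysctl -w net.ipv4.xfrm4_gc_thresh=32768
ip netns exec qrouter-aaaabbbb-cccc-dddd-eeee-ffff00001111 \
    sysctl -w net.ipv6.xfrm6_gc_thresh=32768
```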
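Finally, the rootwrap fix from the same section is easy to illustrate: a NoFilterMatched error means no filter file allows the command the agent tried to run as root. The entries below show the generic oslo.rootwrap format, not the exact lines we added:

```
# e.g. /etc/neutron/rootwrap.d/vpnaas.filters
[Filters]
ipsec: CommandFilter, ipsec, root
strongswan: CommandFilter, strongswan, root
```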
Another thing that we've seen a lot of problems with is resyncing of the OVS agents and L3 agents. We have seen this after upgrades, after reboots and after OVS agent restarts; it has been very time consuming, and you would see disturbances on the data plane. There was also a kernel bug where creating network namespaces got slower and slower as the number of namespaces grew. That bug was fixed a long time ago, but it really affected the resync. It's much better now than before: a lot of work has gone into preventing data plane interruptions, so it works much better today. L3 HA of course helps here as well; if you have a problem on one node, the router will move over to another network node, and that works.

I'm out of time, so I'm just going to say one word about this: you have to be really careful about MTUs in every part of the stack, hardware and network equipment as well as operating systems and Neutron. There have been some bugs, but a lot of work has been done there too, quite recently, on the MTU handling in Neutron. Basically, the values should match, and different network equipment may use different definitions: some count the headers, some don't, some include VLAN tags and some don't, so you have to be careful and read the documentation of your equipment. Also note that what you set in Neutron is the physical network MTU (global_physnet_mtu in recent releases). If you use VXLAN and set it to 1500, the MTU of the actual interfaces in the namespaces and instances will be 1450, as the tunnel headers eat into it. If you want 1500 there, you have to make sure your physical network can handle larger packets and set the value in Neutron to 1550.

The last slide here is about finding issues before the user does; that's always what you want. You want to analyze the logs, you want alerts on everything that could be interesting, and you want to graph as much as possible, at least to be able to find problems afterwards and to visualize things that are much easier to see graphically. And the most important thing of all: after each failure, analyze why there was a problem, why you didn't detect it, and improve over time. That's really important. Thank you.