Hi, I'm Matt. This is Sean, Kevin, and Ihar. We're here to talk about a topic that has been "interesting" for a very long time: MTU, which has been a cause of massive confusion and problems over several years and quite a few releases. So this time around we decided to figure out, well, why don't we fix the problem in Neutron that we've been having, and take a look at what MTU actually is and why it has become such a problem? It's actually not that difficult, so it's kind of interesting to explain where all this came from and how these problems came about. So, the objectives of this talk. Learn about MTU and the physical network: where does MTU come from? It's been around for a very long time, and it's only now that we have virtual networks that it has become a serious problem. Learn about the nuances, particularly with virtual networks: tunnels, whether they're VXLAN or GRE or whatever, add a little bit of overhead to your packets, and this has been the primary cause of the problem in Neutron. Review some of the confusing MTU options and workarounds, or as I would call them, hacks, in releases prior to Mitaka: some of the ways we've figured out how to get around the MTU issue in certain situations. The big problem with those is that they're indeterminate and very specific to how your setup works. Apply the MTU knowledge to reveal issues in OpenStack; I say OpenStack as opposed to just Neutron because Nova plays a role in this as well. If you recall, there are some bridges on the Nova side that do security groups, and they also have to have the correct MTU. And finally, learn about the MTU solution in Mitaka. So we'll just recap briefly, for background, what exactly a maximum transmission unit is. It is the largest network layer-3 data unit that the underlying data link layer can pass between a transmitter and a receiver. The common standard is 1,500 bytes for Ethernet.
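To make that definition concrete, here is the standard Ethernet arithmetic (illustrative numbers: the 1,500-byte MTU covers only the layer-3 payload, not the Ethernet framing itself):

```python
# The Ethernet MTU counts only the layer-3 payload; the frame on the
# wire is larger because of the Ethernet header and trailing checksum.
MTU = 1500            # max layer-3 payload (the "MTU")
ETH_HEADER = 14       # dst MAC (6) + src MAC (6) + EtherType (2)
FCS = 4               # frame check sequence (CRC)

frame_size = ETH_HEADER + MTU + FCS
print(frame_size)     # -> 1518, the classic maximum Ethernet frame size
```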
However, many devices actually support larger frames, commonly called jumbo frames. Now, when you have a transmitter and a receiver where the packets cross multiple networks, you run into the issue that the transmitter and the receiver may not actually have the same MTU. So at layer 3 there are ways to discover the MTU for a path, so that the transmitter and the receiver can negotiate an MTU and communication can proceed. With IPv4, there was an RFC where they observed that even though packets can be fragmented and then reassembled on the receiver side, the performance was extremely slow, and the loss of a single fragment required a retransmit of the entire packet. So they put together a proposal for what is called path MTU discovery: the sender sets the do-not-fragment bit in a packet, and when the MTU shrinks, whatever hop that was sends back an ICMP packet saying the packet is too large, send me a smaller one, and that message actually includes the MTU of the next hop. Eventually the sender figures out the path MTU and can send all the way to the receiver with no packet reassembly and much better throughput. With IPv6, they removed in-network packet fragmentation entirely; the smallest allowed MTU is 1,280 bytes, and it's the same type of operation: if you hit a hop where the MTU is smaller than the packet you've sent, a message is sent back to the sender, the negotiation happens, and if it is successful, the communication between the sender and the receiver can continue.
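The discovery loop described above can be sketched as a toy simulation (hypothetical code for illustration, not from any OpenStack project):

```python
# Toy simulation of IPv4 Path MTU Discovery. The sender starts at its
# local link MTU; each hop that cannot forward a DF-flagged packet
# "returns" an ICMP Fragmentation Needed message carrying the next-hop
# MTU, and the sender retries with the smaller size.

def path_mtu_discovery(sender_mtu, hop_mtus):
    """Return the MTU the sender settles on for this path."""
    mtu = sender_mtu
    while True:
        # Find the first hop whose link is too small for our packet.
        bottleneck = next((h for h in hop_mtus if h < mtu), None)
        if bottleneck is None:
            return mtu          # packet fits end to end
        mtu = bottleneck        # ICMP told us the next-hop MTU; retry smaller

# A 9000-byte sender crossing a 1500-byte segment settles on 1500.
print(path_mtu_discovery(9000, [9000, 1500, 9000]))  # -> 1500
```

Note that this only works because a layer-3 device emits the ICMP message; as the next slide shows, a layer-2 MTU shrink, or a firewall eating the ICMP, leaves the sender with nothing to react to.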
So this is an illustration of that. If you have a host with jumbo frames set, and there is a part of the network where the MTU shrinks before you actually hit a layer-3 device, the packet is simply dropped, and since there is no layer-3 device where the ICMP protocol is available, the sender never learns what happened to that packet. As an aside, if you have middleboxes such as firewalls that block ICMP, this causes traffic issues as well: if the ICMP control message between a sender and a receiver is blocked, they have no way to adjust their MTU down. And here you can see the case where we do have a layer-3 device in between, with a host that has a 9000 MTU and a host with 1500: as the traffic crosses the boundary at the router, the ICMP packet is sent back to the sender, which then adjusts its MTU accordingly and is able to pass the traffic down to the smaller network. So I'm going to hand it off to Kevin. So inside of ML2 there are different ways that tenant networks can be set up for their traffic to be carried on the real provider or operator network underneath, and this impacts the MTU that the instances will need to use to safely pass traffic onto the network. The easiest one to start with is the flat type: that's just Ethernet passed straight through, so if an instance sends a packet it's passed directly onto a physical interface, and the instance's MTU can be exactly the same as the underlying physical network MTU. And VLAN is similar, because the VLAN tag that's added to the Ethernet frame doesn't impact the maximum payload size that can be sent onto the network, so instances in that case can also use the exact same MTU that's configured on the operator's network.
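One practical way to probe for the blackhole behaviour described above is a ping with the do-not-fragment flag set; the payload size that should still pass is the path MTU minus the IPv4 and ICMP headers (standard header sizes; the command in the note below is the common Linux invocation):

```python
# Payload size to give `ping -M do -s <size>` so the packet exactly
# fills a given MTU: subtract the IPv4 header (20) and ICMP header (8).
def df_ping_payload(mtu):
    return mtu - 20 - 8

print(df_ping_payload(1500))  # -> 1472: fills a clean 1500-byte path
print(df_ping_payload(1450))  # -> 1422: largest fitting a VXLAN tenant net
```

For example, `ping -M do -s 1472 <host>` should succeed on a clean 1500-byte path, while `-s 1473` will either fail with a "message too long" error or be silently blackholed if ICMP is filtered along the way.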
Where this changes is with overlay networks that use encapsulation in a higher-level protocol, like VXLAN or GRE: we take the tenant's Ethernet frames, pack them into an IP packet, and then send that onto the real Ethernet underneath, so you have to subtract all the overhead added by those outer headers and use the result as the MTU inside the instances; otherwise, by the time everything gets packed up, it will exceed the MTU of the underlying network. So here we have a diagram that illustrates this for VXLAN as an example. If the provider network underneath has a 1500-byte MTU, we have to take off the VXLAN overhead, which is comprised of the IP header, the UDP header, the VXLAN header, and then the inner Ethernet header for the tenant's Ethernet traffic. That only leaves 1450 bytes for the instance MTU, so that value has to be advertised, or configured via metadata, on the instances to work on a 1500-byte underlying network. For GRE it's basically the same thing, except it's not based on UDP, so it doesn't have the extra UDP header; the instances get an extra 8 bytes in the 1500 case, because GRE only adds an 8-byte GRE header plus the inner Ethernet header and the IP header. Okay, Matt. So, some interesting observations about the agents: Open vSwitch and Linux bridge are pretty much the two most popular, and one thing we found during testing was that the code basically said one thing and the tests actually revealed another, and I was like, I wonder what's going on here?
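The arithmetic above, per network type, comes out as follows (a sketch using the talk's own numbers for an IPv4 underlay):

```python
# Encapsulation overhead per ML2 network type on an IPv4 underlay.
# VXLAN: outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14) = 50
# GRE:   outer IPv4 (20) + GRE (8)             + inner Ethernet (14) = 42
OVERHEAD = {
    "flat": 0,
    "vlan": 0,      # the 4-byte 802.1Q tag does not reduce the payload
    "vxlan": 20 + 8 + 8 + 14,
    "gre": 20 + 8 + 14,
}

def instance_mtu(network_type, underlay_mtu=1500):
    """Largest MTU an instance can safely use on this network type."""
    return underlay_mtu - OVERHEAD[network_type]

print(instance_mtu("vxlan"))        # -> 1450
print(instance_mtu("gre"))          # -> 1458
print(instance_mtu("vxlan", 9000))  # -> 8950
```

With IPv6 tunnel endpoints, the outer IP header grows by 20 bytes, so each of the tunnel figures shrinks by another 20.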
It turns out that Linux in and of itself manages MTU fairly well. For example, if you take an Ethernet interface with an MTU of 1500 and you create a VXLAN interface on top of it, the VXLAN interface automatically accounts for the extra 50 bytes and sets itself to 1450. Well, this sort of interferes with anything in Neutron that's trying to override that, but it kind of depends. If you're using Open vSwitch, for example, a lot of the components are in Open vSwitch, say on your network node or where your L3 agent runs, but where your instances run there's a Linux bridge doing security groups for you, and that's pretty much managed by the Linux side of things. A couple of notes about Open vSwitch: internally it doesn't really have an MTU, or rather it's arbitrarily large, and you kind of have to treat it as something like 65,000 bytes. That's okay until something leaves Open vSwitch and hits a real interface with an MTU of, say, 1500. So you might be able to send a 9,000-byte packet somewhere into Open vSwitch and it looks like it's fine, but if there's an interface with a smaller MTU in the path, the packet may just mysteriously get dropped. Looking at the Linux bridge side of things: as I said, Linux automatically configures tunnel interface MTUs by subtracting the overlay protocol overhead. The other thing is bridges: a bridge assumes the MTU of the lowest-MTU interface plugged into it. So this is where things get interesting when you're trying to manually set things, or when Neutron sets an MTU when it builds a device: if you later plug something into that device with a different MTU, the bridge MTU changes and Neutron doesn't actually know about it. The other thing Linux allows you to do is have different MTUs on either end of a virtual Ethernet (veth) pair, which is roughly equivalent to having a switch with two different MTUs on its ports and no layer 3 in between. One thing we found was fairly significant: veth pairs would have different MTUs inside Neutron and drop packets mysteriously, so people would have a lot of trouble, saying "I can't find where my packet is getting dropped," and it turns out it was inside a veth pair. So some of you are probably familiar with the ways people have worked around MTU problems in Neutron, and as I said, a lot of this is indeterminate: it may look like it does one thing, but whether you're using tunnel networks or VLAN networks or mixing them in a router, you can wind up with any number of situations. I know there were some talks yesterday about troubleshooting Neutron, and I think a lot of it ended up being MTU issues that were sort of hidden: "oh, my VM worked, and then I plugged another VM in on another kind of network, and all of a sudden nothing works; is it Neutron's fault?" Not necessarily. So a couple of notes here. Neutron lacks obvious and consistent support for MTUs larger than 1500. 1500 is pretty standard and works, but I know a lot of you have 10-gig networks, maybe 40-gig networks, and you've said, oh, I want to provide 9000-byte jumbo frames directly to my instances. How many here have had actual luck doing that?
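The Linux behaviours described above, a VXLAN device defaulting to its parent's MTU minus the encapsulation overhead and a bridge assuming the lowest MTU of its member ports, can be modelled in a few lines (a sketch of the rules as stated in the talk, not kernel code):

```python
# Model of two Linux MTU rules mentioned above (illustrative only):
# 1. A VXLAN device created on a parent interface defaults to the
#    parent's MTU minus the 50-byte encapsulation overhead.
# 2. A Linux bridge takes on the smallest MTU of its member ports,
#    and it changes whenever a smaller-MTU port is plugged in later.

VXLAN_OVERHEAD = 50

def vxlan_mtu(parent_mtu):
    return parent_mtu - VXLAN_OVERHEAD

def bridge_mtu(port_mtus):
    return min(port_mtus)

print(vxlan_mtu(1500))                 # -> 1450
print(bridge_mtu([9000, 9000]))        # -> 9000
print(bridge_mtu([9000, 9000, 1500]))  # a 1500 port drags it down -> 1500
```

The second rule is exactly why an MTU Neutron set at device-creation time can silently change later, without Neutron knowing.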
I don't see a lot of hands, okay, so I guess we were right about that one. And the other thing is that by default Nova creates the security-group bridges and interfaces itself, so even if you got the Open vSwitch side and all your L3 agent stuff to look okay, as soon as that bridge gets created it winds up being 1500 and your packets stop there. The features claiming to address MTU involve confusing and often useless options, and prior to about Kilo we didn't really even claim it was supported. Then Kilo gained a couple of options that were really confusing as to what they actually did: advertise_mtu was in Neutron core, physical_network_mtus was in ML2, path_mtu was in ML2. You'll see that these vary by the kind of plugin or agent you're using, and by whether it's available to your plugin or lives in Neutron, so there are a lot of different options around. veth_mtu only works if you use veth pairs with Open vSwitch, and my guess is most people got rid of those after getting away from CentOS 6 or older kernels. network_device_mtu sort of disappeared: Nova deprecated it even though it did have a purpose, and it was one of the little secrets to getting larger MTUs to work. Only some plugins support the MTU API extension: back in Kilo a value was added to networks indicating what the MTU is, and your instances could read that and figure it out, but if you can't see it, or your plugin doesn't support it, it's hard to know. And then of course the last one: what documentation? That's a serious problem in general with Neutron. So, a couple of things people were doing. Starting around Folsom, which is about when this stuff came about, as a general environment setting people would implement a slightly larger MTU on the physical network, say 1550, and then the VMs could use 1500 and things would be fine, with your tunnel network getting whatever MTU it needed for VXLAN or whatever. But this causes an issue with networks that are not tunnel-based, because all of a sudden they can support 1550, they might try to announce 1550, but nothing else really supports that. So you're essentially moving the problem: your 50-byte issue goes from "my tunnel networks are not going to work" to "my provider networks are not going to work," or something like that; a lot of these fixes just move the problem around depending on your architecture. You can manually configure dnsmasq to provide a smaller MTU to your instances, but this applies to every instance, not just instances on a tunnel network, so now on your VLAN network your instances get the wrong MTU, usually smaller; you're fighting a battle here, essentially. Nova and Neutron both attempt to use network_device_mtu to configure the MTU of virtual network components, which kind of works; we're going to see in a little while where it doesn't. And then for the Open vSwitch plugin, back when it was a plugin before it turned into an agent, if you're using veth interfaces you can use the veth_mtu option, which sort of worked too. There's a lot of "sort of" here. For Kilo and Liberty, new options came in: path_mtu and advertise_mtu. You can mix some of these options together, and the big one was advertise_mtu, because it means the MTU of the network comes off the API value and is passed to your instance via DHCP, so now instances on certain kinds of networks can receive, hopefully, the appropriate MTU for their network. If you don't use DHCP you can configure it via some other sort of metadata; that was the reason for having it in the API, so that you're not locked into using DHCP. There's also RA for those using IPv6. A couple of other approaches attempt to use a variety of these, mixing segment_mtu and physical_network_mtus again, and then mixing more options together. So how many of you have mixed a whole lot of options together and hoped something that worked just came out of it? Yeah. So we're going to take a look at a couple of common use cases, and this is where I ran a
bunch of experiments to figure out what actually happens with MTU under various configurations and various agents. This assumes proper configuration of the underlying physical network. Here's what we do for these tests: we're using VXLAN with IPv4 endpoints, so it's 50 bytes of overhead; if you use IPv6 endpoints it adds another 20 bytes. Cases 1 through 4 only use path_mtu and advertise_mtu, which were the only subtly documented options that came out in Kilo, so those are the ones I think most people were trying to figure out how to use. Cases 5 and 6 also show what happens with network_device_mtu, because it changes the field a little bit. So here's Open vSwitch with a 1500-byte MTU. The way the colors work: stuff that looks green is working, stuff that looks red is not working, and there's also yellow, where it might work depending on the situation. Keep in mind that different services, different things using the TCP or UDP stack, have better or worse ways of determining the maximum packet they can send. Ping, for example, unless you tell it otherwise, always sends a tiny packet, so ping always seems to work. How many here have had cases where ping works great and nothing else works in their VMs? Security groups are good, everything looks good, you're like, what's wrong? You do an SSH in debug mode and find it gets stuck in key exchange. Yeah, that one. That's the MTU problem; you've hit it somewhere. So with advertise_mtu = true, in the 1500 case your VM or instance gets a 1450, which makes sense, and everything else in here, assuming 1400, just magically works. Looking at the Open vSwitch agent with a 9000-byte MTU: well, you'll notice the Linux bridge over there that does security groups for the instance is still 1500, so you essentially have a layer-2 discrepancy in your MTU between 8950 at the instance and 1500 at the Linux bridge, so your packets would mostly make it all the way in there. You'll notice at the bottom here there's the yellowy color at the router namespace: even though there's an MTU discrepancy here, and it kind of looks like it's happening at layer 2 because you're eventually going through a switch (you might remember OVS has this magically large MTU), the way those interfaces in the router namespace are created with OVS, they're not veth pairs, they're kind of just a patch port moved into a namespace, and this will actually emit the MTU path-discovery packets. I don't know exactly why, but it does. So if you have a large packet coming in, the router namespace will actually kick it back and say, you can't do that; however, if you have a large packet originating from the instance, it will get dropped. So this is the Linux bridge agent with a 1500-byte MTU, and you would think this would just work too. Well, it turns out it actually doesn't, not in all cases; there's a red section there. You'll notice it in the router namespace (Linux bridge uses veth pairs all over): you get a 1500 on the qr interface, and as soon as traffic gets into the Linux bridge it has a 1450, so you could run into potential problems there. Has anyone here actually run into problems trying to even use 1500 with Linux bridge before? I know it comes up with various protocols. So here's what happens with 9000: you can see that basically the interfaces inside namespaces simply don't have their MTU touched, they stay at 1500, because those interfaces are not connected to any real physical device; they're residing in their own namespace, and Linux doesn't handle the MTU calculations for them. So you can see that packets are going to get stuck somewhere between the Linux bridge that goes to the external interface and the Linux bridge on the inside. So the other case was network_device_mtu: what happens when you set this? If you set it to 9000, you'll notice there's an MTU discrepancy that shows up between your instance and the Linux bridge doing your security groups
that's still there, and you can tweak things a little bit to try to get around it; like I said, some people will just bump up their physical network MTU to get around the 50-byte limitation. And then there's also an issue with the OVS switch (when I say OVS components I mean all the OVS bridges, I just wrap them into one) when it connects to the tunnel interface it has to deal with: your packet might make it through, and OVS says, well, I'm going to add a tunnel header to it. So the packet comes in at 9000, OVS tries to add 50 to it, now you get 9050, and your network doesn't support 9050, so you've added a header that the tunnel interface is going to reject. Linux bridge has sort of a similar problem. You'll notice that network_device_mtu does impact the interfaces inside namespaces, so that was a good thing, but we still have the same problem with the Linux bridge itself, going towards the instances, having 8950 on it. I'm going to pass it over to Ihar to describe how we're fixing this. So, as we have seen, we have multiple configuration options, and we have some code to handle MTU, but it doesn't actually help in all cases. So we looked at what we have and set a goal: you should not need to modify all those configuration options if you have a standard setup that just uses the standard Ethernet MTU size. Also, if you need to change it, for example to support jumbo frames, then you should be able to determine just a single value, set it in the configuration file, and be done with it. And another thing: once you have this MTU calculated for your network, it should actually be applied to the whole L2 data path the traffic traverses, so that it actually works. In Neutron we already had code that calculates MTU for virtual networks; the problem is that it was not actually enabled. You would need to change the MTU value in the configuration file, and if that was not done, your networks effectively got an MTU value of zero, which in several places in the code just disabled the feature, including the advertisement of MTU to instances, which is not nice. Another part of the problem is that, as was already mentioned, Nova participates in setting up the data path, so even if we applied the calculated MTU just on the Neutron side, it's not enough, at least for jumbo frames. So that's another part of the work: the calculated value should be propagated to Nova, and then Nova itself should use this value to update the devices it creates. Also, as some of you may know, at the moment the Nova and Neutron communities are working on a library to handle virtual interfaces called os-vif. It's not yet adopted by either of those projects, but there is a plan to do that, so to make sure we don't regress in the future, this library should also be updated in a similar way to what we did on the Nova side. One problem we had with configuration options on the Neutron side is that the options controlling MTU calculation were actually ML2-only, so if you wanted to implement a plugin and reuse the value set in the configuration option, you didn't actually have a proper way to access it. So one thing we did is move the existing segment_mtu option from the ML2 configuration file into the common one; now, if you are the author of a plugin, you can properly access it and implement your own kind of MTU calculation. We also changed the default value for MTU to 1500, which should work for most cases, right? Another problem with the default values was that even though the mechanism to advertise MTU to instances was in place, it was, again, disabled for some unclear reason, so we enabled it. So that's another configuration option you don't need to touch in Mitaka. Support for IPv6 advertisement was introduced, because before, the only mechanism we had was the DHCP option, which is IPv4-only and so not enough for IPv6: the router advertisement packets we send were expanded with
the MTU information, so now you can boot an IPv6-only instance and it will still get the proper MTU value. There was a bit of a debate on whether this single option is enough. Obviously, in 80% of cases you have a single MTU for your underlying network, and you don't want different virtual networks using different MTU values for the underlying physical infrastructure. But we already had some options to influence that, options you could use, for example, to have different MTUs for different physical networks, or a different MTU for tenant networks that are carried through tunnels. Those options were already in place: path_mtu and physical_network_mtus. In the end, since we couldn't be sure that no one actually uses or relies on them, we decided to leave them in just in case, but to actively discourage people from using them: if you start having different MTUs for different physical networks, it makes your troubleshooting harder, and there's no clear reason why you would even want that. But the options are still there. Now, we've been talking about how we solved everything in Mitaka, but actually we haven't completely reached the goal we set; there are some glitches in the implementation. One is that we said we wanted a single configuration option that you set in a single place and you're done. Well, in Mitaka you actually still need to set two configuration options to the same value: one is the intended one, global_physnet_mtu, but you still need to set the same value for the path_mtu option if you use the ML2 plugin. It's just a glitch in the code; you need to be aware of it, and in Newton it will be solved. Another problem we have is that even though the MTU calculation mechanism is in place and now enabled by default, we don't actually apply it to existing networks. So if you already have network resources that were created back when the default values did not correspond to the actual physical infrastructure, those network resources have MTU equal to zero, and then your DHCP agent or router is not able to advertise the proper value to instances, and we never recalculate the value. At this moment, if you want to switch to the new world and you already have resources, the only safe way is to recreate the resources. You can obviously get into your database and manually update the column values there to the proper MTU values, but obviously that's not exactly safe. We have it planned to be tackled in Newton: we will probably run this calculation mechanism not just when we create the resource but every time you fetch the network. So that's all cool: Mitaka, Newton, everything is sort of fixed. The question is, who is using Mitaka right now? Okay, one, two. So the obvious question is what we are going to do with the previous releases, because not everyone wants to switch to Mitaka right now. We looked at the set of patches we came up with to fix the MTU issues, and we identified the patches needed to fix the problem; there were actually not that many, like four or five. We still have some reviews up, not merged; they span Nova and Neutron, but we plan to land them in the next minor stable releases for Liberty. These patches mostly touch the MTU handling on the data path, but they do not change the default configuration values, because that is against the upstream stable policy. That means that even with those patches you would need to set some more configuration options in your configuration files: specifically, you should enable the advertisement of MTU, you should set the segment_mtu option to reflect the MTU of your physical network even if it's just 1500, and you should make sure that you unset network_device_mtu on both the Neutron and Nova side, because those are in conflict with the new approach. And again, if you have
existing resources, you would probably want to somehow handle the problem that the MTU column is not really updated; again, be cautious. As for earlier releases, Kilo and before, we don't plan any kind of work; we don't plan to backport anything, so you are on your own. You can obviously still go and try to identify the patches that you might backport, and it may even work, but honestly, you probably just want to upgrade. The next steps we plan for Neutron: obviously, we need to tackle this problem with existing resources, and also the old options should be deprecated, cleaned up, and removed, on both the Nova and Neutron sides. Not just network_device_mtu: probably advertise_mtu should also go away, because there is no clear reason not to have it enabled. And finally, since the os-vif library is going to be introduced, we should make sure that it's properly adopted and that MTU still works; we'll see. The last thing to note: the period when Neutron was not playing nice with MTU lasted so long that a lot of deployment tools came up with their own hacks. They configured dnsmasq manually to pass certain MTU values, and those hacks should be removed as soon as possible; we will work with some of those tools to do that. That's it. So I just want to conclude a little bit of what we talked about. The main problem is that you can't change MTUs over layer 2. When we moved over into the Neutron side of things, all of a sudden we were having all of these issues where veth pairs had different MTUs (a veth pair is essentially a layer-2 thing), or bridges, for example, and we were causing packet loss in various places. So by a sort of experimentation, looking at the code and figuring out where all the problems were, we realized we weren't setting the MTU everywhere it needs to be set, including where there needed to be a change in the database. And you've seen from the explanation of what's going on in Mitaka that we have fixed the actual underlying problem. The options aren't necessarily as clear as we want them to be, but for 90 or more percent of the networks out there, you should have one underlying MTU, say 9000 on all your physical devices, and you tell Neutron: I have a 9000 MTU on my physical devices for you. Unfortunately, we have to undo several releases' worth of options that are confusing and not necessarily implementing what they should have, so it's going to take a little longer than we thought to get this down to where most people use a single option. And for the oddball cases there will be options that have not only good descriptions and documentation but names that perhaps make sense, so you know what they reference. For example, MTU versus segment_mtu: how many noticed that there's a difference between those, and what it actually means? I didn't think anybody knew; we didn't really know either, so that was a whole lot of fun to fix. So there's still a lot of work to be done, but even if you have to set some crazy values now, the MTU problem is resolved: you can use jumbo frames inside Neutron and with your VMs. So, questions. Don't throw tomatoes at me. Alright, thank you for the presentation. Just one critique and one question; I'll start with the critique. If OpenStack were only for users to run on their laptops, this conference would not be here, it would be in the basement of somebody else's house. So this is addressing the enterprise and carrier space, and saying "oh, seriously, upgrade to Mitaka"? Seriously, the lifecycle management of a release is a minimum of 12 to 16 months for an upgrade. That's minimum. We have sites at two tier-one operators still running Grizzly, imagine that, and you say upgrade to Mitaka, which was just released this month? Come on, be realistic, this is not possible. Okay, so there has to be a way to patch back at least the necessary stuff. I'm not saying patch back to Grizzly, but at least think about two releases back, minimum. That's my critique. This is a good
point. Now the question: MTU is a big problem, especially for VMs running as a cluster as a back-end, where there's an overlay network adding overhead, with GRE for example 24 bytes, killing us all. How about things on the provider network side? Because when you deal with the tenant network, you have some sort of control over the network fabric within the data center, but when you go with provider networks, it's more tied to the real network fabric of the whole IT infrastructure, where you control nothing but have to learn what the MTU is end to end. Do you have any best practice to apply on the provider network side? Thank you. Well, it depends on the type of provider network you're using, and I think we'll probably be contributing some docs to the networking guide around best practices for this. Correct, Matt? And then, I guess, the other thing: I do sympathize with your previous point about releases; especially in the telecom space, things are supported for 18 months and so on and so forth. There have been many conversations in the OpenStack community, especially on the dev mailing list, about extra stable releases that are maintained. That's bubbled all the way up to the TC, those discussions are ongoing, and I certainly encourage you to voice your opinion on that. However, OpenStack itself moves at an incredibly fast pace, and I understand that you're worried about that, but there's only a finite amount of resources in the community to maintain releases or develop new ones, and a strategic choice has been made to do more releases and maintain only a small subset as stable. Another thing to touch on: if you're simply using provider networks, so VLAN or flat going straight into your compute nodes, a lot of these issues don't apply to you, and you can get away with using 9000-byte MTUs in those cases. It's when you get tunnel networks involved, VXLAN, GRE, all that stuff, that things get a little weird, and that's what we're trying to address here, because that's where most of the problems came in. I'll agree with Sean: it's definitely difficult to backport everything when it comes to long-term releases; that's a whole other topic that's probably been argued a whole lot. Frankly, MTU should have been solved many, many, many years ago, but I can't turn back time. Any other questions? Looks like we're out of time, I think. Alright, well, thanks.