Good afternoon, everyone. I hope everybody's ready to sit down and learn a little bit about Neutron, and we'll see if we can't figure it all out together this afternoon. My name is Phil Hopkins and I work for Rackspace, obviously, with the slide up there, and I get to travel around the world teaching OpenStack.

One of the most frequent questions that comes up is: "I got OpenStack all up, but how do I make the networking piece work?" Or: "I spun up a VM, I looked at it, and it didn't get an IP. Now what do I do?" How many of you have had this problem? Or: "I tried to ping out. What do I do?"

So we're going to spend the next 40 minutes on that. This is a distillation of a three-day class, so we clearly aren't going to do three days' worth of work in 40 minutes, but we're going to take a distillation of that and look at how to troubleshoot Neutron problems. And if the gods of presentations smile upon me today, we're actually going to do this in demo form as opposed to slides, which will hopefully be more interesting. If they don't, I have the slides here as a backup.

So let's talk about the troubleshooting process. Before we jump into how to troubleshoot Neutron, I want to talk about how to troubleshoot, period. The problem I see most often, whether it's somebody posting a problem on IRC or people I've worked with time and time again, is this: they see a couple of symptoms, or somebody tells them something is broken, they think about it for a second, and they immediately jump in and try to fix something. They tweak this; well, that didn't fix it, so they jump over and tweak that. Give it five minutes.
You've got a system where they've tweaked so many knobs there's no way of getting it back to the beginning. So this troubleshooting process is very important, and it implies you need a couple of things.

One is an understanding of how the system you're trying to troubleshoot works. If you don't have that, you need to go read the great manual, because you've got to understand what the system should be doing before you can tell the difference between that and what it's doing right now. So take the time to understand the technology, to understand what's going on behind the process.

Then collect data. Typically a customer or somebody comes in with a problem report, and you get some terse thing. You need to go investigate. Can you reproduce it? Can you get the results they were getting? If so, what does that imply? Have you looked at your system and gathered data about it? If this is happening, what other things should be happening? Look to see where things are working and where they aren't, and find out where the disconnect is.

It's only at that point, when I've gathered enough data, that I can start considering what could cause the problem and what the solutions could be. Again, so many people jump in and start twisting knobs before they fully understand what's going wrong. You need to take the time to gather that data. We're going to do that today with a problem I've set up here, where the VM doesn't get an IP. What do I do? In a few minutes we'll look at that.

It's from gathering that data that we get the background to consider the causes and the solutions. Based on that we should be able to say, "I think it's this." Then we should gather more data to see whether that really is the thing that's not working right. Only then do we say, "OK, I've convinced myself that's the problem; if I tweak this knob, it should fix it." We go tweak the knob, and it's still broken.
What do we do next? Well, the first thing we should do is set the knob back where it was and go collect more data. I can't tell you how many times I've made that same mistake: "Well, it's not that, I think it's over here," and we break it so badly that it takes hours to fix a simple problem. So go through this problem-solving process, and spend the bulk of your time gathering data. If you do that, you'll find that fixing the problem is relatively straightforward.

Now, sometimes we've gathered data and gathered data and we still don't know what's going on. With OpenStack Neutron, the place to go is the IRC channel. There are a lot of folks willing to help out there, and if you've really read the manual, you've really tried to understand the problem, and you're still stuck, there are a lot of very knowledgeable people there who will help you. So again, remember this process. Take a chart like this, etch it in your brain, and remember: gather data, gather data, gather data, before you ever start twisting knobs.

So let's talk about OpenStack networking. I put up this big chart here, and I want us to spend a few minutes on it, because this is a block diagram of how Neutron actually wires things up on the compute node. If you go to, I think it's the administration manual, there's a slightly different view of this, but it's a similar diagram that you can look at to see what the flow is. Let's look at this case.
We've got three VMs and two tenants up here, so two tenants on two separate networks, with the VMs plugged in here. In this case the VMs are going into Open vSwitch, but what are these little blocks between the VM and Open vSwitch?

Well, something changed in OpenStack as it went from Folsom to Grizzly and on: the ability to use something called Neutron security groups. Neutron security group rules are applied on the compute node, between the VM and Open vSwitch. In Folsom the VMs were plugged straight into Open vSwitch, and if you understand the Linux stack at all, a user-space program plugged into a user-space program never hits iptables. So we needed to cause the traffic to go through the local kernel's iptables so that we could filter at that point and control data going in and out of the VM. Specifically, we want to make sure that if some bad person is on a VM, they can't do things like spoof their MAC address. If I filter anywhere but right in that path between Open vSwitch and the VM, they could potentially do bad things to that machine.

So we added something here: the tap interface is plugged into a Linux bridge, and that Linux bridge then has a veth pair with one side plugged into the Linux bridge and the other into Open vSwitch. That forces all the traffic coming from each VM to go through iptables rules. The first thing the Neutron security group rules do is filter that traffic. Specifically, they make sure that traffic to and from the VM is addressed with the MAC address that OpenStack thinks that VM should have and the IP address that OpenStack thinks that VM should have. So if you go changing the IP address on your VM, or try to spoof your MAC address, I guarantee the traffic is going to leave that VM, enter the iptables rules, and hit a drop rule. It's not going to go anywhere.
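As a rough illustration of that anti-spoofing check, here is a minimal Python sketch. It is a toy model and not Neutron's code: the `PortRecord` and `Frame` types, the function name, and the addresses are all made up for illustration, and the real iptables chains also have to let the VM's DHCP requests out, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortRecord:
    """MAC and IP that OpenStack has on file for one VM port (hypothetical model)."""
    mac: str
    ip: str


@dataclass(frozen=True)
class Frame:
    """Only the source fields of a frame leaving the VM."""
    src_mac: str
    src_ip: str


def egress_verdict(port: PortRecord, frame: Frame) -> str:
    """Return "ACCEPT" if the source MAC and IP both match the port record, else "DROP"."""
    mac_ok = frame.src_mac.lower() == port.mac.lower()
    ip_ok = frame.src_ip == port.ip
    return "ACCEPT" if mac_ok and ip_ok else "DROP"


if __name__ == "__main__":
    # Example addresses are invented for the sketch.
    port = PortRecord(mac="fa:16:3e:00:00:01", ip="10.0.0.5")
    print(egress_verdict(port, Frame("fa:16:3e:00:00:01", "10.0.0.5")))   # matches the record
    print(egress_verdict(port, Frame("fa:16:3e:00:00:01", "10.0.0.99")))  # IP changed inside the VM
    print(egress_verdict(port, Frame("de:ad:be:ef:00:01", "10.0.0.5")))   # spoofed MAC
```

The point of the model is only that the check sits on the VM's own path and compares against what OpenStack recorded, not against what the guest claims.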
In other words, don't do that. It's one of the things we need to do for security.

So what happens in a normal traffic flow? Traffic comes out of the VM, through the Linux bridge, through the iptables rules, and into Open vSwitch. In Open vSwitch, in a normal configuration, we have the integration bridge, which all the ports from the VMs plug into, and that's connected with a patch connection, if you please, to either a tunnel bridge or a bridge for VLANs. I'm showing one for GRE tunnels. If we were doing VXLANs it would look just like this, and it would look very similar if we were doing VLANs.

To keep the networks separate, data coming in from the VMs gets tagged with a VLAN tag inside Open vSwitch. If you notice, on your left, my right, the two VMs that are on the same network come in and their packets are tagged with a VLAN tag. That tag is calculated on the fly as the Neutron Open vSwitch plugin agent starts up, based on all the networks it sees. In this case the network on my right, your left, was given VLAN tag 2, and the other one VLAN tag 3. So as the data comes into those Open vSwitch ports, it gets a VLAN tag, and then it goes down to the tunnel bridge.

The tunnel bridge has flow tables, and the principal thing those flow tables do is encapsulate the packets, giving them a GRE tunnel key based on the network. Each network gets a different GRE tunnel key, which keeps all the network data separate. The traffic comes out of that, goes into the network, and eventually should connect to the network node on the other end. On the network node, if you notice, there are lots of colors here, but it looks very simple. The traffic comes in to the tunnel bridge flow tables.
Now we'll strip off the GRE tunnel key, or the VXLAN ID or the VLAN ID, and tag the packet with an internal Open vSwitch VLAN. Eventually it goes to the integration bridge and, based on its internal VLAN tag, to the proper DHCP namespace or router namespace for that network. We can see in this case we have both of those: VLAN tag 2 for one network and its DHCP namespace, and the other tag for the other network and its DHCP namespace. The VLAN tag guarantees that the network traffic is kept separate for the DHCP requests.

Now, by the way, does anybody understand how a DHCP request goes? What exactly happens there? We even need to know the bottom-layer packet details to understand how things are handled in Neutron. If you understand a DHCP request, the VM does what? It sends out a broadcast packet that will eventually be received by the dnsmasq process. dnsmasq sends a unicast packet back in response to that broadcast, and that's an IP address offer. It says, "I got your request, I see your MAC address, here's a free IP address for you." Are we done yet? No. The VM comes back with another broadcast packet that says, "I see this, I want this IP address." That goes to the dnsmasq process, your DHCP server. The DHCP server gets that request for its offer and sends an acknowledgment, a unicast packet, back. That causes the VM to say, "OK, I now have my IP address, I can set that up and go with it."

So we actually have four packets being exchanged. It's important to understand that, and it's also important to realize that two of those packets are broadcasts, the ones coming from the VM, and the returns from the DHCP server are unicast.

Now, any other communication starts with what? Say I'm pinging another machine. We send out an ICMP request, right?
Actually, the first thing we do is send out an ARP request, which is again a broadcast, and then we get our response back, and then we can start sending. The ARP request is a broadcast and the response is a unicast. Then we start sending out unicast ICMP requests. That's very important, because as we look at how the flow tables work in Open vSwitch, we need to understand exactly how this process works.

I want to make another couple of comments here as we troubleshoot. How many of you still use ifconfig to find the IP address of your machine? OK, take that hand that's up there and tell yourself, "Don't do that." If you haven't noticed in the man pages, ifconfig is deprecated. With the advent of the iproute2 package, a number of commands are deprecated: ifconfig, route, netstat, all these commands we're familiar with and love to use. Even the arp command is replaced by one of the ip set of commands.

In this case you need to understand ip address and ip route, and there's ip netns for network namespaces. Since we use network namespaces heavily in Neutron when you're using overlapping IPs, you have to understand how network namespaces work, so you're going to have to use the ip commands and get used to them. The last one is ip neighbor, which is the replacement for the arp command. If you haven't used it, one of the nice things ip neighbor shows that the arp command does not is whether an entry is active or stale.

Now, we can tweak it and there are other things that vary it, but typically, I send the ARP request and I get the reply back; how long, by default in the Linux kernel, before that ARP entry goes stale? Anybody know? Well, I'll give you the answer: it's 60 seconds by default. And if it goes stale and I then need to contact that machine, what happens there? That's part of understanding Neutron, troubleshooting Neutron.
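To make the point about ip neighbor showing entry state a little more concrete, here is a small Python sketch that picks the state out of neighbor-table output. The sample text is invented for illustration (it only imitates the general shape of ip neighbor output, with the state as the last word on each line), and the function name is hypothetical.

```python
from typing import Dict, List, Optional

# Invented sample, shaped like neighbor-table output: address, device,
# optional link-layer address, and the entry state as the last field.
SAMPLE = """\
10.0.0.1 dev eth0 lladdr fa:16:3e:aa:bb:01 REACHABLE
10.0.0.3 dev eth0 lladdr fa:16:3e:aa:bb:03 STALE
10.0.0.9 dev eth0 FAILED
"""


def parse_neigh(output: str) -> List[Dict[str, Optional[str]]]:
    """Turn each non-empty line into a dict with ip, dev, lladdr and state."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        entry: Dict[str, Optional[str]] = {
            "ip": fields[0],
            "dev": None,
            "lladdr": None,
            "state": fields[-1],
        }
        # "dev" and "lladdr" are keywords followed by their value.
        for key in ("dev", "lladdr"):
            if key in fields:
                idx = fields.index(key)
                if idx + 1 < len(fields):
                    entry[key] = fields[idx + 1]
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for e in parse_neigh(SAMPLE):
        if e["state"] == "STALE":
            print(f"{e['ip']} is stale; expect an ARP request on next use")
```

In practice you would feed it the captured output of the real command rather than the sample string.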
You need to know how packet flow works and what happens in the Linux kernel, because it's through this process that we figure out what's going on. If you don't understand these basics, you're going to get tripped up by how all the flow tables work. And what happens, by the way, if that entry gets stale? The kernel will still send the packet out, because it still sees the entry, and immediately thereafter it will send an ARP request as well. So as an entry goes stale, the machine tries to communicate, but it also sends an ARP request. That's very important, as we'll see when we get into the flow tables.

Let me also mention that iptables is something critical to know in troubleshooting Neutron, and to know how to use. There are a couple of options we tend not to use. I have a dash dash v on the slide; that should be just a dash v. The -v option is very helpful because it gives the packet statistics for every packet that matched a rule, so we can use a watch command to see how packets are progressing through the various iptables chains if we suspect that's where they're being lost. The --line-numbers option is also helpful because it gives the order of the various rules within a chain. There are other commands again, ping, host: all these commands are important to remember because we're going to need them in troubleshooting Neutron.

Also, if you're not familiar with Open vSwitch, you need to spend some time familiarizing yourself with it. Get it up and running, and use these commands to see what kind of data we can get out about what's going on inside it, because most Neutron implementations end up using Open vSwitch. You might be using Linux bridge, and there are a few others, but most of us are using Open vSwitch in our implementations today. There's the ovs-vsctl command, with show to see the basic configuration. There's ovs-ofctl, with dump-flows so we can see the flows in there. We can also use show there, because you'll find that ovs-vsctl gives us names on our ports,
but in the flows it uses port numbers, so we end up having to do that mapping back and forth. So ovs-ofctl show on the particular bridge you want to see is very critical to get that mapping, so we can figure out what data is going where. Then ovs-appctl also allows us to dump flows, but the other nice thing about it is that fdb/show lets us see which MAC addresses the various bridges have learned.

Take the integration bridge, for example, in the case we showed up there where two VMs side by side plug into the same network. When traffic is going from VM to VM on that same machine, it can go straight into Open vSwitch on one port and immediately out the other one. It doesn't have to go through the tunnel bridge flow tables, out to a GRE tunnel, and back in. It just gets switched locally, and part of that is because of the way Open vSwitch learns the MAC addresses on the same VLAN.

My voice is struggling today, so please forgive me. Lastly, Open vSwitch allows something called port spanning or port mirroring, depending on what your background is. It lets us echo the data from any one port, either data going into the port or out of it, to another port. What we can do is create a veth pair, plug one end into Open vSwitch, mirror certain ports to it, and use tcpdump on that mirror to see what data is going into those ports. It's another tool in our toolbox to see what's going on in Open vSwitch, because this is the black box that we tend to get confused by. So it's used to mirror traffic from selected ports, and it's very useful for debugging.

Let's get into the flow tables in Open vSwitch, then. A change happened in the flow tables from Grizzly on, or maybe in Havana, I forget which one. Prior to that, we just used table 0 for everything. The problem with doing that is that
sometimes we had to populate that table with all the return paths, and that could get very complicated. I'll explain what I mean in a moment. What we now have is flows divided across a bunch of different tables.

If we look at the rules, what we care about is the part right before actions. If you notice the very first rule, where it says duration 57594, and the line below that, the next one says in_port=3. It's going to take in_port=3 and resubmit that; where it says resubmit(,2), that's going to go to table 2. And if we look at the next line, that one gets resubmitted to table 1. We have a couple more that go to table 2. So from table 0, packets are going to go to either table 1 or table 2.

Let's hit table 1. Table 1 is basically packets coming from outside Open vSwitch: they came in on a GRE tunnel and typically need to go to a VM. If you recognize what the first rule is, it matches on broadcast packets, so any outside broadcast packet is going to match rule one and get resubmitted to table 21. The second rule matches on unicast packets, which get resubmitted to table 20. So remember this: table 21, which we'll get to in just a second, handles broadcast packets, and table 20 handles unicast packets.

The packets that came in from VMs get injected into table 2 here. And again, if you notice, these packets from the VMs... let me see. This says table 20, but this is going to go to table 10. Actually, let me correct an error on the screen.
I have tables 1 and 2 the wrong way around: table 2 is actually the traffic from the outside, and table 1 is the traffic from the VMs, because those are the ones that go to tables 20 and 21. I just caught an error on my slide that I've got to get fixed. The packets coming from the outside are all going to go to table 10, and table 10 is an important table that got added to Open vSwitch; I'll talk about what it does in just a second.

Packets coming from the outside are matched based on their tunnel key. If you notice, the rule matches on a tunnel ID, the outside tunnel ID the packet came in with. In this case we have tunnel IDs 1 and 2, and they get modified to the internal VLAN for Open vSwitch. So we take that packet, strip off its GRE tunnel key, give it its internal VLAN ID, and submit it on to table 10. Table 3, you'll see in there, is not used.

Table 10 does an interesting thing. If you notice the third line under table 10, it says actions, and it says learn. What table 10 does is inject a rule into table 20 to create a return path, so that when the VM responds to that packet, whatever it is, there's a return path set in the Open vSwitch flow table. Basically, when a packet comes in from the outside world, at that moment we cause a flow table entry to be created that allows the VM to respond and sends the response back to the proper place. If we look at this, it's actually learning the MAC address, the GRE tunnel endpoint, and the tunnel key ID, so we can wrap all of that correctly and send the reply just to the proper GRE tunnel endpoint, not to all GRE tunnel endpoints, to minimize our traffic. So this one causes a rule to be entered in Open vSwitch, and it sets a timeout of 300 seconds. 300 seconds is what?
Five minutes. Remember that the ARP entry stays active for 60 seconds before it goes stale. This guarantees that the rule stays in there longer than the ARP table entry. So if a VM puts a broadcast out with an ARP request, the response to that ARP request causes the return path from that machine to be populated in Open vSwitch, so the VM can continue to communicate. That rule stays in there for five minutes, longer than the local machine's ARP entry lasts. So we guarantee that the rule stays in place for a VM trying to talk to something: the return packet comes in, and we know how to talk back.

Next, check table 20. Again, it has actions, and it guarantees that unicast packets are sent to their proper spot. If you notice, the very first rule in table 20 that I have up there is one of the rules that got populated by a packet having come in from the outside. You see a lot of similar information, but where it's added, the very first part says vlan_tci: it learned the VLAN the packet came in on, it learned the MAC address it came from, and it learned the tunnel ID it came in on. So it populates a rule that guarantees that a packet for that MAC address goes to the one GRE tunnel endpoint it needs to go to and gets encapsulated with the proper tunnel key.

Lastly, table 21 handles broadcast packets. In this case I have not turned on L2 population, which means that broadcast packets are sent to all GRE tunnel endpoints for that network.

I don't have time to go deeply into Open vSwitch rules, and I'm going to run out of time if I'm not careful, so we need to go on. I know there are lots of questions. Read the manual: it turns out that the ovs-ofctl man page has a lot of good information about reading the OpenFlow rules. Spend some time there; it will be your friend, and you'll be well served by learning it.

A couple of slides real quick. If you want to set up a mirror in Open vSwitch, you create the veth pair with ip link add, and then you set the endpoints up. You then add one end of that veth pair into the internal bridge. The integration bridge, I mean; I do that all the time. The next command actually sets up the mirror on the bridge. What it's going to do is take the packets on those ports, eth1, patch-tun, and br-int, and in the last line, where it says create mirror, it says that packets with those as the source and packets with those as the destination are going to be reflected to my output port, which is one end of my veth pair. Then I can set up tcpdump to listen on the other end of that veth pair, which would be veth1, outside Open vSwitch, and I would see all the traffic going into and out of those ports. You can lather, rinse, and repeat, or change the ports as appropriate, but this will set it up. The last command lets you clear that mirror off the bridge.

There's a command that a lot of folks are not aware of.
It's very useful, and it's called the neutron-debug command. neutron-debug is documented in the CLI manual, so I'm not going to spend much time on it. Where it's handy is that I can actually spin up a DHCP namespace and a probe namespace without having to spin up a VM, which can be real handy for checking Neutron's basic functionality. It's a way of testing whether my Neutron piece is working right before I bring up a VM. It also has another command called ping-all, which lets me either ping every known VM that I have for a particular tenant, or give it a particular network ID and just ping all the assigned addresses on that network. It comes back with the results, and again it's a way of testing communication.

Now, realize that neutron-debug is a superset of the neutron command, so if you type neutron-debug help, you're going to see all the hundred-and-forty-some-odd neutron commands plus the neutron-debug commands. You'll probably find that's not helpful, so you're probably going to need to pipe that into grep for probe or ping to see the neutron-debug commands you want. But the CLI manual does document them.

To troubleshoot Neutron, there are some things we need to do before we jump in. We need to figure out the MAC addresses and IP addresses of the various VMs that are involved. Why? Because, again, Layer 2 and Layer 3 communication is what lets us see what's going on here. We need to know what the DHCP server and router IP addresses and MAC addresses are. That way, when we run tcpdump on the packets, we'll know what's coming and going where. The same goes for our data network on the nodes. In other words, if we're using GRE tunnels or VLANs or VXLANs, what are the endpoint IPs that they should be communicating on?
I strongly recommend that when you configure Open vSwitch you separate your data plane and your control plane. It makes life easier and keeps your data traffic separated from your control traffic, and it also lets us look at purely the data plane traffic alone.

We need to look at whether the problem is universal across my Open vSwitch setup or localized to one tenant. What protocols are being used? Like I said, even basic communication uses broadcast and unicast packets. Is this an L2 or an L3 problem? Well, if I spin up a VM and it doesn't get an IP, it's an L2 problem, because I haven't established basic L2 communication, so that's where we need to look.

Examine, locate: we need to look carefully at what's happened and take the time to troubleshoot through this very carefully. We typically don't spend enough time there. Then we need to isolate it to a tenant, a network, a VM, or a compute or network node. Where is the problem, and where is it broken?

At that point, once we've collected data, consider the causes. We may find we need more data, and go through the process again. What are the solutions? Test. Only at that time do we start adjusting things, and one thing at a time, folks. Resist the temptation. I know, because every time I succumb to it: "Oh, I know what it is now," and I start adjusting. I know when I do that it's the wrong thing, and sure enough, I prove every time that it's the wrong thing. One thing at a time, and go back.

Keep a log of what you've tried. That's real important too. I tend not to keep logs, I'm sloppy in that regard, and then I get down into this process, wondering what to try, and I find out I've tried the same thing two or three times before. I'm getting old, my memory goes, and those things happen, but keep a log.
It makes your life easier.

A couple of other things before we look at an example. I have one more slide here, and then we'll get into the example; I've got a few minutes for that. Traffic levels: one of the things Open vSwitch supports is sFlow. Network nodes can get bogged down with too much traffic, and sending traffic data to an sFlow collector is real helpful for watching for that.

So at that point, let me shrink this down and let's look at a real VM. I have a VM right there, and I'm going to go into it; hopefully it will work. If we come down to the bottom, you can see I'm logged in to CirrOS. I do an ip a, and look what happens here: no IP. We've all been there. So I'm going to do a sudo su -. I know this is the thing everybody tells you not to do, but we're going to do it here. We're going to run udhcpc, and I'm going to give it a dash uppercase T of one second and a dash a of one second. What this does is go out and send DHCP request packets on my network.

While that's running, let's go to the compute node, which is this machine right here. Let's do a tcpdump. Actually, let's back up: let's do an ip a and look at what we have here. I'm going to take this interface right here, because that's the interface that's plugged into Open vSwitch, and do a tcpdump on it. I like the -e option because it lets me see the Layer 2 communication. If we watch, sure enough, we see broadcast packets only going out from our VM, but no return, which is not surprising. That's what we expect in this case.

So tcpdump tells us what's going on. Let's do something else. Let's do a watch, and in quotes I'm going to put an ovs-ofctl dump-flows. Thank you; I wasn't watching what I was typing. I can't type today. Now, why would I do this?
Well, let me show you one of the things that happens when you dump the flows of the tunnel bridge. We can't see them all, but notice the duration and the number of packets on each flow. Now I can see which flows the packets are hitting. If I scrolled through and made it bigger, you could see all the flows, but look at this line right here, which is the broadcast rule: I see its packet count incrementing. Now watch very carefully, and remember tables 1 and 2 from before: table 1 is the packets coming out from my VMs, and table 2 is the packets coming back. None of my table 2 rules ever increments. So I know my packets are going out through the tunnel bridge in Open vSwitch, and nothing is coming back.

So let's kill this. If we want to do one more thing here just to look at it, I think eth1 is the one, and sure enough, eth1 has my packets going out: you can see the packets leaving Open vSwitch, leaving my compute node, and yet nothing coming back. You can see those are BOOTP/DHCP requests, and you can also see that tcpdump tells me it's a GRE tunnel packet with a key of 1, which is what I expect. OK, so things are working there the way I'd expect.

Let's look at the network node. I think that's this one here. Yep. So let's do a tcpdump -n -e -i on eth1. Are the packets coming in? Yep, they're coming in. Good. Now we can do the same kind of thing we did before with ovs-ofctl. Let's do a watch; I wanted to hit the equals key there. I watch the counters, and sure enough, I can see the packets coming in and matching the various rules. It looks like what should be happening is happening. So let's do an ip netns.
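The counter-watching trick from the compute node can be sketched in a few lines of Python: take two snapshots of the flow dump and report which tables saw their packet counts grow. This is only an illustration under assumptions. The two sample dumps are invented (they imitate the `table=` and `n_packets=` fields of a flow dump, with table 1 for VM traffic, table 2 for tunnel traffic and table 21 for broadcasts, as in the talk), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict

# Invented snapshots, a few seconds apart, shaped like flow-dump lines.
BEFORE = """\
 cookie=0x0, duration=100.0s, table=1, n_packets=10, n_bytes=3420, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,21)
 cookie=0x0, duration=100.0s, table=2, n_packets=0, n_bytes=0, priority=1,tun_id=0x1 actions=mod_vlan_vid:2,resubmit(,10)
 cookie=0x0, duration=100.0s, table=21, n_packets=10, n_bytes=3420, priority=1,dl_vlan=2 actions=strip_vlan,set_tunnel:0x1,output:2
"""
AFTER = BEFORE.replace("n_packets=10", "n_packets=14")

N_PACKETS_RE = re.compile(r"n_packets=(\d+)")
TABLE_RE = re.compile(r"table=(\d+)")


def packets_per_table(dump: str) -> Counter:
    """Sum n_packets per table; lines without a table field count as table 0."""
    totals: Counter = Counter()
    for line in dump.splitlines():
        packets = N_PACKETS_RE.search(line)
        if not packets:
            continue
        table = TABLE_RE.search(line)
        totals[int(table.group(1)) if table else 0] += int(packets.group(1))
    return totals


def growing_tables(before: str, after: str) -> Dict[int, int]:
    """Return {table: packet increase} for tables whose counters went up."""
    b, a = packets_per_table(before), packets_per_table(after)
    return {t: a[t] - b[t] for t in sorted(a) if a[t] - b[t] > 0}


if __name__ == "__main__":
    print(growing_tables(BEFORE, AFTER))
```

With these samples only tables 1 and 21 grow, which is the same picture as in the demo: broadcasts are leaving through the tunnel bridge and table 2, the return direction, stays flat.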
Let's get our network namespace here. Let's grab that, and then let's look at the IP addresses inside that network namespace. I'm going to see this tap interface, so let me grab that, and then come back. Oops; there's the right key. tcpdump. Now, when you're working inside a network namespace, use the -l option. If you don't, you'll start it and see no packets, then you'll hit Ctrl-C and only then do the packets show. If you want them to show in real time, you need the -l option. We'll grab that interface right there and watch, and sure enough... wait a minute. The DHCP packets are coming in. My DHCP server should be running. Let's see what's going on here.

Let's check something. I'll do a ps aux and grep for dnsmasq. Hmm. No dnsmasq process running. Well, the dnsmasq process is my DHCP server. Let's try something here: restart the DHCP agent process. Now let's go inside my network namespace and see what's happening. Well, I'm not getting any DHCP traffic. Why not? I'll show you why not. We see ARP requests, and that's a good sign. Let's go back and look at my VM here. Oops: it got its IP. As soon as that udhcpc process gets its IP, it stops.

So in this case we took a situation where I spun up a VM and it could not get an IP. We tracked the request packet through Neutron to the endpoint. We saw that it wasn't getting an address from the DHCP process, because why? The service wasn't running. We restart the agent, the service comes up, and we get an IP address. Simple, right? Yeah, when we know how Neutron works. Sometimes.

We're about at time; I think we're over time, so I need to cut it off here. He says he's going to drag me off stage. Hopefully this is useful to you.
Hopefully you've seen a few of the basics of troubleshooting Neutron. We went through a real-life process, and I appreciate your time. If you have questions, come on down to the front and I'll answer what I can, because they're going to have to set up for the next person.
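As a recap of the demo's method, following the DHCP request hop by hop until it disappears, here is a small Python sketch. It is a hypothetical helper rather than anything from Neutron: the hop names and the function are made up, and each observation stands for "did tcpdump or the flow counters show the request at this point?"

```python
from typing import List, Optional, Tuple


def locate_break(
    observations: List[Tuple[str, bool]]
) -> Tuple[Optional[str], Optional[str]]:
    """Walk the hops in path order.

    Returns (last hop where the request was seen, first hop where it was not).
    The second item is None when the request was seen at every hop, which
    points at whatever should have answered at the end of the path.
    """
    last_seen: Optional[str] = None
    for hop, seen in observations:
        if not seen:
            return last_seen, hop
        last_seen = hop
    return last_seen, None


if __name__ == "__main__":
    # What the demo found: the request showed up at every capture point.
    demo = [
        ("VM interface on compute node", True),
        ("eth1 on compute node", True),
        ("eth1 on network node", True),
        ("tap interface in DHCP namespace", True),
    ]
    last_seen, missing_at = locate_break(demo)
    if missing_at is None:
        print(f"Request reached {last_seen}; check the service that should reply")
    else:
        print(f"Request lost between {last_seen} and {missing_at}")
```

In the demo every hop saw the request, so the helper would point at the end of the path, which is where the missing dnsmasq process turned up.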