Yes, no, both will do. Welcome again. This talk is about DON, which stands for Diagnosing OVS in Neutron. How many people in this room think they know Neutron networking very well? Okay, a few hands. Good. My hand would not have been raised for that question. I'm fairly new to the OpenStack environment, and my background is in networking. So I was trying to play around with Neutron networking, and every time there was an issue, I simply didn't know how to debug the system. There is some help on the OpenStack pages, and if you do a Google search you'll get tons of information. But as the keynote speaker said on the first day, you shouldn't need a PhD to run it, and it sure does feel like you need a lot of knowledge just to figure out whether simple things are working or not. So we took this opportunity to automate some of those steps, so that the basic stuff can be understood very easily. Why did we choose OVS? There are multiple plugins, of course, but OVS is still by far the most commonly used one, and there is far more information on the internet about OVS than about any of the other plugins.

I promised to keep as many slides as possible as pictures; the pictures might have words on them, but still. This is the network schematic of a typical OpenStack installation. There is the management network, which the OpenStack components use to talk to each other; its addresses are not exposed to the outside world. There is the data network, in green, where the data communication between the VMs happens. There is the external network, so that people outside, on the internet, can reach the system. And there is the API network, which allows tenants to actually use the service; it is also exposed to the outside world. In many cases the external network and the API network may be the same. OVS runs in two places: wherever you see the Neutron plugin agent in the schematic, that is where the OVS agent runs. So even though the title of the talk is about debugging OVS, we found that for very basic checks we also need to look at other parts of the system.

This next picture is straight out of the troubleshooting guide. You can see there are around nine devices or ports through which a packet has to flow before it can leave the node. For example, it starts at the VM, vm01, with its Ethernet interface. It goes into a tap interface, vnet0. From there it goes onto the veth pair, qvbXXX and qvoXXX, then to the integration bridge, then the physical bridge, and so on. Not all of these are present in every setup, but everything up to the integration bridge, the red box you see there, is almost always present. The results I'm going to show were generated on a single-VM DevStack setup, so most of the time the Ethernet bridge is not in operation. This is just the compute node, and you don't have to pay too much attention to it; it's all available on the OpenStack page. The same kind of picture exists for the network node, and the network node actually has another picture that I'll show.
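As a concrete illustration of the compute-node path just described, here is a minimal sketch of checking that the per-VM plumbing exists and that the qvo port carries a local VLAN tag on br-int. It assumes the usual Neutron OVS hybrid naming (tap, qbr, qvb, and qvo devices sharing the same port-ID prefix); the helper and the example port ID are illustrative assumptions, not DON's actual code.

    # Sketch: verify a VM's tap -> qbr -> qvb/qvo -> br-int plumbing.
    # Assumes standard Neutron OVS hybrid naming; illustrative only.
    import subprocess

    def run(cmd):
        """Run a shell command and return its stdout as text."""
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

    def check_vm_plumbing(port_prefix):
        """port_prefix: the leading characters of the Neutron port UUID."""
        tap, qbr, qvb, qvo = ('%s%s' % (kind, port_prefix)
                              for kind in ('tap', 'qbr', 'qvb', 'qvo'))
        links = run('ip link show')
        for dev in (tap, qbr, qvb, qvo):
            print(dev, 'present' if dev in links else 'MISSING')
        # The qvo end should be a port on br-int with a local VLAN tag.
        print(qvo, 'tag:', run('sudo ovs-vsctl get Port %s tag' % qvo).strip())

    # check_vm_plumbing('8c4dfc82-53')   # hypothetical port-ID prefix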
These are just the devices running on the network node. There is also the concept of namespaces on the network node. Namespaces exist so that isolated L2 domains can operate with overlapping IP addresses, so the addresses of one L2 domain don't clash with those of another. This picture shows the namespaces. Again, if you go through the document, then depending on your familiarity with Linux networking and with OVS networking, some of this might be obvious to you, but most of it will not be. And for a person who is not interested in, or not knowledgeable about, networking, this is pure magic: you don't know what's going on here. You don't even know whether all of it is supposed to be working; some of it might not be working and your system might still be up. So troubleshooting this is non-trivial.

Again, I'd ask anyone interested to go to the OpenStack documentation. There is a troubleshooting chapter; chapter 12 is entirely about troubleshooting. If you look at those steps, there are a ton of them, and even if you know them all, doing them manually takes quite a bit of time. If your topology is large, there is simply no way to do it manually, unless you know exactly which nodes and which ports you want to look at. So DON, in a nutshell, automates the basic troubleshooting steps. Most of them are outlined in the troubleshooting guide, but until now we haven't seen anything that automates them or makes them easy and quick for users.

Okay, so hopefully the demo is going to make an offer that nobody can refuse. I'm going to show a demo, and as a matter of fact I have pre-run a lot of the steps, just so that we can go into more detail about how the output looks and what exactly I do to get it. This is the familiar Horizon dashboard, where I have a very simple setup: two private networks, the orange and the green, each with two VMs running on it. Then there is another VM, VM 4, which is connected to both the orange and the green, so it has interfaces on both private networks as well as the public network. And there are two routers, each connected to the public network and a private network. You don't need two routers; I just wanted to see whether the system works fine with two routers present.

By the way, DON is written in Python and is easy to integrate with Horizon. When you run the script, what you immediately get is a view like this, with the compute node on the left and the network node on the right. I'll zoom back in so that it's easier for you to see. It gives a view of the entire system: first the VMs are lined up, then for each VM you see which networks it is on and its IP address on each network, then which tap device and which qvb device it connects to on the Linux bridge, and then the corresponding devices on the integration bridge. You saw the first picture I showed from the troubleshooting guide.
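How might such a view be assembled? A minimal sketch of the chained collect-and-parse pattern, using the standard ovs-vsctl commands, is below: list the bridges, then the ports on each bridge, then each port's local VLAN tag, and accumulate the results. It is illustrative only, not DON's actual code.

    # Sketch: build a small topology map by chaining ovs-vsctl queries.
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

    topology = {}
    for br in run('sudo ovs-vsctl list-br').split():
        topology[br] = {}
        for port in run('sudo ovs-vsctl list-ports %s' % br).split():
            # "[]" means no local VLAN tag is set on this port
            tag = run('sudo ovs-vsctl get Port %s tag' % port).strip()
            topology[br][port] = tag
    print(topology)   # data of this shape can drive the diagram and colouring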
It's basically that same picture, replicated automatically to show the exact configuration of the system as it stands right now. The color coding tells you which ports belong to the same VLAN; all of the greens, for example, use the same VLAN, which is VLAN tag 3. There are configuration errors in which you do have an interface on a particular network but the tag is incorrect. Here that would pop up in a very intuitive manner; it would show up in red or something of that sort. By the way, this has not been jazzed up with any JavaScript, but you can right-click and get even more details. To produce it, DON goes into the system, runs a lot of commands, parses the results, takes those results and runs follow-up commands, parses those as well, and finally condenses everything into this one picture.

Similarly, there is the network node view, where the external bridge has the two routers with their qg interfaces, which connect through the qr interfaces to the integration bridge. If you notice, there are tap devices that don't have anything coming out of them; those are actually connected to the DHCP agents, which are not plotted here, though they could be. So the network node view is router specific, whereas the compute node view is VM specific. You can imagine that if you have a very complicated topology, this picture would be quite big, and more importantly, producing it manually would be next to impossible.

Now, once we have the picture, the trouble with giving people pictures is that the moment they have them, they want to know more. Initially just getting the picture is difficult, but once you hand it over, people want more, which is natural. So then we run some basic analysis tests: some OVS tests and some ping tests. The ping test is probably the most basic thing you would do, and once ping doesn't work, you'll probably want to run the OVS tests. And we think we found an OVS reporting bug, because we see no functional impact, but we do see certain issues. In the picture here, only three tags are being used, three colors basically: purple, green, and, I think, blue. The OVS test does one particular check: it flushes the MAC table, then tries to learn a MAC on a particular tag and port. It sends a pseudo packet in on a particular port and verifies that the MAC was learned on that port. Then it sends another packet addressed to that learned MAC and checks whether it goes out on the learned port or not. Very basic switch behavior, right? Here you see a JSON-format output of everything it has done: the commands, the output, and so on.
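A minimal sketch of that MAC-learning check, using the standard ovs-appctl tools, is below. The -generate flag makes ofproto/trace actually inject a packet so that learning takes place. The bridge name, port numbers, VLAN tag, and MAC addresses are illustrative assumptions, not DON's actual code.

    # Sketch: flush the MAC table, learn a MAC by injecting a packet, then
    # check which port a packet addressed to that MAC would go out on.
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

    BR, IN_PORT, OTHER_PORT, TAG = 'br-int', 5, 6, 3   # hypothetical values
    SRC, DST = '00:00:00:00:00:01', '00:00:00:00:00:02'

    run('sudo ovs-appctl fdb/flush %s' % BR)            # 1. flush MAC table
    # 2. inject a packet so SRC gets learned on IN_PORT with this tag
    run('sudo ovs-appctl ofproto/trace %s in_port=%d,dl_vlan=%d,'
        'dl_src=%s,dl_dst=%s -generate' % (BR, IN_PORT, TAG, SRC, DST))
    print(run('sudo ovs-appctl fdb/show %s' % BR))      # SRC should be listed
    # 3. send a packet *to* SRC from another port and inspect the output port
    trace = run('sudo ovs-appctl ofproto/trace %s in_port=%d,dl_vlan=%d,'
                'dl_src=%s,dl_dst=%s -generate' % (BR, OTHER_PORT, TAG, DST, SRC))
    print([line for line in trace.splitlines() if 'actions' in line.lower()])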
This also tells you exactly which commands were run, so that once you see something showing a problem, you can go in and run them manually yourself. As of now, most people would find it difficult even to figure out which commands to run to check whether MACs are being learned correctly. At the end it reports a failure: the packet was forwarded to incorrect port 15. As far as we can tell, this seems to be a reporting error; we have seen no functional impact. Pritesh, the co-speaker, couldn't be here because of visa issues. He is an active OVS contributor, and he is following up with the OVS community to figure out whether this really is just a reporting error. So as a result, the OVS tests fail miserably according to this output, even though the system is actually running fine.

For the ping test, remember there were two routers. For each router it pings between all pairs of VMs and all pairs of interfaces, so you can imagine it's an N x N test. Most of the ping tests pass, but certain ones fail, and if we take a closer look, we see that the ping to one particular IP address always fails. So this gives you a very basic idea of how the system looks: are all pings passing, which is the most basic test you can do, and we see that some are not.

Now that we have this information, the obvious next question is: if I do a ping, where does it stop? Why it isn't reaching may be a harder question, but at least let's find out how far it gets. So there is a ping trace button you can click. For example, this is a ping trace between two interfaces connected to the same network, and these are all the devices that light up: it starts from 10.0.2.3, goes through the tap device, the qbr Linux bridge, qvb, qvo, and all the way back up the other side. How is this picture generated? We start from the connection diagram we already generated, then run tcpdump on each of these ports and verify that the particular packet that was sent is seen on each of them. In this case it's a successful ping, and it stays within the same private network, so nothing on the network node is lit up.

Next we do a ping test between two IPs on different private networks, private one and private two. The left-hand side of the picture is very similar, but now some devices on the network node light up as well, showing the path the packet actually takes. You can see that it comes in with VLAN tag 2 and is then sent onto VLAN tag 3, from where it goes out to the other network. Both of these are examples where the ping works fine. Finally, we have an example of a scenario where there is an issue. Remember 10.0.3.6, which we were not able to ping. We do the same ping tracing on it.
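The ping-trace mechanism itself is simple. A minimal sketch of the idea is below: capture on every interface along the expected path while the ping runs, then report which interfaces saw the ICMP packets. The interface names and IPs are illustrative, and in practice the ping would be issued from inside the source VM or its namespace; this is not DON's actual code.

    # Sketch: run tcpdump on each interface on the path, run a ping, then
    # mark each interface green (packet seen) or red (packet not seen).
    import subprocess, time

    def trace_ping(src_ip, dst_ip, interfaces, count=3):
        captures = {}
        for dev in interfaces:
            captures[dev] = subprocess.Popen(
                ['sudo', 'timeout', '10', 'tcpdump', '-ni', dev, '-c', '5',
                 'icmp and host %s and host %s' % (src_ip, dst_ip)],
                stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
        time.sleep(1)
        # In a real setup this ping would be launched from the source VM.
        subprocess.run(['ping', '-c', str(count), dst_ip])
        time.sleep(2)
        for dev, proc in captures.items():
            out, _ = proc.communicate()
            print('%-16s %s' % (dev, 'GREEN' if dst_ip in out else 'RED'))

    # trace_ping('10.0.2.3', '10.0.3.6',
    #            ['tap1234abcd-ef', 'qvb1234abcd-ef', 'qvo1234abcd-ef'])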
We see all the colored nodes: the ones where we expect to see the packet and do see it are green, and the red nodes are where we stop seeing it. So something happened. The packet gets all the way to this point; basically, everything works fine within VLAN tag 2, but somehow it is not being forwarded from VLAN tag 2 to VLAN tag 3. Now we might want to ask why it didn't forward, but this talk is not going to go into that; I haven't reached the stage of automating that. But at least we know how far it got, so we know which node to log into, debug there, and check whether the tables look correct. And all of this processing takes less than a minute. There is no magic here; it is exactly what the troubleshooting guide asks you to do when there is an issue, just automated. The purpose is this: once you run it, check the result. If everything is green, go ahead with whatever you want to do with the network. If things are not working, run it, see where the reds are, and then you know where to debug. In a large setup it is very difficult even to figure out where you should start debugging. DON simply automates that for you.

That is most of the demo I had, and this is my final slide before I take questions. The takeaway is that this approach is extensible to any plugin. It is very simple stuff: it runs commands, parses the output, analyzes the results, and runs further commands based on the results of the earlier ones. It targets OVS right now because of its obvious popularity, but I hope others can provide something similar for other plugins. And we strongly believe this should be part of the standard distribution, so we are in the process of contacting the Neutron PTLs and others, going through the blueprint process, and integrating it with Horizon. You should expect a Horizon tab that just says something like diagnose, verify, or validate, shows you some pictures, and says either that things are fine or that you should look at a particular node because it doesn't seem to be behaving correctly. So that's about it. If you have questions, feel free.

Would you mind using the mic? I think it's good. OK. Where do we get it? Oh, OK. You can't get it today; we are in the process of open sourcing it. Our goal is to get it in through the OpenStack releases, but if that takes time, we might have a version available from our GitHub before that. Oh, OK, maybe I can show you: you can just search for network troubleshooting on OpenStack; it's chapter 12. I know it by heart by now. There is a lot of very good information there; it basically told me what to automate.

Hey, really nice talk and a really cool tool, thank you for this. Think of all the hours spent debugging this stuff; it's great that we can just get a picture. So thank you. The question I have is that there are some cases, at least that I've run into, where packets disappear inside OVS. You can't run tcpdump there, when you're inside the switch itself. So is the only way of dealing with that looking at the flow tables and understanding what's going on there? Yes.
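For context on that answer, looking at the flow tables usually means dumping the OpenFlow rules and tracing a specific flow through them. A minimal sketch, assuming the standard ovs-ofctl and ovs-appctl tools and using illustrative field values, might look like this; it is not part of DON.

    # Sketch: dump br-int's OpenFlow rules, then trace one packet through
    # them to see which rule it hits and what the final actions are
    # ("Datapath actions: drop" means the packet is dropped).
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

    print(run('sudo ovs-ofctl dump-flows br-int'))
    print(run('sudo ovs-appctl ofproto/trace br-int '
              'in_port=5,dl_vlan=2,dl_src=00:00:00:00:00:01,'
              'dl_dst=00:00:00:00:00:02'))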
Do you have any thoughts on improving that without, you know, someone gaining expertise in OpenFlow and so on? So, for example, this particular OVS test did go through all of that, checking whether the switch is learning properly and so on. If the OVS test succeeds and you still find that packets are being dropped, I don't currently know how to debug that. These kinds of tests look into the OVS tables and verify that the switch behaves according to those tables, and beyond that this tool doesn't do anything. Whatever a human can figure out by running commands, parsing the output, and reasoning that this should have matched that, that kind of logic is automatable. But if something else goes wrong, then I don't even know how I would debug it as a person, and you are in for a really tough time. Thank you.

Hi, two quick questions. One, do I need Cisco equipment to run this? No. All right, thank you. And the other one, would this work with alerting, for example SNMP traps or even emails? So this is written in Python and Django, and there is no Cisco proprietary stuff at all; it's running on an Ubuntu VM on my laptop, there is nothing Cisco-made in it. When you say alerting, do you want it to receive alerts and report on them, or to send alerts? Send alerts, actually. Oh, send alerts when it finds issues. Of course, you could extend it to do that; I haven't looked into it yet. I'm expecting that if we go through the blueprint process for OpenStack, we will get more functionality that actually helps the ops people. Right now this is more of a cool thing to do, but it doesn't really help operations: nobody is going to come and press validate every now and then. They want an alert or an email saying that something has gone wrong. And a quality alert, not a thousand false alarms. Yes, exactly. You tend to get a lot of alerts, and it's a real challenge to consolidate them into, say, one root cause. Thank you.

Extending on the answer you gave about troubleshooting dropped packets: you might want to look into dropwatch. It's a tool that tells you exactly where a packet has been dropped inside the kernel. And for the issue you outlined, where the port number doesn't match: what I could imagine is that OpenFlow ports and datapath flows are getting mixed up. The trace output reports datapath port numbers, and maybe you're comparing those against the OpenFlow port numbers. That might be an explanation for why it shows up as a failure. Okay, yeah. That's the kind of thing I ask Pritesh about; he's my obvious oracle, so to speak. He said he'll look into it and let me know whether this is an actual error or whether I'm doing something wrong. Obviously, this is a newly written tool; it will have its own bugs, and we'll refine it as we go along. Any more questions? Well, thank you. Thank you for coming to the talk and surviving till Thursday.
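As a footnote to that last point: OpenFlow port numbers and kernel datapath port numbers are two separate numbering spaces for the same interfaces, so comparing one against the other can make a correct forwarding decision look like a failure. A minimal sketch of printing both numberings side by side, assuming the standard OVS CLI tools and not taken from DON:

    # Sketch: list OpenFlow port numbers (used in flow rules and ovs-ofctl
    # output) and datapath port numbers (used in the "Datapath actions" of
    # an ofproto/trace) so the two spaces are not confused with each other.
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

    # OpenFlow numbering: lines like " 5(qvoxxxxxxxx-xx): addr:..."
    print(run('sudo ovs-ofctl show br-int'))
    # Kernel datapath numbering: lines like "  port 15: qvoxxxxxxxx-xx"
    print(run('sudo ovs-dpctl show'))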